Attribute extraction — automatically predicting colour, material, fit, and style from a product image — is one of the highest-value applications of computer vision in retail. Getting the training data right is 80% of the work.
Why Attribute Models Need Specialised Data
Generic ImageNet-pretrained models know what a "shirt" is. They do not know that this shirt is slim-fit, cotton-blend, and navy-striped. To learn those distinctions a model needs thousands of labelled examples per attribute per category — a much larger and more structured dataset than category classification requires.
Our retail AI dataset is purpose-built for this: each image is tagged with up to 12 attributes drawn from a controlled vocabulary, covering 8 top-level product categories.
Designing a Multi-Label Attribute Schema
Common taxonomies use a two-level hierarchy (category, then attribute group), with each attribute group holding a closed set of values:
Category: Tops → Attribute group: Fit → Values: [slim, regular, oversized]
Category: Tops → Attribute group: Neckline → Values: [crew, v-neck, turtleneck]
Each image carries one value per attribute group (or null if the group does not apply). Store the annotations as JSONL, one object per image, to avoid the column explosion of a flat CSV once the attribute count grows beyond 20.
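As a rough illustration of the JSONL layout described above, the sketch below writes two image records, reads them back, and checks each value against a controlled vocabulary. The field names (image_id, category, attributes), the SCHEMA dictionary, and the validate_record helper are assumptions made for this example, not the actual ImageHub schema.

```python
import json

# Hypothetical controlled vocabulary: category -> attribute group -> allowed values.
SCHEMA = {
    "tops": {
        "fit": ["slim", "regular", "oversized"],
        "neckline": ["crew", "v-neck", "turtleneck"],
    },
}

def validate_record(record, schema=SCHEMA):
    """Return a list of problems found in one image record (empty if valid)."""
    groups = schema.get(record.get("category"))
    if groups is None:
        return [f"unknown category: {record.get('category')!r}"]
    problems = []
    for group, value in record.get("attributes", {}).items():
        if group not in groups:
            problems.append(f"unknown attribute group: {group!r}")
        elif value is not None and value not in groups[group]:
            problems.append(f"invalid value for {group}: {value!r}")
    return problems

records = [
    {"image_id": "img_0001", "category": "tops",
     "attributes": {"fit": "slim", "neckline": "crew"}},
    # None is serialised as JSON null: the attribute group does not apply.
    {"image_id": "img_0002", "category": "tops",
     "attributes": {"fit": "oversized", "neckline": None}},
]

# Serialise: one JSON object per line.
jsonl_text = "\n".join(json.dumps(r) for r in records)

# Parse back and validate each line independently.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
errors = {r["image_id"]: validate_record(r) for r in parsed}
```

Because every line is a complete object, adding a new attribute group only adds a key to the records that use it, and a malformed line can be reported without discarding the rest of the file.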
Handling Class Imbalance in Attribute Data
Attribute distributions are more skewed than category distributions. "Black" typically outweighs "yellow" by 15:1 in fashion datasets. Approaches that work:
- Focal loss during training (no data changes required)
- Stratified sampling at batch level
- Targeted data acquisition — request specific attribute/category combinations from ImageHub's custom catalog
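The first two approaches can be sketched in plain Python. The block below is a minimal illustration, assuming the standard focal loss form, -alpha * (1 - p_t)^gamma * log(p_t), applied to one example's predicted probabilities, and using inverse-frequency sampling weights as a simple stand-in for batch-level stratification. The function names and toy labels are hypothetical; a real training loop would normally use the loss and sampler utilities of its deep learning framework.

```python
import math
from collections import Counter

def cross_entropy(probs, target):
    """Standard cross-entropy for one example: -log(p_t)."""
    return -math.log(max(probs[target], 1e-12))

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)**gamma * log(p_t).

    probs  : predicted class probabilities (already normalised)
    target : index of the true class
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def sampling_weights(labels):
    """Per-example weights inversely proportional to class frequency."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

# Toy colour labels at the 15:1 ratio mentioned above.
labels = ["black"] * 15 + ["yellow"]
weights = sampling_weights(labels)

easy = [0.95, 0.05]  # confident, correct prediction when the true class is 0
hard = [0.30, 0.70]  # poor prediction when the true class is 0

# Fraction of the cross-entropy loss that survives the focal down-weighting.
easy_ratio = focal_loss(easy, 0) / cross_entropy(easy, 0)
hard_ratio = focal_loss(hard, 0) / cross_entropy(hard, 0)
```

With gamma set to 2, the well-classified example keeps a far smaller share of its cross-entropy loss than the poorly classified one, so abundant, easy values such as "black" contribute less to the gradient. Passing the weights to random.choices(labels, weights=weights, k=batch_size) draws "black" and "yellow" with roughly equal probability, because the fifteen "black" weights sum to the single "yellow" weight.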
Combining Multiple Data Sources
Mixing data from multiple retailers improves attribute model robustness but introduces domain shift. Mitigate this by:
- Normalising the annotation vocabulary across sources (map synonyms to canonical values)
- Recording source_domain in metadata and using domain-adaptation fine-tuning if needed
- Holding out one retailer's data as a cross-domain validation set
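A minimal sketch of the vocabulary normalisation and hold-out steps follows. The synonym map, the retailer names, and the record fields other than source_domain are invented for illustration, and domain-adaptation fine-tuning is out of scope here.

```python
# Hypothetical synonym map: raw retailer label -> canonical value.
CANONICAL = {
    "navy": "navy", "navy blue": "navy", "dark blue": "navy",
    "crew": "crew", "crew neck": "crew", "round neck": "crew",
}

def normalise(value, mapping=CANONICAL):
    """Map a raw label to its canonical value; null (None) passes through."""
    if value is None:
        return None
    key = value.strip().lower()
    if key not in mapping:
        # Fail loudly so unmapped synonyms are reviewed rather than dropped.
        raise ValueError(f"no canonical value for {value!r}")
    return mapping[key]

def split_by_domain(records, held_out):
    """Hold out every record from one source_domain for validation."""
    train = [r for r in records if r["source_domain"] != held_out]
    val = [r for r in records if r["source_domain"] == held_out]
    return train, val

records = [
    {"image_id": "a1", "source_domain": "retailer_a", "colour": "Navy Blue"},
    {"image_id": "a2", "source_domain": "retailer_a", "colour": "navy"},
    {"image_id": "b1", "source_domain": "retailer_b", "colour": "dark blue"},
]

# Normalise first, so both splits share one vocabulary.
for r in records:
    r["colour"] = normalise(r["colour"])

train, val = split_by_domain(records, held_out="retailer_b")
```

Splitting on source_domain rather than at random keeps all of the held-out retailer's images out of training, so the validation score reflects performance on an unseen domain.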
Getting Started
Download a free ImageHub sample to inspect our attribute schema and annotation coverage. For production-scale attribute training data, contact us for a custom order.