Product Image Dataset Best Practices for Machine Learning Teams

Best practices for preparing, labelling, and validating a product image dataset for ML training: class balance, deduplication, annotation schemas, and quality checks.

A poorly prepared product image dataset is one of the most common reasons ML projects stall after promising prototypes. This post covers the practical checklist used by data teams at ImageHub when preparing training batches for clients.

1. Define Your Annotation Schema First

Before you label a single image, answer:

  • What is the primary label per image? (category, product type, attribute?)
  • Are labels mutually exclusive or multi-label?
  • What metadata columns does your training pipeline expect?

Retrofitting a schema onto an already-labelled dataset is expensive and error-prone.
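One way to make the schema concrete is to write it down as code before labelling starts. The sketch below is illustrative only: the field names (`image_id`, `category`, `source`) and the category list are hypothetical, not ImageHub's actual schema, and it assumes one mutually exclusive primary label per image.

```python
from dataclasses import dataclass

# Hypothetical schema: these names and categories are assumptions for illustration.
ALLOWED_CATEGORIES = {"clothing", "jewellery", "footwear"}
REQUIRED_COLUMNS = ("image_id", "category", "source")


@dataclass(frozen=True)
class Annotation:
    image_id: str
    category: str  # single primary label (mutually exclusive classes)
    source: str    # metadata column the training pipeline is assumed to expect


def parse_annotation(row: dict) -> Annotation:
    """Validate one metadata row against the schema and return a typed record."""
    missing = [col for col in REQUIRED_COLUMNS if not row.get(col)]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    if row["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {row['category']!r}")
    return Annotation(row["image_id"], row["category"], row["source"])
```

Running every incoming row through a check like this surfaces schema drift at labelling time rather than at training time.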

2. Enforce Class Balance

Most real-world product catalogs are heavily skewed: clothing can outnumber jewellery by 20:1. A model trained on that raw distribution will tend to underperform on minority classes. Two strategies help:

  • Oversample minority classes: repeat minority-class images, applying augmentation so the copies are not identical
  • Cap majority classes: set a maximum per-class sample count and randomly sample each larger class down to that cap

Our product image dataset ships with per-category counts in the metadata CSV so you can apply either strategy without an extra inventory pass.
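Both strategies can be combined in one pass. The following is a minimal sketch, assuming the dataset is available as `(image_id, label)` pairs; the function name and the `cap` and `floor` parameters are hypothetical. Repeated minority images are only duplicated here, so augmentation would still need to be applied when the images are loaded.

```python
import random
from collections import defaultdict


def cap_and_oversample(items, cap, floor, seed=0):
    """Balance (image_id, label) pairs.

    Classes larger than `cap` are randomly downsampled to `cap`.
    Classes smaller than `floor` are padded with random repeats up to `floor`.
    Assumes floor <= cap.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    by_label = defaultdict(list)
    for image_id, label in items:
        by_label[label].append(image_id)

    balanced = []
    for label, ids in sorted(by_label.items()):
        if len(ids) > cap:
            ids = rng.sample(ids, cap)  # cap the majority class
        elif len(ids) < floor:
            # oversample: keep every original, then add random repeats
            ids = ids + rng.choices(ids, k=floor - len(ids))
        balanced.extend((image_id, label) for image_id in ids)
    return balanced
```

For example, with 200 clothing images and 10 jewellery images, `cap_and_oversample(items, cap=50, floor=30)` returns 50 clothing rows and 30 jewellery rows.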

3. Deduplicate Across Sources

Retailer databases often share manufacturer images, so the same photo can arrive from several sources. An exact-match hash such as MD5 catches only byte-identical files and misses near-duplicates (added watermarks, minor crops). Use perceptual hashing (pHash) with a Hamming distance threshold of ≤ 10 to catch near-duplicates before training.
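The comparison step can be sketched as follows. This assumes the perceptual hashes have already been computed (for example with a library such as ImageHash) and are stored as 64-bit integers; the hash width, the function names, and the keep-the-first-seen policy are assumptions for illustration. The pairwise loop is quadratic, so a large catalog would need an index structure such as a BK-tree instead.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(hashes: dict, threshold: int = 10) -> list:
    """Return image ids to keep, dropping near-duplicates of earlier images.

    `hashes` maps image_id -> precomputed perceptual hash (assumed 64-bit int).
    An image is dropped if it lies within `threshold` bits of a kept image.
    """
    kept = []
    for image_id, h in hashes.items():
        if all(hamming_distance(h, hashes[k]) > threshold for k in kept):
            kept.append(image_id)
    return kept
```

Because the first image seen in each group is the one kept, the input order decides which source wins; sorting by preferred source beforehand is one way to control that.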

4. Separate Studio vs Lifestyle Photography

Studio shots (white background, single product) and lifestyle shots (models, room settings) have very different feature distributions. If you mix them without a flag, the model can learn background cues instead of product features. Always include an image_type column in your metadata.
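A quick check on that column might look like the sketch below. It assumes metadata rows are dictionaries (for example from `csv.DictReader`) with `category` and `image_type` keys, and that the allowed values are `studio` and `lifestyle`; those value names are an assumption, not a documented ImageHub convention.

```python
from collections import Counter

# Assumed flag values; adjust to whatever the metadata actually uses.
ALLOWED_IMAGE_TYPES = {"studio", "lifestyle"}


def count_by_type(rows):
    """Count rows per (category, image_type), rejecting missing or unknown flags."""
    counts = Counter()
    for row in rows:
        image_type = row.get("image_type")
        if image_type not in ALLOWED_IMAGE_TYPES:
            raise ValueError(f"bad image_type {image_type!r} in row {row!r}")
        counts[(row["category"], image_type)] += 1
    return counts
```

The per-category breakdown shows whether a class is dominated by one photography style, which is worth knowing before choosing a train/validation split.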

5. Validate with a Held-Out Human Review

Before shipping a dataset to your training pipeline, pull a random 2% sample and review it by eye. Look for:

  • Wrong category assignments
  • Low-quality images (blurry, clipped products)
  • Annotation inconsistencies across annotators
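Drawing the review sample reproducibly makes it easier to re-check the same images after fixes. This is a minimal sketch; the function name, the default seed, and the rule of always returning at least one image are assumptions.

```python
import math
import random


def review_sample(image_ids, fraction=0.02, seed=42):
    """Return a reproducible random sample of image ids for manual review."""
    ids = list(image_ids)
    if not ids:
        return []
    # Round up so small datasets still yield at least one image to review.
    k = max(1, math.ceil(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)
```

For a batch of 1,000 images the default fraction yields 20 images, sampled without replacement.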

Tools and Resources

ImageHub publishes structured datasets with JSONL annotations and a companion CSV index. Browse free samples to see the schema in action, or order a custom batch designed around your class requirements.

