Trick for cleaning up messy AI training data without losing your mind

I was trying to train a simple image classifier for my side project, and the dataset was a total mess. Duplicates, wrong labels, blurry shots - the usual garbage. I spent three hours manually sorting through 2,000 photos and wanted to throw my laptop out the window. Then I tried using a basic clustering algorithm to group similar images together before cleaning. It took maybe 30 minutes to set up in Python, and it flagged 400 exact duplicates and 150 near-duplicates I never would have caught by eye. The model accuracy jumped from 62% to 81% after I removed the junk. Has anyone else tried this approach or found a faster way to prep datasets?
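The post doesn't say which clustering algorithm or library was used, so the following is only a rough sketch of one common way to group duplicates and near-duplicates: an average hash per image, with images grouped when their hashes differ by only a few bits. It swaps in perceptual hashing for whatever clustering the original setup used. The function names, the `max_distance` threshold, and the input format (small grayscale grids as lists of rows) are all assumptions for illustration. Real photos would first need to be loaded and shrunk to a small grayscale grid (for example 8x8) with an image library.

```python
from collections import defaultdict


def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the grid's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Count the positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def group_duplicates(images, max_distance=1):
    """Group image names whose hashes differ by at most max_distance bits.

    images: dict mapping a name to a small grayscale grid (list of rows of ints).
    max_distance: hypothetical threshold; 0 keeps only identical hashes.
    Returns a sorted list of groups (sorted lists of names) with 2+ members.
    """
    names = sorted(images)
    hashes = {n: average_hash(images[n]) for n in names}
    parent = {n: n for n in names}  # union-find forest

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    # Pairwise comparison is O(n^2), which is tolerable for a few thousand images.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming(hashes[a], hashes[b]) <= max_distance:
                parent[find(b)] = find(a)

    groups = defaultdict(list)
    for n in names:
        groups[find(n)].append(n)
    return sorted(g for g in groups.values() if len(g) > 1)
```

Each returned group is a candidate set to review: keep one image and drop or inspect the rest. An average hash ignores uniform brightness shifts but not crops or rotations, so the threshold needs tuning against a sample you have checked by hand.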

3 comments

nora_dixon 2mo ago

Did that clustering catch the mislabeled ones too or just dupes?

thea857 2mo ago

That "mislabeled ones" part is what I'm curious about too. Did it actually flag records where the label was wrong (like a cat tagged as a dog) or was it more about grouping similar data together regardless of the label? I wonder if you'd need a separate validation step to catch those mislabeling errors, or if the clustering algorithm is smart enough to spot outliers that don't match their own group.

gavinw45 28d ago

Clustering caught dupes but for mislabels I had to manually spot check the groups.
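One possible way to narrow down that manual spot check, offered as a sketch and not as something anyone in this thread reports using: within each group of similar images, compare every label against the group's majority label and list the ones that disagree. The function name, the `min_agreement` cutoff, and the data shapes are hypothetical. It only prioritizes what to look at; a flagged item still needs a human to confirm the label is wrong.

```python
from collections import Counter


def flag_label_outliers(groups, labels, min_agreement=0.75):
    """List items whose label disagrees with a clear majority in their group.

    groups: list of lists of item names (e.g. clusters of similar images).
    labels: dict mapping each item name to its current label.
    min_agreement: hypothetical cutoff; groups whose top label covers a
        smaller share than this are skipped, since no label is trustworthy.
    Returns a list of (name, current_label, majority_label) tuples.
    """
    flagged = []
    for group in groups:
        counts = Counter(labels[name] for name in group)
        majority, votes = counts.most_common(1)[0]
        if votes / len(group) < min_agreement:
            continue  # no clear majority: review the whole group by hand
        for name in group:
            if labels[name] != majority:
                flagged.append((name, labels[name], majority))
    return flagged
```

For example, a group of four similar images with three tagged "cat" and one tagged "dog" would yield the "dog" item for review, while a two-image group split between "cat" and "dog" would be skipped as ambiguous.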
