21
Trick for cleaning up messy AI training data without losing your mind
I was trying to train a simple image classifier for my side project, and the dataset was a total mess. Duplicates, wrong labels, blurry shots - the usual garbage. I spent three hours manually sorting through 2,000 photos and wanted to throw my laptop out the window. Then I tried using a basic clustering algorithm to group similar images together before cleaning. It took maybe 30 minutes to set up in Python, and it flagged 400 exact duplicates and 150 near-duplicates I never would have caught by eye. The model accuracy jumped from 62% to 81% after I removed the junk. Has anyone else tried this approach or found a faster way to prep datasets?
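For anyone who wants to try the idea, here is a minimal sketch of one way to group look-alike images. It is not necessarily the setup described above: it swaps in a simple average hash plus Hamming distance and links anything under a threshold, which is a crude form of single-linkage clustering. The names `average_hash`, `hamming`, `group_similar`, and `max_distance` are made up for illustration, and the images are assumed to be already loaded as same-sized grayscale thumbnails (nested lists of ints); in practice you would load and resize them with an image library first.

```python
from itertools import combinations


def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Count the positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def group_similar(images, max_distance=2):
    """Group image names whose hashes are within max_distance bits.

    Uses union-find, so A~B and B~C puts A, B, and C in one group.
    Only groups with more than one member are returned.
    """
    names = list(images)
    hashes = {n: average_hash(images[n]) for n in names}
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    # Pairwise comparison is O(n^2); fine for a few thousand thumbnails.
    for a, b in combinations(names, 2):
        if hamming(hashes[a], hashes[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Hypothetical 2x2 "thumbnails" just to show the shape of the data.
images = {
    "a.png": [[10, 200], [220, 30]],
    "a_copy.png": [[10, 200], [220, 30]],        # exact duplicate
    "a_bright.png": [[20, 210], [230, 40]],      # same shot, slightly brighter
    "b.png": [[200, 10], [30, 220]],             # different image
}
duplicate_groups = group_similar(images, max_distance=1)
```

A distance of 0 means the hashes match exactly, which covers byte-identical copies as well as re-encodes that look the same; raising `max_distance` pulls in near-duplicates at the cost of more false matches, so it is worth eyeballing a few groups before deleting anything.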
2 comments
nora_dixon21d ago
Did that clustering catch the mislabeled ones too or just dupes?
6
thea85721d ago
That "mislabeled ones" part is what I'm curious about too. Did it actually flag records where the label was wrong (like a cat tagged as a dog) or was it more about grouping similar data together regardless of the label? I wonder if you'd need a separate validation step to catch those mislabeling errors, or if the clustering algorithm is smart enough to spot outliers that don't match their own group.
4
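On the mislabel question: clustering on image content alone never looks at the labels, so it will not flag a cat tagged as a dog by itself. One possible follow-up step, sketched below under assumptions, is to compare each item's label with the majority label of its cluster and flag the ones that disagree. The names `flag_label_outliers`, `clusters`, `labels`, and `min_agreement` are hypothetical, and the clusters are assumed to come from whatever grouping step you already ran.

```python
from collections import Counter


def flag_label_outliers(clusters, labels, min_agreement=0.8):
    """Return (item, its_label, cluster_majority_label) for suspicious items.

    clusters: list of lists of item names (output of any clustering step).
    labels: dict mapping item name to its current label.
    Clusters where the majority label covers less than min_agreement of the
    members are skipped, since the majority is not trustworthy there.
    """
    suspects = []
    for members in clusters:
        counts = Counter(labels[m] for m in members)
        majority, n = counts.most_common(1)[0]
        if n / len(members) < min_agreement:
            continue  # cluster is too mixed to say which label is "right"
        suspects.extend(
            (m, labels[m], majority) for m in members if labels[m] != majority
        )
    return suspects


# Hypothetical example data.
clusters = [["img1", "img2", "img3", "img4", "img5"], ["img6", "img7"]]
labels = {
    "img1": "cat", "img2": "cat", "img3": "cat", "img4": "cat",
    "img5": "dog",                 # disagrees with a mostly-cat cluster
    "img6": "dog", "img7": "cat",  # 50/50 split, so this cluster is skipped
}
suspects = flag_label_outliers(clusters, labels)
```

This only produces candidates for manual review. A flagged item could be a genuine mislabel, or it could be a correctly labeled image that just happens to look like the rest of the cluster.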