Trick for cleaning up messy AI training data without losing your mind
I was trying to train a simple image classifier for my side project, and the dataset was a total mess. Duplicates, wrong labels, blurry shots: the usual garbage. I spent three hours manually sorting through 2,000 photos and wanted to throw my laptop out the window. Then I tried running a basic clustering algorithm to group similar images together before cleaning. The script took maybe 30 minutes to set up in Python, and it flagged 400 exact duplicates and 150 near-duplicates I never would have caught by eye. Model accuracy jumped from 62% to 81% after I removed the junk. Has anyone else tried this approach or found a faster way to prep datasets?
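For anyone who wants to try the grouping idea, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the script described above. The post does not name the clustering algorithm, so this swaps in a simpler stand-in: exact duplicates are found by hashing file bytes, and near-duplicates by an average hash compared with a Hamming-distance threshold. The function names, the `max_distance` parameter, and the tiny pixel grids are all made up for the example. Real photos would first need to be loaded and shrunk to a small grayscale grid (for instance with Pillow), which is not shown here.

```python
import hashlib
from collections import defaultdict


def exact_duplicate_groups(files):
    """Group names whose raw bytes are identical (same SHA-256 digest).

    `files` maps a name to its bytes; this mapping is a stand-in for
    reading image files from disk.
    """
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


def average_hash(pixels):
    """Turn a small grayscale grid (rows of 0-255 ints) into a bit tuple.

    Each bit is 1 if the pixel is brighter than the grid's mean, else 0.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Count the positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def near_duplicate_groups(hashes, max_distance=1):
    """Greedy grouping: each image joins the first group whose
    representative hash is within `max_distance` bits, otherwise it
    starts a new group. Only groups with 2+ members are returned.
    """
    groups = []  # list of (representative_hash, [names])
    for name, h in hashes.items():
        for rep, names in groups:
            if hamming(rep, h) <= max_distance:
                names.append(name)
                break
        else:
            groups.append((h, [name]))
    return [names for _, names in groups if len(names) > 1]


# Hypothetical data: byte strings instead of real files, 2x2 grids
# instead of downscaled photos.
files = {"x.jpg": b"abc", "y.jpg": b"abc", "z.jpg": b"abd"}
grids = {
    "a": [[10, 200], [220, 15]],
    "b": [[12, 198], [225, 20]],   # slightly different pixel values
    "c": [[200, 10], [15, 220]],   # inverted pattern
}
hashes = {name: average_hash(grid) for name, grid in grids.items()}

print(exact_duplicate_groups(files))
print(near_duplicate_groups(hashes))
```

The greedy pass depends on input order and on the threshold, so treat the groups as candidates to review by eye rather than as files to delete automatically.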