Back in my first data job, I wasted a month trying to clean a messy dataset by hand.

The file was from a hospital in Portland and had thousands of duplicate patient entries. I was about to give up when a senior dev showed me a simple fuzzy matching script using Python's difflib library. It took an afternoon to write and matched entries with 95% accuracy, something my manual checks never managed. Anyone have a better tool for messy real-world data like that?
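For anyone curious what that kind of script looks like, here is a minimal sketch of the general idea, not the actual script from that job. It uses `difflib.SequenceMatcher` from the standard library; the helper names, the 0.85 threshold, and the sample names are all made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two strings,
    ignoring case and leading/trailing whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def find_likely_duplicates(names, threshold=0.85):
    """Compare every pair of names and return the pairs whose similarity
    is at or above the threshold (0.85 is an assumed cutoff, not a
    recommended one)."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            score = similarity(names[i], names[j])
            if score >= threshold:
                pairs.append((names[i], names[j], round(score, 2)))
    return pairs


# Hypothetical sample records, not taken from any real dataset.
records = ["Jon Smith", "John Smith", "Maria Garcia", "Maria  Garcia", "Lee Wong"]
duplicates = find_likely_duplicates(records)

for first, second, score in duplicates:
    print(f"{first!r} ~ {second!r} (score {score})")
```

The pairwise loop is quadratic, so on thousands of rows you would probably want to group records first (for example by birth date or postcode) and only compare within each group.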

3 comments

graygonzalez 2mo ago

"95% accuracy" sounds good until you realize that's 5% wrong matches, which is a huge deal with patient data. Manual checking might be slow, but it's way safer for something that important.

patricia_mason 2mo ago

Yeah, that's a really good point from @graygonzalez. Five percent wrong means one in twenty people get their medical info mixed up. That's not just a number, that's someone's treatment getting messed up. I'd rather wait longer for a manual check than risk that kind of mistake.

claire_walker 2mo ago

Yeah, I actually thought the same way you did at first. The 95% stat sounded fine until I really thought about one in twenty people getting wrong medical records. That changed my mind too.