12
Back in my first data job, I wasted a month trying to clean a messy dataset by hand.
The file was from a hospital in Portland and had thousands of duplicate patient entries. I was about to give up when a senior dev showed me a simple fuzzy matching script using Python's difflib library. It took an afternoon to write and matched entries with 95% accuracy, something my manual checks never could. Anyone have a better tool for messy real-world data like that?
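For anyone curious what that kind of script looks like, here is a minimal sketch using difflib's SequenceMatcher. It is not the original script: the function names, the sample names, and the 0.85 threshold are all made up for illustration, and it only flags candidate pairs for a person to review rather than merging anything on its own.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_candidate_duplicates(records, threshold=0.85):
    """Compare every pair of records and return (i, j, score) for likely duplicates.

    The threshold is an assumed starting point, not a recommended value.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((i, j, round(score, 3)))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample data, not real patient records.
    sample = ["John Smith", "Jon Smith", "Mary Jones", "john  smith."]
    for i, j, score in find_candidate_duplicates(sample):
        print(f"{sample[i]!r} ~ {sample[j]!r} (score {score})")
```

Comparing every pair is quadratic, so on thousands of rows you would probably want to group records first (for example by birth date or postcode) and only compare within each group.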
3 comments
graygonzalez 1mo ago
"95% accuracy" sounds good until you realize that's 5% wrong matches, which is a huge deal with patient data. Manual checking might be slow, but it's way safer for something that important.
4
patricia_mason 1mo ago
Yeah, that's a really good point from @graygonzalez. Five percent wrong means one in twenty people get their medical info mixed up. That's not just a number, that's someone's treatment getting messed up. I'd rather wait longer for a manual check than risk that kind of mistake.
1
claire_walker 17d ago
Yeah I actually thought the same way you did before. The 95% stat sounded fine until I really thought about one in twenty people getting wrong medical records. That changed my mind too.
4