Best practice for fuzzy legislator name matching?


#1

I’m working on code to resolve committee members, to improve the organization records in openstates/people. I need to convert people’s names to their OpenStates OCD IDs. However, I know that in many cases names are not written the same way in the committee memberships as they are in the people records themselves. Anyone who’s worked on scrapers is familiar with the problem, whether in this context, in recording roll call votes, or elsewhere.

What is current best practice for fuzzy matching of legislator names? I’m hoping there’s a piece of code somewhere that, given a name “Doe” and a list of legislator names, returns one of:

  • it’s definitely Jane Doe
  • it’s probably Jane Doe
  • it’s probably missing from the list
  • it’s ambiguous; there are legislators Jane Doe and John Doe, and I can’t tell which.
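
To make the four outcomes above concrete, here is a minimal sketch of what such a function might look like. It is not existing OpenStates code: the names `Outcome` and `resolve`, the thresholds, and the scoring rule (stdlib `difflib` similarity, with a bare surname treated as a strong but inexact hit) are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from enum import Enum


class Outcome(Enum):
    DEFINITE = "definite"
    PROBABLE = "probable"
    MISSING = "missing"
    AMBIGUOUS = "ambiguous"


def _norm(name):
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower())
    return " ".join(cleaned.split())


def _score(query, candidate):
    q, c = _norm(query), _norm(candidate)
    if q == c:
        return 1.0
    # Assumption: a query equal to one token of the candidate
    # (e.g. a bare surname) is a strong but not exact hit.
    if q in c.split():
        return 0.9
    return SequenceMatcher(None, q, c).ratio()


def resolve(query, legislators, probable=0.8, margin=0.05):
    """Return (Outcome, candidate names). Thresholds are illustrative guesses."""
    scored = sorted(((_score(query, n), n) for n in legislators), reverse=True)
    if not scored or scored[0][0] < probable:
        return Outcome.MISSING, []
    best = scored[0][0]
    # Every candidate within `margin` of the best score is a contender.
    top = [n for s, n in scored if best - s <= margin]
    if len(top) > 1:
        return Outcome.AMBIGUOUS, sorted(top)
    return (Outcome.DEFINITE if best == 1.0 else Outcome.PROBABLE), top


names = ["Jane Doe", "John Doe", "Ann Smith"]
print(resolve("Doe", names))       # ambiguous: Jane Doe and John Doe
print(resolve("Jane Doe", names))  # definite: Jane Doe
```

With only one Doe in the list, the same query would come back as probable rather than ambiguous; a name with no close candidate comes back as missing.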

If anyone’s intrigued from a research perspective, a Google search for

“edit distance” “proper name”

yields some papers.


#2

We don’t rely on fuzzy matching at all anymore. Instead, we’d planned to start recording the unmapped names (since they don’t occur that often) in the legislators repo as “other_names”, so that each one is matched once and then never again.

I do maintain a library for fuzzy matching (jamesturk/jellyfish on GitHub) that we used to use, but given the types of names we need to match, I think we’d be better off doing the big manual match.

My unplanned hiatus from work on the project will hopefully be coming to an end soon, and this is one of the biggest things on my plate.
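
As a rough sketch of the match-once idea: an exact lookup over each person’s canonical name plus any recorded “other_names”, with everything that fails collected for manual mapping. The record layout (`id`, `name`, and `other_names` as a list of `{"name": ...}` entries), the function names, and the IDs below are assumptions for illustration, not the repo’s confirmed schema.

```python
from collections import defaultdict


def build_name_index(people):
    """Map each lowercased name or alias to the set of person IDs using it."""
    index = defaultdict(set)
    for person in people:
        names = [person["name"]]
        names += [alias["name"] for alias in person.get("other_names", [])]
        for name in names:
            index[name.strip().lower()].add(person["id"])
    return index


def match_names(raw_names, index):
    """Split scraped names into exact matches and ones needing manual mapping."""
    matched, unmatched = {}, []
    for raw in raw_names:
        ids = index.get(raw.strip().lower(), set())
        if len(ids) == 1:
            matched[raw] = next(iter(ids))
        else:
            # No hit, or an alias shared by several people: leave for a human.
            unmatched.append(raw)
    return matched, unmatched


# Hypothetical records; the IDs are placeholders, not real OCD IDs.
people = [
    {"id": "ocd-person/example-1", "name": "Jane Doe",
     "other_names": [{"name": "Doe, J."}]},
    {"id": "ocd-person/example-2", "name": "John Smith"},
]
matched, unmatched = match_names(["Doe, J.", "Rep. Garcia"], build_name_index(people))
print(matched)    # {'Doe, J.': 'ocd-person/example-1'}
print(unmatched)  # ['Rep. Garcia']
```

Each name that lands in `unmatched` would be resolved by hand once and added as an alias, after which the lookup succeeds on every later run.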


#3

“they don’t occur that often”

I don’t have hard numbers at hand, but I think my mileage differs, particularly when roll call votes are considered. I’m not arguing against using a legislator repo, but I think more tooling and support, including automated help with matching, will be needed to keep it maintained well enough to support things like roll calls and committee membership in states where names are written inconsistently with their canonical form.

I have some pretty well-baked ideas about a better way to do fuzzy matching in this context, but I need to finish writing and testing before pushing it. There are two aspects:

  • “Edit distance” in this domain is not well represented by the standard edit distance metrics. For instance, first initials and first/last name reversal are assigned too high a cost.
  • The specifics of matching a group of legislators against a sometimes-larger reference group, with aliases and possibly missing reference members, require a permutation-walker that’s tuned for these quirks.
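
To illustrate the first point only, here is a small sketch of a token-level cost in which an initial standing in for a full name, a reversed name order, and a dropped first name are all cheap. It is not the approach being written up above; the function names and every cost value (0.1, 0.05) are arbitrary assumptions.

```python
from difflib import SequenceMatcher
from itertools import permutations


def tokens(name):
    # "Doe, J." -> ["doe", "j"]
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return cleaned.split()


def token_cost(a, b):
    if a == b:
        return 0.0
    # An initial standing in for a full name is cheap here,
    # rather than counting as several character edits.
    if (len(a) == 1 or len(b) == 1) and a[0] == b[0]:
        return 0.1
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def name_cost(written, canonical):
    """Cost in [0, 1]; 0 means the token sequences are identical."""
    w, c = tokens(written), tokens(canonical)
    if not w or not c:
        return 1.0
    if len(w) > len(c):
        w, c = c, w  # always align the shorter name against the longer
    best = 1.0
    # Try every alignment of the shorter name's tokens to the longer one's.
    # Names have few tokens, so brute force is acceptable for a sketch.
    for idx in permutations(range(len(c)), len(w)):
        cost = sum(token_cost(x, c[i]) for x, i in zip(w, idx)) / len(w)
        if list(idx) != sorted(idx):
            cost += 0.05  # small penalty for reversed order ("Doe, Jane")
        cost += 0.05 * (len(c) - len(w))  # small penalty per unmatched token
        best = min(best, cost)
    return best


for written in ["Doe, Jane", "J. Doe", "Doe", "John Smith"]:
    print(written, round(name_cost(written, "Jane Doe"), 2))
```

Under these assumed costs, “Doe, Jane”, “J. Doe”, and “Doe” all land very close to “Jane Doe”, while “John Smith” stays far away, which is roughly the ordering a plain character-level edit distance fails to produce.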