A new approach to legislator data


#1

I know one or two of you have already noticed it on GitHub, but we’re getting to the point where I’d love some feedback on a new idea we’ve been playing with. It should go a long way toward addressing some of the most common issues we’ve had with keeping reliable data for legislators (and, hopefully, committees).

Right now the system is set up to scrape legislators/committees nightly, and it makes some decisions that I know have baffled people newer to the project. For a bit of history: after much trial & error we decided we needed to err on the side of keeping people around when they disappeared from a state site, since there are just too many false positives to retire people automatically. Additionally, we don’t automatically merge people who are “similar,” because there are enough cases where they are actually two different people, and once data is merged it is pretty tough to undo.

Of course, all of those decisions were made when we had data quality staff. We’re operating as a volunteer project now, and we aren’t able to keep up with these things the way we were years ago.

So we’re hoping to try something new: instead of scraping legislators nightly, we’ll scrape much less frequently (essentially after elections), and the canonical source of legislator data will be flat files that the scrape updates. This means people would be free to contribute corrections and additional contact information that we can’t easily scrape, and to retire/merge people as needed, all with a simple PR. (Additionally, this means more attention can be paid to maintaining the scrapers that actually do need to run regularly: bills & votes.)
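
To make that concrete, here’s a rough sketch of what one of those flat files might look like, and how a tool could read it. The field names here are purely illustrative, not the real schema (the schema is exactly what we’re asking for feedback on below):

```python
# Illustrative only: these field names are placeholders, not the final
# schema from the openstates/people repo.
import yaml  # PyYAML

SAMPLE = """
id: ocd-person/00000000-0000-0000-0000-000000000000
name: Jane Smith
party: Democratic
roles:
  - type: lower
    district: "5"
contact_details:
  - note: Capitol Office
    voice: 919-555-0100
    email: jane.smith@example.gov
"""

person = yaml.safe_load(SAMPLE)
print(person["name"], "->", person["roles"][0]["district"])
```

A correction is then just an edit to a file like this plus a PR, with the diff easy to review.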

I’ve taken some time in recent weeks to prototype this, and I’d love to start getting feedback. The repo is here: https://github.com/openstates/people, and you’ll also notice some issues marked “help wanted” where input would be especially welcome.

Of particular interest would be thoughts on the schema and overall premise; there are a few open issues that touch on both. Please don’t give feedback on the actual data yet, as what’s there is all test data while we nail down the schema/tools/etc. Once we’ve decided to move forward with this, we’ll generate legislator files for every currently serving legislator and make a follow-up announcement, at which point we’ll be very glad to have those PRs.


#2

Hello James! First, let me thank you, and all contributors, for all your hard work on OpenStates. It’s been an invaluable data source for our work helping our clients enable their members to contact state legislators. Writing and maintaining scrapers is hard work, but it has resulted in high-quality, accurate data, which is critical for effective advocacy. Thank you!

From our perspective as consumers of state legislator data, the proposed change seems like a regression. It may well be workable to maintain a rarely updated flat file for national representatives: there are relatively few of them, and it’s national news when there are important changes.

On the other hand, there are so many state representatives that only someone focusing on a particular state would notice a change. This is why the system of leaving retired reps in place has been such a thorn in our side; we only find out when it causes an actual problem for our clients.

Leaving aside changes in actual representation, no longer getting updates on email addresses, phone numbers and office addresses (where we send actual people for protests and petition deliveries) would be unfortunate. I don’t have statistics on the rate of change for this data handy, but I suspect it is quite high and only weakly correlated to a national election cycle.

Maintaining the scrapers and keeping them running is hard work, and I am very sympathetic to the desire to run them less often and with more manual attention. Perhaps there’s a middle ground between no automated runs and running them nightly? I fear that if this change is made we may have to take running (and thus maintaining) the scrapers in-house, which would be a sad duplication of effort.

I’d much rather step up our contribution to the project and take on some of the labor of maintaining the scrapers so they can continue to run with some frequency. Let me know if you want to talk about that possibility.


#3

Hey @samtregar, thanks for jumping in so fast! A few clarifying points – I think maybe you looked at James’ unfinished sample files when he should’ve pointed everyone at the schema.

  1. Contact info isn’t meant to go away.
  2. This is meant to ease things like retirement – someone who finds an issue can just Pull Request in the new person, as opposed to filing it as a bug and hoping someone logs into the openstates system to resolve it. Granted, the PR will still need to be approved, but eyeballing a YAML change is a much lower hurdle, and it potentially opens the door to more maintainers, since we’re not talking about requiring full admin access to openstates.org
  3. Consumers can just do a regular git pull on the repo and build tools to automatically process updated YAML files (see the sketch after this list), so in theory your pipeline could be cleaner than constantly polling the API for changes.
  4. Same for contact data: if a leg changes their email or an address update becomes (un)available, it can be PR’d. We’re still working out how often to scrape these, but at least the frequency can be significantly lower than nightly.
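
On point 3, here’s a minimal sketch of what a consumer pipeline could look like, assuming a local clone and that person files end up under a data/ directory (that layout is my assumption, not settled):

```python
# Sketch of the consumer workflow: sync the repo, then process the YAML
# person files. Assumes the repo is cloned at ./people and that files
# live under data/ -- both assumptions, not committed parts of the design.
import pathlib
import subprocess

import yaml  # PyYAML

REPO = pathlib.Path("people")  # local clone of openstates/people

# Pick up whatever changed since the last run.
subprocess.run(["git", "-C", str(REPO), "pull"], check=True)

for path in sorted(REPO.glob("data/**/*.yml")):
    person = yaml.safe_load(path.read_text())
    # Feed each record into your own pipeline here.
    print(path.name, "->", person.get("name"))
```

Diffing against the previous checkout (git diff --name-only) would even tell you exactly which people changed between runs.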

#4

Can you say more about the anticipated frequency of scraping? The original message says “we’ll scrape much less frequently (essentially after elections).” That strikes me as not often enough to catch changes in contact info for state reps, nor would it account for retirements (which admittedly are not well handled in the current system either) or special elections.


#5

Thanks for the feedback, Sam-

I can state from experience that most contact info really doesn’t change too frequently, and when it does we’ve actually seen that official sources in many states lag behind (we’ve asked numerous states to update legislative contact info after people reached out to us; this would in theory let us move a bit faster there).

As for retirements: as Tim said, I’m confident this process will be greatly improved, since Miles & I won’t be major bottlenecks anymore.

Special elections do pose a bit of a challenge: we’d need to either manually add the winner or kick off a scrape, and of course that means things could lag behind. That ties into your question about what the actual scrape frequency will be. I don’t think we have an answer yet; right now my thinking has been that a scrape would be run manually, since it’ll require a bit of resolution, but perhaps we can figure out a way to scrape more frequently.


#6

RE: special elections – another thing I’ve been thinking about re: scrape frequency is cross-referencing third-party data to alert us to changes.

Depending on how much complexity we want to add (and who we want to trust), it wouldn’t be hard to automatically check our data against faster sources like Google’s Civic API or, if someone had paid access to it, the AP’s elections API.

At the very least, that could file an issue saying “Looks like this person is gone” or “Jane Smith is now in this seat,” which would make our data significantly fresher than waiting for state website updates.

Discrepancies involving more than a few people could trigger a manual scrape to resolve things wholesale.

(To be clear, I’m not advocating for importing their data, which would be a licensing headache.)
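
To sketch what that cross-check might boil down to, here’s a rough version. fetch_third_party_members() is a hypothetical stand-in for whichever faster source we end up trusting, and the data/ layout in our repo is also assumed:

```python
# Rough sketch of the cross-check idea: compare the names in our flat
# files against a faster third-party source and flag discrepancies.
# Nothing is imported from the other source; we only compare.
import pathlib

import yaml  # PyYAML

def our_members(repo=pathlib.Path("people")):
    """Names of everyone currently in our flat files (layout assumed)."""
    return {
        yaml.safe_load(p.read_text())["name"]
        for p in repo.glob("data/**/*.yml")
    }

def fetch_third_party_members():
    """Hypothetical stub: query the external source for current members."""
    raise NotImplementedError("wire up whichever source we decide to trust")

def cross_check(threshold=3):
    # Matching on bare names is naive; a real version would also compare
    # district/seat to avoid false positives on common names.
    ours, theirs = our_members(), fetch_third_party_members()
    for name in ours - theirs:
        print(f"Looks like {name} is gone")      # could auto-file an issue
    for name in theirs - ours:
        print(f"{name} is now in a seat")        # could auto-file an issue
    # Big disagreements suggest a stale snapshot: resolve wholesale.
    if len(ours ^ theirs) > threshold:
        print("Large discrepancy: kick off a manual scrape")
```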

Just for fun, I wrote a fuller sample script along these lines earlier today – https://github.com/openstates/people/issues/14

At least in that case we’d go from “50 scrapers have to stay running every night or they derail the data” to “one API has to stay running.”

