I know one or two of you have already noticed it on GitHub, but we’re getting to the point where I’d love some feedback on a new idea we have been playing with. It should go a long way toward addressing some of the most common issues we’ve had with keeping reliable data for legislators (and hopefully committees).
Right now the system is set up to scrape legislators/committees nightly, and it makes some decisions that I know have baffled people newer to the project. For a bit of history: after much trial & error we decided we needed to err on the side of keeping people around when they disappeared from a state site, because there are just too many false positives to retire people automatically. Additionally, we don’t automatically merge people who are “similar” because there are enough cases where they really are two different people, and once data is merged it is pretty tough to undo.
Of course, all of those decisions were made when we had data quality staff, but we’re operating as a volunteer project now, and we aren’t able to keep up with these things the way we were years ago.
So we’re hoping to try something new: instead of scraping legislators nightly, we’ll scrape much less frequently (essentially after elections), and have the canonical source of legislators be flat files that can be updated by this scrape. This means people would be free to contribute corrections, add contact information that we can’t easily scrape, and retire/merge people as needed with a simple PR. (Additionally, this means more attention can be paid to maintaining the scrapers that actually do need to run regularly: bills & votes.)
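To make the flat-file idea a bit more concrete, here is a rough sketch of what retiring or merging a person could look like when each legislator is just a small record on disk. Everything in it is an assumption for illustration only: the JSON format, the field names (`id`, `end_date`, `other_ids`), and the helper functions are hypothetical and are not the schema or tooling in the repo, which is still being worked out.

```python
import json


def dump_person(person):
    """Serialize one legislator record as stable, diff-friendly JSON text."""
    # Sorted keys and fixed indentation keep PR diffs small and readable.
    return json.dumps(person, indent=2, sort_keys=True) + "\n"


def retire(person, end_date):
    """Return a copy of the record marked as no longer serving."""
    retired = dict(person)
    retired["end_date"] = end_date  # hypothetical field name
    return retired


def merge(keep, duplicate):
    """Fold a duplicate record into the one being kept.

    Fields already set on `keep` win; the duplicate only fills gaps,
    and its id is remembered so old references can still be resolved.
    """
    merged = dict(duplicate)
    merged.update(keep)
    merged["other_ids"] = sorted(  # hypothetical field name
        set(keep.get("other_ids", []))
        | set(duplicate.get("other_ids", []))
        | {duplicate["id"]}
    )
    return merged


if __name__ == "__main__":
    # Made-up test records, not real legislators.
    a = {"id": "person-1", "name": "Jane Doe", "email": "jane@example.com"}
    b = {"id": "person-2", "name": "Jane Q. Doe", "phone": "555-0100"}
    print(dump_person(merge(a, b)))
    print(dump_person(retire(a, "2019-01-01")))
```

The point of the sketch is only that both operations become small, reviewable text changes: a retirement is one added field, and a merge is one file edited and one file deleted, either of which can be reverted like any other commit.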
I’ve taken some time in recent weeks to prototype this, and I’d love to start getting some feedback. The repo is here: https://github.com/openstates/people. You’ll also notice some issues marked “help wanted” that I’d love to see feedback on.
Of particular interest would be thoughts on the schema and overall premise; there are a few issues in the repo that touch on these. Please don’t give feedback on the actual data yet, as what is there is all test data while we nail down the schema/tools/etc. Once we’ve decided to move forward with this, we’ll generate legislator files for every currently serving legislator and make a follow-up announcement, at which point we’ll be very glad to have those PRs/etc.