Adding individual vote records in KY


#1

Hi all. I’m new. I’m interested in contributing a scraper for vote records in KY. I see that you all have the bill records, and bill sponsorship records, but not who voted how.

Any recommendations as I start the process? I’ve never contributed to an open source project before, so I just want to know how to be a good contributor, basically.


#2

Hi @kenwrites, welcome!

Just a word of warning that we don’t currently have KY vote data because it’s only available from the state in some pretty difficult-to-parse PDF files. – https://apps.legislature.ky.gov/record/19rs/house_votes/29_comm_votes.pdf

That said, if you have experience parsing PDFs (or are just up for a challenge), you should clone our GitHub repo – https://github.com/openstates/openstates – and then follow the getting-started instructions in the README.

In the case of KY, you might want to start by getting

docker-compose run scrape ky bills --scrape

going, then check your _data/ky folder and you should see a bunch of JSON files representing bills from the scraper.

From there you can look at the documentation on adding votes – https://opencivicdata.readthedocs.io/en/latest/scrape/bills.html and modify the scraper to grab those PDF files, parse them, and add votes to the associated bills.


#3

Adding to what Tim provided:

Are you familiar with Git and Github? If not, you might want to check out https://egghead.io/courses/how-to-contribute-to-an-open-source-project-on-github .

Only a few votes are roll call votes, and hence only a few have “who voted how”. But those few votes tend to be the most important ones.

Parsing the PDF files is only necessary for the per-legislator roll-call data, which of course are the most interesting votes. But Kentucky is missing ALL vote records. I believe everything other than the per-legislator roll-call data can be parsed from the HTML – see parse_actions in the current Kentucky bill scraper for a starting point. You might want to start with just adding vote events (class pupa.scrape.VoteEvent) as a first PR, before tackling the PDF roll calls.

I believe the doc (https://opencivicdata.readthedocs.io/en/latest/scrape/bills.html) is at least trivially out of date, in that pupa.scrape.Vote was renamed at some point to pupa.scrape.VoteEvent. If you run into trouble with the scraper framework, I’d check another state’s vote scraper for other possible staleness.
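As a rough sketch of what building a vote event looks like (based on other states' scrapers – the keyword arguments may differ slightly in your pupa version, and the stub class below only stands in for `pupa.scrape.VoteEvent` so the snippet runs on its own):

```python
# Sketch only: in a real scraper you would do
#   from pupa.scrape import VoteEvent
# and `yield` the vote object from your scrape method.
# This stub just mimics the calls used so the example is self-contained.

class VoteEvent:
    """Stand-in for pupa.scrape.VoteEvent."""

    def __init__(self, chamber, start_date, motion_text, result,
                 classification, legislative_session, bill=None):
        self.chamber = chamber
        self.start_date = start_date
        self.motion_text = motion_text
        self.result = result
        self.classification = classification
        self.legislative_session = legislative_session
        self.bill = bill
        self.counts = {}   # option -> total, e.g. {"yes": 87}
        self.sources = []

    def set_count(self, option, value):
        self.counts[option] = value

    def add_source(self, url):
        self.sources.append(url)


vote = VoteEvent(
    chamber="lower",
    start_date="2019-02-19",           # hypothetical date
    motion_text="Third Reading and Passage",  # hypothetical motion
    result="pass",
    classification="passage",
    legislative_session="2019RS",      # hypothetical session id
    bill=None,  # in a real scraper, pass the Bill object being scraped
)
vote.set_count("yes", 87)
vote.set_count("no", 9)
vote.add_source(
    "https://apps.legislature.ky.gov/record/19rs/house_votes/29_comm_votes.pdf"
)
```

Once you get to the PDF roll calls, the per-legislator step is just `vote.yes(name)` / `vote.no(name)` on the real class.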

The PR submission process runs your code through flake8, a style checker, and rejects the PR if there is any lint. You may want to run flake8 locally before submitting.

Good luck!


#4

@tims Gotcha. Thanks for the direction! I’ve seen those PDFs, and I see the problems. I’ll start with your suggestions and go from there.

One small, interesting thing: I got a chance to speak with a developer who works for the Kentucky Legislative Research Commission (keeper and publisher of legislative records in KY). He said that his hands are largely tied, but if there’s a chance he could change the format of the PDFs to make them easier to parse, do you all have model PDF output from another state? Something a little easier for the scrapers to read? I’m not sure what is in his power to change, but it might be worth asking if we can provide a model for him.


#5

@EdStaub I am familiar with Github, but I don’t mind a refresher on how to contribute. Thanks for the link!

Also, I see what you’re saying about the HTML data not being processed fully. And yes, that looks like lower-hanging fruit, so that would be a good place to start.

I’ll take a look at flake8 as well.


#6

Great @kenwrites – Re: PDFs – If it has to be PDF, anything that can be parsed with pdftotext and fairly simple regexes is a huge improvement. Going to a single column is also very helpful for automated parsing.

The misc vote format is not too bad, though single column would make it easier.

I know this is not super likely since those are scans, but since we’re wishlisting, PDF with the text actually embedded (as opposed to a photo of the text) is a big help too.

I don’t have a sample state offhand but here’s a good general format guide:

------------
Session: 2019
Bill: HB 1
Date: Vote Date in a consistent format
Vote: Motion to do X and Y. -- If there is more than one of the same motion on the same date, some kind of differentiator; it can even just be a (2) or something.
Result: Pass

Yea: [names broken up by consistent separator, newline is fine too]
Nay: [names broken up by consistent separator, newline is fine too]
Abstain: [names broken up by consistent separator, newline is fine too]
Present: [names broken up by consistent separator, newline is fine too]
Not Voting: [names broken up by consistent separator, newline is fine too]

It can also be a list of names then the vote, that’s about equally easy to parse.

Anything beats circling the votes in pen, since that’s a real pain to OCR.
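A format like the sample above really is regex-friendly. Here’s a minimal sketch (stdlib only; field names taken from the sample, and assuming semicolon-separated names and text already extracted with pdftotext):

```python
import re

# Hypothetical pdftotext output following the proposed format above.
SAMPLE = """\
Session: 2019
Bill: HB 1
Date: 2019-02-19
Vote: Motion to do X and Y.
Result: Pass

Yea: Smith; Jones; Lee
Nay: Doe
Not Voting: Roe
"""

def parse_vote(text):
    """Parse one vote record into a dict of fields plus per-option name lists."""
    vote = {"options": {}}
    for line in text.splitlines():
        # Single-value header fields.
        m = re.match(r"(Session|Bill|Date|Vote|Result):\s*(.*)", line)
        if m:
            vote[m.group(1).lower()] = m.group(2).strip()
            continue
        # Roll-call lines: option name followed by a separated name list.
        m = re.match(r"(Yea|Nay|Abstain|Present|Not Voting):\s*(.*)", line)
        if m:
            names = [n.strip() for n in m.group(2).split(";") if n.strip()]
            vote["options"][m.group(1)] = names
    return vote

print(parse_vote(SAMPLE)["options"]["Yea"])  # → ['Smith', 'Jones', 'Lee']
```

A real scraper would then map the parsed names onto legislators and feed the counts and roll calls into the vote object.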


FWIW with scrapers, I usually start by getting a scrape running with --scrape and taking a look at the output files to get a sense of the format, then changing something stupid (change the title for every bill to “test” or something) just to see how it all works.

One more tip that can save you some time: if you pass `--fastmode` to the scraper, it will use cached HTML instead of requesting the pages fresh. This means you won’t pull new data from the state, but it saves time in dev when you’re just trying to check how your changes look.


#7

@tims Gotcha. --fastmode sounds like a good tip. I’ll keep that in mind. If I run into too much trouble, I’ll see if the dev has the power to switch to a single-column PDF as well. Thanks again!