Great @kenwrites – RE:PDFs – If it has to be PDF, anything that can be parsed with pdftotext and a fairly simple regex is a huge improvement. Going to single column also helps a lot with automated parsing.
The misc vote format is not too bad, though single column would make it easier.
I know this is not super likely since those are scans, but since we’re wishlisting, PDF with the text actually embedded (as opposed to a photo of the text) is a big help too.
I don’t have a sample state offhand but here’s a good general format guide:
Bill: HB 1
Date: Vote Date in a consistent format
Vote: Motion to do X and Y. -- If there is more than one of the same motion on the same date, add some kind of differentiator; it can even just be a (2) or something.
Yea: [names broken up by consistent separator, newline is fine too]
Nay: [names broken up by consistent separator, newline is fine too]
Abstain: [names broken up by consistent separator, newline is fine too]
Present: [names broken up by consistent separator, newline is fine too]
Not Voting: [names broken up by consistent separator, newline is fine too]
It can also be a list of names followed by the vote; that’s about equally easy to parse.
Anything beats circling the votes in pen, since that’s a real pain to OCR.
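To give a sense of why that layout is easy to work with, here's a rough Python sketch of parsing it. Everything here is illustrative, not actual scraper code: `pdf_to_text` and `parse_votes` are names I made up, it assumes poppler's `pdftotext` is on your PATH, and it assumes names are comma-separated or one per line.

```python
import re
import subprocess

# Field labels from the format guide above. "Bill", "Date" and "Vote" hold a
# single value; the rest hold a list of names.
NAME_FIELDS = ("Yea", "Nay", "Abstain", "Present", "Not Voting")
FIELD_RE = re.compile(
    r"^(Bill|Date|Vote|Yea|Nay|Abstain|Present|Not Voting):\s*(.*)$"
)


def pdf_to_text(path):
    """Run pdftotext (assumed installed) and return the extracted text."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],  # "-" sends output to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def split_names(value, sep):
    """Split on the separator and drop empty entries."""
    return [name.strip() for name in value.split(sep) if name.strip()]


def parse_votes(text, sep=","):
    """Return one dict per vote; each "Bill:" line starts a new vote."""
    votes = []
    current = None
    field = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = FIELD_RE.match(line)
        if match:
            field, value = match.groups()
            if field == "Bill":
                current = {name: [] for name in NAME_FIELDS}
                votes.append(current)
            if current is None:
                continue  # ignore anything before the first "Bill:" line
            if field in NAME_FIELDS:
                current[field].extend(split_names(value, sep))
            else:
                current[field] = value
        elif current is not None and field in NAME_FIELDS:
            # Names continued on following lines (the "newline is fine" case).
            current[field].extend(split_names(line, sep))
    return votes
```

You'd call it as `parse_votes(pdf_to_text("votes.pdf"))`, where the filename is a placeholder. Wrapped motion text and anything else that doesn't fit the pattern is just skipped in this sketch, which is part of why a consistent format matters so much.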
FWIW with scrapers, I usually start by getting a scrape running with `--scrape` and taking a look at the output files to get a sense of the format, then changing something stupid (change the title for every bill to “test” or something) just to see how it all works.
One more tip that can save you some time: if you pass `--fastmode` to the scraper, it will use cached HTML instead of requesting the pages fresh. This means you won’t pull new data from the state, but it saves time in DEV when you’re just trying to check how your changes look.