Curious - why does CA use SQLAlchemy & MySQL?


#1

It caught me off guard, to be sure. What’s up with using them?

I’m going to fork the script and write something that doesn’t require these, though if there is any reason I shouldn’t, I’m happy to be educated :smile:

Thank you!


#2

California provides SQL downloads that include the complete data, which is cleaner & more complete than the website. Wherever such a source exists, we favor it over the much more fragile / time-consuming HTML scrapers. CA in particular has had a lot of site changes that would make a scraper difficult to maintain. I think there’s even data in the SQL file that isn’t available on the public site, but I don’t know that we make excellent use of it. As you’ve probably noticed, the update process is more cumbersome than most.


#3

I am entirely convinced and had no idea about the additional info – thank you!


#4

To add to @james’s reply: it looks like this year they added most (maybe all?) of the data in the MySQL database to TSV-style .dat files in the bulk download [1], but nobody has taken on the project of verifying that the data is the same and rewriting the scraper if so, since it would be a good chunk of work.

If anybody is interested in taking this on, let me know. I’d be happy to walk you through how it works now.

[1] http://downloads.leginfo.legislature.ca.gov/
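
For anyone picking this up, here is a minimal sketch of what the "verify that the data is the same" step could look like. It is not how the current scraper works; it just compares two sets of rows, one parsed from a .dat file and one fetched from the database, after converting every value to a string. The function names and the `null_token` parameter are hypothetical, and how the .dat files actually represent NULLs, dates, and decimals is an assumption that would need checking against real files.

```python
from collections import Counter


def normalize(row, null_token="NULL"):
    """Turn a row into a tuple of strings so DB values and TSV text can be compared.

    null_token is an assumed placeholder for NULL in the .dat files.
    """
    return tuple(null_token if value is None else str(value) for value in row)


def diff_rows(dat_rows, db_rows, null_token="NULL"):
    """Return (only_in_dat, only_in_db) as Counters of normalized rows.

    Counters are used instead of sets so duplicate rows are not silently merged.
    """
    dat = Counter(normalize(r, null_token) for r in dat_rows)
    db = Counter(normalize(r, null_token) for r in db_rows)
    return dat - db, db - dat
```

Usage would be something like `only_dat, only_db = diff_rows(rows_from_file, rows_from_query)`; if both come back empty, the two sources agree for that table under this normalization. Note that `str()` on database dates or decimals may not match how the dump formats them, so mismatches there would need a closer look before concluding the data differs.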


#5

OMG @tims please!! What’s the best way to connect?!?

Long story short: what’s at MnActivist.org I want to implement at CaActivist.org, and I’m totally happy to put in the effort to tackle this work. All my tech broke, but I’ve got my Raspberry Pi VPN’ed into a larger computer :wink: so I can get back to work until parts arrive, or until I decide to suck it up and buy a new computer.

Thank you


#6

Ah, I’d missed this. Making this switch would be awesome. Nobody ever wants to work on fixes to the CA scraper because setting that environment up is a bit of a nightmare.


#7

@canin shoot me an email at showerst@gmail.com and we can find a time to connect.

In the meantime, if you’re familiar with legislative data already, just grab one of those monster pubinfo_ files from the CA site I linked above and take a look at the .dat files. (For sanity’s sake you can ignore the .lob and SQL files; the download contains binary SQL dumps too.)

I’m not certain they’re a 1:1 match for the database, but hopefully!
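
If it helps with that first look, here is a rough sketch for peeking at one of the .dat files. It assumes the files are tab-delimited text in UTF-8 and reads fields raw without interpreting any quoting; none of that is confirmed here, so adjust as needed. `peek_dat` is a hypothetical helper name, not something from the existing scraper.

```python
import csv
from collections import Counter
from itertools import islice


def peek_dat(path, limit=5, encoding="utf-8"):
    """Return the first `limit` rows and a tally of how many fields each row has.

    Assumes tab-delimited text; QUOTE_NONE leaves any quote characters in place
    so the raw field contents are visible.
    """
    with open(path, newline="", encoding=encoding) as fh:
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = list(islice(reader, limit))
    widths = Counter(len(row) for row in rows)
    return rows, widths
```

If `widths` shows more than one field count for a file, that is a hint the delimiter, quoting, or embedded newlines are not what this sketch assumes.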


#8

I had to build out a scraper for the CA data using the DB dumps and the TSV files for another project I’m contracted on. As of now, it doesn’t cover all the data (though it’s pretty close), and it wouldn’t take much to finish the import of the remaining data.

The scraper is its own site with an API front end. I can check with them (since they paid to have it built) whether they’ll let me open it up so OpenStates can use it. It pulls from CA every 3 hours and imports all the data from the daily dumps.


#9

That’d be greatly appreciated if you can contribute any parts of that :slight_smile:


#10

Sorry for the delay in responding, but they declined my request. I will say it’s not 100% of everything on the site, but most of the data is in there.