Confusion on "Start Contributing to Open States" Docs


#1

I’m new to Open States and am trying to get up to speed by reading the “Start Contributing to Open States” page in the docs. I have a few questions:

  1. What should I do when a local scrape fails in the middle of a job? I have thousands of JSON files in _data; is there a way to run the ETL step and load the files that already exist into Postgres?

  2. Where does the MongoDB instance come from, and what data is stored there vs. in the Postgres db? I followed the instructions in the GitHub README and can connect to Postgres, but I’m confused about where MongoDB comes into the equation.

  3. Is pupa a complete replacement for billy, or how should I use the two together? Is there documentation for pupa somewhere?


#2

I can help a little.

In case you haven’t heard already: scraping into your own db is not recommended, in large part because many of the state websites don’t scale well under scraping load.

  1. I believe that “Start Contributing” doc is stale. AFAIK there isn’t a MongoDB database any longer.

  2. pupa structurally replaces billy, but the schema stored in the db is completely different (see the sketch just below for a rough idea of pupa’s data model). There’s some documentation on GitHub, but it’s not great.
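
To give a rough idea of what pupa expects, a minimal scraper looks something like this. This is only a sketch from memory of pupa’s Scraper/Bill classes; argument names may differ slightly between versions, and the bill details and URL are made up.

```python
# Sketch of a pupa-style scraper; argument names may vary between pupa versions.
from pupa.scrape import Scraper, Bill


class ExampleBillScraper(Scraper):
    def scrape(self):
        # A real scraper would pull these values from the state's website.
        bill = Bill(
            identifier="HB 1",
            legislative_session="2019",
            title="An example bill",
            classification="bill",
        )
        bill.add_source("https://example.state.gov/bills/hb1")  # made-up URL
        yield bill
```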


#3

Thanks for the prompt response! Regarding my first question on local scraping: I understand why it’s not recommended, but given the rate limits of the API and the lack of up-to-date bulk data downloads, I believe running the scrapers locally is my only option (100% open to other suggestions). My use case is data science research, so I don’t need the data to be kept up to date once I pull it, but I do need a lot of it upfront.

So is there a way to import the already-scraped JSON files into the Postgres db, or to restart a job at the point where it failed? I know running locally is not suggested, but restarting a failed job from scratch each time seems to be the worst of all worlds.


#4

> So is there a way to import the already-scraped JSON files into the Postgres db, or to restart a job at the point where it failed? I know running locally is not suggested, but restarting a failed job from scratch each time seems to be the worst of all worlds.

You can refetch only the kind of records you need to rerun, e.g. “bills”. See the README.
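
For example, something along these lines should let you scrape just one kind of record and then import whatever JSON is already sitting in _data into Postgres. This assumes pupa’s update command still takes scraper names as positional arguments and supports separate --scrape / --import phases, and “nc” is just a placeholder jurisdiction, so double-check it against the README.

```python
# Sketch: rerun only the bills scraper, then import existing _data JSON.
# Assumes pupa's CLI still supports --scrape / --import and per-scraper args;
# "nc" is only a placeholder jurisdiction.
import subprocess

# Scrape only bills for one jurisdiction (writes JSON into ./_data).
subprocess.run(["pupa", "update", "nc", "bills", "--scrape"], check=True)

# Import whatever is already in _data into Postgres, without re-scraping.
subprocess.run(["pupa", "update", "nc", "--import"], check=True)
```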

Without knowing exactly what you’re fetching it’s hard to be sure, but I’d be surprised if fetching from Open States via the v2 GraphQL API is slower than scraping the states directly, even considering rate limits. On the other hand, the learning curve for using it efficiently is a bit steep.
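
As a concrete starting point, a small script like the one below pages through bills without ever touching the state sites. The endpoint and the X-API-KEY header are what the v2 API used last time I tried it; the field names in the query are from memory, so verify them in the GraphQL explorer before relying on them.

```python
# Sketch of a v2 GraphQL query against the Open States API.
# Field names are from memory; verify against the schema explorer.
import requests

API_KEY = "your-api-key-here"  # placeholder
QUERY = """
{
  bills(first: 5, jurisdiction: "North Carolina") {
    edges {
      node {
        id
        identifier
        title
      }
    }
  }
}
"""

resp = requests.post(
    "https://openstates.org/graphql",
    json={"query": QUERY},
    headers={"X-API-KEY": API_KEY},
)
resp.raise_for_status()
for edge in resp.json()["data"]["bills"]["edges"]:
    print(edge["node"]["identifier"], edge["node"]["title"])
```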

The scrapers for some states are currently failing. See http://bobsled.openstates.org/ and https://github.com/openstates/openstates/issues.