A different scraping strategy


#1

Hi, everyone. I’ve been following the openstates development for a while, because we’re doing something similar: scraping state laws and making them easy to access. It looks like we’re taking a different approach, though, so I thought I’d present what we do.

  • Writing a scraper for each state or country, e.g., Nevada and Oregon. Each is responsible only for converting the source documents into a stable JSON schema that mimics their structure, with no semantic changes. Although the JSON interface lets us write these in any language, so far we’re using strongly, statically typed languages (e.g. Haskell) in order to lock down the JSON output (see the sketch after this list).
  • We use test-driven development on fixtures, which are samples of the current original source files.
  • We then import these with Ruby on Rails code (closed source for now) that converts the JSON tree into the Rails app’s data model, so there’s a separate import adapter for each jurisdiction. This is somewhat TDD’d, but only about 40%. The Rails models and database have full validations and constraints to keep the data clean.
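
To make “lock down the JSON output” concrete, here’s the sketch mentioned above: a minimal Haskell/Aeson example. The record fields are hypothetical, not our actual Nevada schema; the point is that the emitted JSON can only ever have the shape the compiler has already checked.

    {-# LANGUAGE DeriveGeneric #-}

    import Data.Aeson (ToJSON, encode)
    import GHC.Generics (Generic)
    import qualified Data.ByteString.Lazy.Char8 as BL

    -- Hypothetical record types; each real jurisdiction's schema differs,
    -- but the ToJSON instances guarantee the output matches these types.
    data Section = Section
      { number  :: String   -- e.g. "1.010"
      , caption :: String
      , text    :: String
      } deriving (Show, Generic)

    data Chapter = Chapter
      { chapterNumber :: String
      , chapterTitle  :: String
      , sections      :: [Section]
      } deriving (Show, Generic)

    instance ToJSON Section
    instance ToJSON Chapter

    main :: IO ()
    main = BL.putStrLn (encode (Chapter "1" "Judicial Department" []))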

We’ve found that our scraper test suite is a good self-check before a run: we can update the test fixtures and re-run the tests on them. We then get an advance warning if the source doc formats have changed enough that a code change is necessary.
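
As an illustration (not our actual suite), that check can be as small as an Hspec spec run over a freshly downloaded fixture; the file name and the parseChapter stand-in here are placeholders:

    import Test.Hspec

    -- Stand-in for the scraper's real parsing function.
    parseChapter :: String -> [String]
    parseChapter = lines

    main :: IO ()
    main = hspec $
      describe "chapter parser" $
        it "still finds sections in a current sample of the source page" $ do
          html <- readFile "fixtures/chapter-sample.html"
          parseChapter html `shouldSatisfy` (not . null)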

Our current big project is a pipeline to run this code and update all jurisdictions daily.


#2

I’m not an OpenStates project member, so to some extent I speak out of turn.

Other readers will want to quickly note that (AFAICT) you’re scraping statutes, not bills, legislators, votes, committees, etc.

It seems like the biggest source of regressions for OpenStates is the evolution of the state sites that the data is scraped from. If you see signs that this will come to be your experience too, you may want to focus more on detection and robustness (to the extent possible!) in the face of this kind of churn. As a simple example, if a state suddenly seems to have dissolved into anarchy (no laws!?), you might find this dubious and ensure that the scrape is discarded and maintainers alerted.
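
As a rough sketch of what I mean (in Haskell only because that’s what your scrapers already use, and with an arbitrary threshold), a check like this could sit between the scraper and anything that publishes its output:

    -- Refuse to pass along a result set that is implausibly small.
    -- The threshold of 100 is a placeholder; a better check might compare
    -- against the record count from the previous successful run.
    checkScrape :: [a] -> Either String [a]
    checkScrape records
      | length records < 100 = Left "Suspiciously few records: discard this scrape and alert the maintainers"
      | otherwise            = Right records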

Are you trying to use the same JSON schema and constraints across all states? If so, would you mind pointing at it on GitHub?

Good luck!


#3

Hi @dogweather – Cool project!

It sounds like we’re actually taking a very similar approach. Under the hood, the scrapers are just a set of state-specific code that all dump into the same data model, where the data can be checked and then exported to JSON or inserted directly into a DB. Our docs don’t always make this super clear, which is something we hope to improve soon.

Openstates made the decision long ago to keep all the scrapers in one language to ease maintainability, as we’ve seen many people dip in and out of the project over the years.

Pupa has built-in tests for things like what a bill number looks like, references, etc., but our scraped sites change so much more than laws would that we find it’s easier to have ‘fragile’ scrapers that blow up when they don’t see an expected HTML element than to risk bad data getting in.

The difference from a heavy TDD approach here is mainly that our data can be hard to test programmatically. A human can look at a bill title pretty quickly and see that “Home Bill Legislators Search” is an errant snippet of HTML that snuck in due to a site redesign, and not a bill’s proper title, but that’s tough to ASSERT for. Legal data is a bit more structured, so it may make more sense for you to scrape first and verify later.

Have you considered releasing your parsed JSON for each state as a repo so that users don’t have to run their own scrapers? That’s impractical with bill data but seems like it would be really useful for laws.


#4

I thought I’d be doing that at first, but so far it’s been extremely efficient just to have the guaranteed well-formed JSON … even though each jurisdiction has its own schema.

The next architectural step will be to define a unified JSON representation and have a second layer of adapters produce that.
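
For example (all names here hypothetical), that second layer could be a typeclass that each jurisdiction-specific output type implements:

    -- The shared representation every jurisdiction maps into.
    data UnifiedSection = UnifiedSection
      { citation :: String
      , heading  :: String
      , body     :: String
      } deriving Show

    class ToUnified a where
      toUnified :: a -> UnifiedSection

    -- One jurisdiction-specific type and its adapter.
    data NevadaSection = NevadaSection
      { nrsNumber  :: String
      , nrsCaption :: String
      , nrsText    :: String
      }

    instance ToUnified NevadaSection where
      toUnified s = UnifiedSection (nrsNumber s) (nrsCaption s) (nrsText s)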


#5

I hadn’t thought of it, but that’s a great idea. It would also mean that we could try to use GitHub or GitLab as our diffing and message-queuing tool. The scrapers can run daily and store the JSON as a commit.


#6

So this is a pretty exciting idea. In preparation I’ve Dockerized the Nevada parser. Anyone can run it now with zero installation:

$ docker run publiclaw/nevada-nrs > nrs.json

That’ll be the interface for running any scraper. This just needs to get executed by a script that compresses the JSON and commits it to GitHub. (The JSON is close to GitHub’s 100 MB file size limit.)
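
Here’s a sketch of that wrapper, in Haskell just to stay in one language in this thread (any shell script or CI job would do the same); the file names and the gzip step are assumptions:

    import System.Process (callCommand)

    main :: IO ()
    main = do
      callCommand "docker run publiclaw/nevada-nrs > nrs.json"
      callCommand "gzip -kf nrs.json"   -- stay safely under GitHub's 100 MB limit
      callCommand "git add nrs.json.gz"
      callCommand "git commit -m \"Nevada NRS: daily scrape\""
      callCommand "git push"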