Hi, everyone. I’ve been following openstates development for a while, because we’re doing something similar: scraping state laws and making them easy to access. It looks like we’re taking a different approach, though, so I thought I’d present what we do.
- Writing scrapers (in any programming language) for each state or country, e.g. Nevada and Oregon. Each scraper is responsible only for converting the source documents into a stable JSON schema that mirrors their structure, with no semantic changes. Although the JSON interface lets us write these in any language, so far we’re using strongly and statically typed languages (e.g. Haskell) in order to lock down the JSON output.
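For illustration, a toy version of that scraper contract might look like this, sketched in Ruby for brevity (our real scrapers are in languages like Haskell, and the source format and field names here are invented, not our actual schema):

```ruby
# Hypothetical sketch: the scraper's only job is to turn a source document
# into a JSON tree that mirrors the source structure, with no semantic
# changes. Toy source format: one "number|title" line per statute section.
require "json"

def scrape(source_text)
  sections = source_text.lines.map do |line|
    number, title = line.chomp.split("|", 2)
    { "number" => number, "title" => title }
  end
  { "sections" => sections }
end

puts JSON.generate(scrape("1.010|Definitions\n1.020|Scope\n"))
# => {"sections":[{"number":"1.010","title":"Definitions"},{"number":"1.020","title":"Scope"}]}
```

The point is that the output is a plain structural mirror of the source; all interpretation is deferred to the import step.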
- We test-drive each scraper against fixtures, which are saved samples of the current original source files.
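A fixture-driven check in that style might look like the following sketch (the parser, fixture content, and expected output are all hypothetical, and a real suite would load fixtures from disk and use a test framework rather than a bare script):

```ruby
# Hypothetical fixture test: the fixture is a saved sample of the current
# source file, and the test pins the scraper's output against a known-good
# result. Inlined here so the example runs standalone.
def scrape_line(line)
  number, title = line.chomp.split("|", 2)
  { "number" => number, "title" => title }
end

FIXTURE  = "1.010|Definitions\n"  # saved sample of the source document
EXPECTED = { "number" => "1.010", "title" => "Definitions" }

actual = scrape_line(FIXTURE)
raise "fixture mismatch: #{actual.inspect}" unless actual == EXPECTED
puts "fixture test passed"
```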
- We then import the JSON with (closed-source for now) Ruby on Rails code that converts the JSON tree into the Rails app’s data model, so there is a separate import adapter for each jurisdiction. This part is somewhat TDD’d, but only about 40%. The Rails models and database have full validations and constraints to keep the data clean.
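Since the import code is closed source, here is only a rough sketch of the idea, with the validations hand-rolled in plain Ruby; the real code uses Rails models with ActiveRecord validations plus database constraints, and all names here are illustrative:

```ruby
# Hypothetical sketch of the import side: a per-jurisdiction adapter walks
# the scraper's JSON tree and builds validated records in the app's model.

class StatuteSection
  attr_reader :number, :title

  def initialize(number:, title:)
    @number = number
    @title  = title
    validate!
  end

  private

  # Stand-in for Rails-style presence and format validations.
  def validate!
    raise ArgumentError, "number required" if number.to_s.empty?
    raise ArgumentError, "title required"  if title.to_s.empty?
    raise ArgumentError, "bad number: #{number}" unless number =~ /\A\d+(\.\d+)*\z/
  end
end

# The adapter: maps one jurisdiction's JSON shape onto the shared model.
def import_sections(json_tree)
  json_tree.fetch("sections").map do |s|
    StatuteSection.new(number: s.fetch("number"), title: s.fetch("title"))
  end
end

records = import_sections("sections" => [{ "number" => "1.010", "title" => "Definitions" }])
puts records.first.number
# => 1.010
```

Keeping the validations in the model layer means a malformed scraper run fails loudly at import time instead of polluting the database.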
We’ve found that our scraper test suite is a good self-check before a run: we can refresh the fixtures from the current source files and re-run the tests against them, which gives us advance warning if the source document formats have changed enough to require a code change.
Our current big project is a pipeline to run this code and update all jurisdictions daily.