Feasibility of running local scrapers as a replacement for API usage


#1

Hello all. We’re considering switching to running the OpenStates scrapers locally instead of making calls to the API to get data on state legislators. I believe the biggest problem with this would be that we would lose access to the fixes that have been made in the OpenStates database that aren’t available in the scraper code. Am I correct about that? Is there some way we could get access to these changes locally?

Thanks for the help!


#2

Likewise. In my case it’s because of the New Hampshire situation (not a code issue) and, more recently, weird access issues. I believe this is what Tim Showers does at GovHawk; you might want to ping him.

Can you share what is moving you in that direction?


#3

We suspect there are problems with how the scraper data is being used in the API. For example, see my other thread - the scrapers have the right data for MA and NY state legislators but the API is returning incorrect data. Correct data being better than incorrect data, we’re thinking about switching! On the other hand, I believe we’d see some regressions if we switched since some legitimate issues have been fixed in the database and not in the scrapers.

What is the NH situation? What do you mean by weird access issues?


#4

I think the issue with legislators has to do with trying to keep historical data. They can’t just be deleted if they stop showing up. On the other hand, the active flag should be getting turned off for them, and I’d suggest posting an issue to that effect, showing the cases you’re seeing.

The weird access issues are hopefully transient and hopefully not worth getting into; see the thread “What is current API v1 rate limit?”. The NH issue is that it hasn’t been scraped for almost two months (see https://github.com/openstates/openstates/issues/2220 ).


#5

Neither of these issues relates to the feasibility of running scrapers locally, but I suppose they may be additional reasons to want to do that! Thanks for the notes.


#6

Running scrapers locally is generally a bad idea in our opinion, though we’ve worked with people to make exceptions.

A few reasons:

  • In the past we have seen our access revoked, since the load from extra scrapers has been an issue for some states, particularly if people don’t respect scraping etiquette.
  • We are also completely unable to support any running of scrapers or make any commitments to stability there. While we haven’t broken API compatibility in years, we do break the scrapers API occasionally, as it isn’t considered a public interface.
  • Two big upcoming projects aim to improve user-submitted data and the process around it; users who aren’t using the API won’t get any of those advantages.

If you do decide to go ahead anyway, please be respectful of the state sites. We recommend contacting them on behalf of your organization beforehand.
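
As a rough illustration of what respecting scraping etiquette can look like in practice, here is a minimal sketch that checks a site's robots.txt rules and picks a delay between requests. It uses only Python's standard `urllib.robotparser`; the robots.txt content, the `example-org-scraper` user agent, the `example.gov` URLs, and the 2-second fallback delay are all made up for the example and are not taken from the OpenStates scrapers.

```python
import urllib.robotparser

# Hypothetical robots.txt content. A real one would be downloaded from the
# state site being scraped.
EXAMPLE_ROBOTS_TXT = """\
User-agent: example-org-scraper
Crawl-delay: 5
Disallow: /private/
"""

USER_AGENT = "example-org-scraper"  # hypothetical name identifying your organization
DEFAULT_DELAY = 2.0                 # assumed fallback, in seconds, when no Crawl-delay is given


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without touching the network."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent=USER_AGENT, default=DEFAULT_DELAY):
    """Return the site's requested crawl delay, or the fallback if none is set."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def may_fetch(parser, url, user_agent=USER_AGENT):
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

A scraper loop would then skip any URL for which `may_fetch` is false and sleep for `polite_delay(...)` seconds between requests. None of this replaces contacting the state beforehand.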


#7

Thanks James, this is useful information. I’m a little surprised to hear this given the new and much more restrictive API usage limits for the OpenStates API. I would think that was designed to push people to do more of their own data access rather than lean on the OpenStates API.


#8

The limits were calculated based on normal usage and should only affect about 4 percent of users; they are intended to protect availability. If you’re affected, we can work out an increase.


#9

Ok, good to know. I know we are affected but that’s just because we’re fetching the whole legislator set at once as fast as possible. There’s no particular reason we have to fetch that fast aside from developer impatience.
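
For anyone in the same position, slowing the fetch down is simple. Here is a minimal sketch of pacing a batch of requests so that consecutive calls start at least a fixed interval apart. `fetch_paced`, `fetch_one`, and the interval value are hypothetical names and numbers for illustration; the actual API call and the actual rate limit are not shown here.

```python
import time


def fetch_paced(keys, fetch_one, min_interval=1.0,
                sleep=time.sleep, clock=time.monotonic):
    """Call fetch_one(key) for each key, starting calls at least min_interval seconds apart.

    fetch_one is whatever function performs a single request (hypothetical here).
    sleep and clock are injectable so the pacing can be tested without waiting.
    """
    results = []
    last_start = None
    for key in keys:
        if last_start is not None:
            # Wait out whatever remains of the interval since the previous call began.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(fetch_one(key))
    return results
```

Usage would look something like `fetch_paced(state_codes, get_legislators, min_interval=2.0)`, where `get_legislators` is your own function wrapping a single API request.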