Search features


#1

Hey folks,

I have been slowly developing legalerts.us over the last couple of years, and it has been helpful to me and the few others who use it.

One recurring feature wish I have is better search functionality via the OS API. I see from the GitHub issues that many of the things I wish for (search by bill id, case insensitivity, fuzzy matching, full text) are already tracked there.

One of the things I have also been thinking about is some automatic meta-tagging of bill text, so that a search could still find (e.g.) a bill about gun control even if it never uses the phrase “gun control” (or even “gun”).
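To make the meta-tagging idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `TOPIC_TERMS` map, the `tag_bill` and `search_by_topic` helpers, and the sample bills are all made up, and a real tagger would need a curated or learned vocabulary rather than a hand-written dict.

```python
import re

# Hypothetical topic -> trigger-term map; a real one would be curated or learned.
TOPIC_TERMS = {
    "gun control": {"firearm", "firearms", "handgun", "ammunition", "concealed carry"},
    "education": {"school", "schools", "teacher", "curriculum"},
}


def tag_bill(text):
    """Return the sorted list of topics whose trigger terms appear in the text."""
    normalized = re.sub(r"\s+", " ", text.lower())
    tags = set()
    for topic, terms in TOPIC_TERMS.items():
        for term in terms:
            # Word-boundary match so "school" does not fire on "preschooler".
            if re.search(r"\b" + re.escape(term) + r"\b", normalized):
                tags.add(topic)
                break
    return sorted(tags)


def search_by_topic(bills, query):
    """Return ids of bills whose tags include the query (case-insensitive)."""
    wanted = query.strip().lower()
    return [bill_id for bill_id, text in bills.items() if wanted in tag_bill(text)]
```

With this, a search for "Gun Control" would match a bill that only talks about "firearms and ammunition", because the match is against the derived tags rather than the bill text itself.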

I have a good deal of experience building search engines (I created dezi dot org and was a contributor to Apache Lucy and Swish-e amongst other things), and would love to contribute to OS in this way. My preference is to build search functionality outside the db though, using a tool built for search (Elasticsearch, Solr, Dezi, etc). This architecture has a few advantages (fast, feature-ful, scalable) but does add complexity to an overall system architecture, so I’m curious what y’all think about that and whether you are open to such a design. I wrote a bit about how I did this at 18F: https://18f.gsa.gov/2016/04/08/how-we-get-high-availability-with-elasticsearch-and-ruby-on-rails/

Happy to chat more about this, here or in real-time.

pek


#2

Hi pek,

Good area to work on.

Are you using the new OS API v2?


#3

FWIW, I’m working on a parallel effort: an app to help with manual categorization and indexing of bills.


#4

No, I am not using the v2 API yet. I rely on the https://github.com/WideEyeLabs/ruby-openstates gem and it is still on v1.


#5

@EdStaub - interesting! Is your work public yet?


#6

@karpet No, far from it. I just started on it. I started with the notion of a need to be able to categorize bills into a dozen or so buckets, but it quickly grew, because categorization is so subjective. The notion now is to reproduce all the features of a good book index, like cross-references, sub-entries, etc.


#7

We’re definitely considering introducing search functionality. Right now we’re leaning towards using Postgres’ newer full text search features. I’ve found on a number of work projects that duplicating data to ES isn’t always worth the added complexity, especially since our volunteer time is at a premium.

This isn’t decided yet, though. I’d be willing to hear arguments.
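For anyone weighing the Postgres option, here is a rough sketch of what a ranked full text query could look like, written as a small Python helper that only builds the SQL and its parameters. The `bill_text(bill_id, body)` table and the `build_bill_search` helper are assumptions for illustration, not the actual Open States schema. The `%s` placeholders follow the psycopg2 parameter style.

```python
def build_bill_search(query, limit=20):
    """Return (sql, params) for a ranked full text search over bill text.

    Assumes a hypothetical table bill_text(bill_id, body); not the real schema.
    """
    sql = (
        "SELECT bill_id, "
        "ts_rank(to_tsvector('english', body), plainto_tsquery('english', %s)) AS rank "
        "FROM bill_text "
        # @@ is the Postgres match operator between a tsvector and a tsquery.
        "WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s) "
        "ORDER BY rank DESC "
        "LIMIT %s"
    )
    # The user query is passed as a parameter, never interpolated into the SQL.
    return sql, (query, query, limit)
```

The 'english' configuration lowercases and stems terms, which would cover the case insensitivity wish. In practice this would also want a GIN index on the `to_tsvector('english', body)` expression (or a stored tsvector column) so the match does not rescan every bill.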

Note for interested parties: The first component of this is here: https://github.com/openstates/text-extraction and we’ll need help there regardless of how we decide to index the underlying data.


#8

Yes James, I would suggest you try Postgres here. I will definitely check out this GitHub project. Also, I'm happy to help in case you need a tester.


#9

Thanks James.

It would help me to understand the infrastructure a little, so I know where best to help. The particular indexing tool is less important to me than knowing:

(a) how is bill text currently stored?
(b) is there a way via the API to see what bills have been crawled within a given time period?

The text-extraction lib seems straightforward enough. How does it currently fit into your pipeline?

My assumption is that you have crawlers running on some kind of schedule that fetch new/changed bills, extract the text, and then store the text somewhere (S3?) for later.
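In code, the flow I am picturing is roughly the sketch below. This is only my guess at the shape of the pipeline, not how Open States actually works: the `run_pipeline` function, the `id` / `updated_at` / `raw` keys, and the dict standing in for the storage layer are all hypothetical.

```python
from datetime import datetime, timezone


def run_pipeline(bills, last_run, extract_text, store):
    """Process bills changed since last_run; return the ids that were stored.

    bills: iterable of dicts with hypothetical keys "id", "updated_at", "raw".
    extract_text: callable turning a raw document into plain text.
    store: dict-like target standing in for S3 or a database.
    """
    processed = []
    for bill in bills:
        if bill["updated_at"] <= last_run:
            continue  # unchanged since the previous crawl
        store[bill["id"]] = extract_text(bill["raw"])
        processed.append(bill["id"])
    return processed
```

If something like the `updated_at` filter were exposed via the API, a search indexer could use the same check to pick up only the bills crawled within a given time period.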

Happy to read up on this if it’s documented somewhere already.

pek