Next steps for development, and a job posting

With our recent analysis of Iraq security contractor documents, the Overview prototype has been used for its first real story. But our prototype is just that: a proof-of-concept tool, built as quickly as possible to validate certain algorithms and approaches. The next step is to create a solid architecture for future work. We need to make this technology web-deployable, scalable, and integrated with DocumentCloud.

If you haven’t already, take a look at our writeup of how we used the Overview prototype for our Iraq security contractors work. We started with documents posted to DocumentCloud, then downloaded the original PDF files for processing with a series of Ruby scripts. After processing, we used the prototype visualization interface, written in Java, to find topics and tag documents in bulk according to their subject. We’d like to streamline this whole process, so that Overview works like this:

  • Upload raw material to DocumentCloud.
  • Select documents for exploration in Overview, by using the DocumentCloud project and search functions.
  • Launch Overview, directly in the browser. Use the visualization tools to explore the set, create subject tags, and apply them to the documents.
  • Export Overview’s tags back into native DocumentCloud tags and annotations.

In short, we want to tightly integrate Overview’s semantic visualization with DocumentCloud’s storage, search, viewing, annotation, and management tools. This means that Overview has to have a web front end, which means the interface needs to be JavaScript, not Java. We also suspect that for performance reasons, the visualizations will need to be rendered in WebGL. On the back end, Ruby is just too slow for natural language processing. For example, our common bigram detection code (which helps Overview discover frequently used two-word phrases) takes several minutes to build an intermediate table with hundreds of thousands of elements for the 4,500 pages of the Iraq contractor set. We’d like the Overview architecture to scale to millions of pages, roughly a thousand times larger, which would take days with the current algorithm. So the server-side processing needs to be implemented in a higher performance language, such as Java.
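To make the bigram step concrete, here is a minimal sketch of what "common bigram detection" can look like. It is written in Python purely for illustration and is not the prototype's Ruby code. The function names, the tokenizer, and the `min_count` threshold are all assumptions for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def common_bigrams(documents, min_count=2):
    """Count adjacent word pairs across all documents and keep the frequent ones.

    `min_count` is an illustrative threshold, not a value taken from Overview.
    """
    counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        # zip(tokens, tokens[1:]) yields each pair of neighboring tokens.
        counts.update(zip(tokens, tokens[1:]))
    return {bigram: n for bigram, n in counts.items() if n >= min_count}
```

The `counts` table here plays the role of the intermediate table described above: it holds one entry per distinct word pair, which is why it grows to hundreds of thousands of elements on even a few thousand pages.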

The good news is that Overview uses one of the same basic data structures as search engines, a TF-IDF weighted index. DocumentCloud uses the popular Solr search platform, so integration with DocumentCloud will also pave the way for integration with any application which is based on Solr. That’s a lot of possible applications.
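For readers unfamiliar with TF-IDF, the sketch below shows one common textbook variant: term frequency within a document multiplied by the log of the inverse document frequency. This is an illustration under stated assumptions, not Overview's or Solr's actual weighting, which may differ in normalization and smoothing. The function name `tf_idf_index` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_index(tokenized_docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: (count / doc_length) * log(n_docs / doc_frequency).
    """
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    index = []
    for tokens in tokenized_docs:
        term_counts = Counter(tokens)
        total = len(tokens)
        index.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return index
```

Under this weighting, a term that appears in every document gets a weight of zero, while a term concentrated in a few documents gets a high weight. That property is what makes the same structure useful both for search ranking and for grouping documents by topic.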

Given all of the above, this is the current planned order of development tasks, which we think could be accomplished in about a year by a competent engineer: Rewrite the prototype with a Java back end and JavaScript/WebGL UI. Integrate the user experience with DocumentCloud’s tagging system. Then integrate the back end with the Solr index data structures and APIs. As we go, we’ll collect feedback from our growing tester and user community and decide what to build next; there is a wide range of problems we could address.

We’re hiring two engineers on a full-time basis to accomplish this, perhaps one person who’s more inclined to the user interface, and one who is more into the back end processing. We’re looking for:

  • Solid Java or JavaScript engineering experience, preferably 3-5 years of work on large applications.
  • Familiarity with open source development projects.
  • Experience in computer graphics, visualization, natural language processing, or distributed systems a plus.

This is a contract position. We’d prefer if you worked with us out of the AP offices in New York, but we’ll consider remote contributors. Please contact jstray@ap.org if interested.
