1. What problem does Overview solve?

Overview is intended to help journalists and members of the public make sense of massive, disorganized collections of electronic documents.

Such document dumps might be published as part of government or corporate transparency initiatives, released due to Freedom of Information requests, or leaked to the press. Very often they consist of thousands if not millions of pages of information, provided without any sort of index or summary. It’s like trying to make your way through a vast room containing unlabeled files stacked to the ceiling.

There might be something of interest in such files. Or there might not be. Maybe a single document somewhere holds a “smoking gun” that proves wrongdoing, or maybe there’s a much broader pattern to be found. But it could literally take years for a reporter to read every page.

Overview addresses this problem by producing interactive, explorable maps of the contents of very large numbers of documents. The purpose is not so much to help people find what they are looking for, as it is to give a detailed understanding of “what’s in there” when looking at a huge, unstructured database or document dump.

2. Aren’t there already systems that help people find information in millions of documents?

Yes, but they have significant limitations. Search engines of various types can only help users find what they are already looking for — they can’t suggest interesting unexplored stories, or show the topics that the documents cover. Also, search engines do not create visualizations. Visual displays of information are powerful because they leverage the human eye’s amazingly sensitive pattern detection ability. But existing visualization tools either don’t handle large document sets well, or are proprietary, inflexible, and too expensive for newsrooms and interested citizens.

3. What about DocumentCloud? Isn’t that a document set tool for journalists?

DocumentCloud addresses the key problems of document input, OCR, storage, sharing, annotation, search, and publishing. It does not provide visual analytics capability. So, for example, DocumentCloud will let you search for place names, but it can’t plot all mentioned places on a map. Overview integrates with DocumentCloud, and is designed to import and visualize your DocumentCloud projects.

4. How do I get my documents into Overview?

First upload them to a DocumentCloud project, then import them into Overview. More detailed instructions are here.

5. What types of documents can Overview handle?

Overview is designed for text: lots of text, plain human text. It is not designed for tables, primarily numeric data, or a dump of records from a database — unless there is a field that has plain English text in it. The algorithms are designed to work on the sort of narrative text that a human would actually read.

We support only English at the moment. Support for other languages is not too difficult to add and is coming; if you need a specific language, tell us!

6. What file formats can Overview import?

Because Overview relies on DocumentCloud for storage, it can import any document that you have successfully uploaded to DocumentCloud, including PDF, Word, HTML, and plain text. A complete list is here. If your documents are not in a supported format, you’ll have to find some way to convert them first. Sorry :(

7. I still don’t really understand what Overview would do. Can you give a concrete example?

Sure! See this full-text visualization of the Iraq war logs for an example of one kind of visualization we wish to implement. Or watch this video of a demo we did at the NICAR 2011 conference.

8. What role does the Associated Press play in the development of Overview?

Overview is a project of the Associated Press, supported by the John S. and James L. Knight Foundation. The AP is providing management resources and office space. The real-world environment of a global newsroom is the ideal place to develop state-of-the-art reporting technologies, and the AP’s broad network of members and customers is a built-in user community.

9. There are already lots of people working on data and document visualization projects. Why do another one?

There are many people experimenting with visualization tools and technology for journalism, but Overview is uniquely positioned to succeed in producing a professionally engineered, real-world tool. We have set a baseline size of 10 million documents, which has forced us to adopt a high-performance, scalable, cloud-based architecture. This brings the tool well out of the realm of prototype, and we are dedicated to producing not just a technology experiment but a useful open platform for further development. Most crucially, Overview is being developed in the newsroom by working journalists.

The project also brings together a strong multidisciplinary team. Jonathan Stray (Associated Press) was formerly a senior computer scientist at Adobe Systems and has turned many high-performance computer graphics techniques into shipping products. Rick Pienciak (Associated Press) is a veteran investigative reporter and currently heads the investigative team at the AP. Dr. Tamara Munzner (University of British Columbia) is a noted expert in information visualization. Dr. John Stasko (Georgia Tech) has spent the last five years developing a ground-breaking visual analytics system for law enforcement investigations.

10. I have a question you haven’t answered.

Ask us!