Visual document mining for journalists

Keyboard shortcuts in Overview

It’s a little-known fact that Overview has several keyboard shortcuts to make navigating through your documents even faster:

  • j, k – view the next and previous document in the list.
  • arrow keys – navigate through the tree, selecting parent, child, and sibling folders.
  • u – go back to the document list when viewing a single document.

Both sets of keys are essential for rapid review. You can select a folder, press j to read the first document (which automatically switches from the document list to the single-document view), and then press the right arrow to move to the next folder in the tree.

Algorithms are not enough: lessons bringing computer science to journalism

There are some amazing algorithms coming out of the computer science community which promise to revolutionize how journalists deal with large quantities of information. But building a tool that journalists can use to get stories done takes a lot more than algorithms. Closing this gap has been one of the most challenging and rewarding aspects of building Overview, and I really think we’ve learned something.

Overview is an open-source tool to help journalists sort through vast troves of documents obtained through open government programs, leaks, and Freedom of Information requests. Such document sets can include hundreds of thousands of pages, but you can’t find what you don’t know to search for. To solve this problem, Overview applies natural language processing algorithms to automatically sort documents according to topic and produce an explorable visualization of the complete contents of a document set.

I want to get into the process of going from algorithm to application here, because — somewhat to my surprise — I don’t think this process is widely understood.  The computer science research community is going full speed ahead developing exciting new algorithms, but seems a bit disconnected from what it takes to get their work used. This is doubly disappointing, because understanding the needs of users often shows that you need a different algorithm.

The development of Overview is a story about text analysis algorithms applied to journalism, but the principles might apply to any sort of data analysis system. One definition says data science is the intersection of computer science, statistics, and subject matter expertise. This post is about connecting computer science with subject matter expertise.

The algorithmic honeymoon

In October 2010 I was working at the Associated Press on the recently released Iraq War Logs. AP reporters toiled for weeks with a search engine to find stories within these 391,832 documents. It was painful, and there had to be a better way.

Rather than deciding what to look for, I wanted the computer to read the documents and tell me what was interesting. I had a hunch that classic techniques from information retrieval (TF-IDF and cosine similarity) might work here, so I hacked together a proof-of-concept visualization of one month of the Iraq War Logs data using Ruby and Gephi.
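
In case it helps to see what that means in practice, here is a minimal sketch of the same idea in Python with scikit-learn (not the original Ruby and Gephi code; the sample strings stand in for real report text):

    # Minimal sketch of TF-IDF + cosine similarity for grouping similar documents.
    # Illustrative only -- not the original proof-of-concept code.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "IED detonated near convoy, two civilians wounded",   # placeholder report text
        "convoy struck by IED, civilian casualties reported",
        "tanker truck explosion at fuel depot",
    ]

    tfidf = TfidfVectorizer(stop_words="english")
    vectors = tfidf.fit_transform(docs)       # one TF-IDF vector per document
    similarity = cosine_similarity(vectors)   # pairwise similarity matrix

    print(similarity.round(2))  # the two convoy reports score highest against each other

Documents with high cosine similarity get drawn close together; that is the entire basis of the grouping.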

And it worked! By grouping similar documents together and coloring them by incident type we were able to see the broad structure of the war. It was immediately clear that most of the violence was between civilians, and we found clusters of events around tanker truck explosions, kidnappings, and specific battles.

A few months later we had a primitive interactive visualization. It was exciting to see the huge potential of text analysis in journalism! This was our algorithmic honeymoon, when the problems were clear and purely technical, and we took big steps with rapid iterations.

But that demo was all smoke and mirrors. It was the result of weeks of hacking at file formats and text processing and gluing systems together, and there was no chance that anyone but me could ever run it. It was research code, the place where most visualization and natural language processing technology goes to die. No one attempted to do a story with the system because it wasn’t mature enough to try.

Worse, it wasn’t even clear how you would do a story starting from one of these visualizations. Yes, we could see patterns in the data, but what did those patterns mean and how would we turn them into a story? In retrospect, this uncertainty should have told us that despite our progress in algorithms, we didn’t yet understand the journalism part of the problem.

Getting real work done

The next step was a prototype tool, initially developed by Stephen Ingram at UBC and completed by the end of 2011. This version introduced the topic tree and its folders for the first time. And I had a document set: 4,500 pages of recently declassified reports concerning private security contractors in Iraq. Trying to do a story about these documents taught us a lot about the difference between an algorithm and an application.

The moment I began working with these documents — as a reporter, not a programmer — I discovered that it was stupendously important to have a smooth, integrated document viewer. In retrospect it seems obvious that you’ll need to read a lot of documents while doing document mining, but it was easy to forget that sort of thing in the midst of talking about document vectors and topic modeling and fancy visualizations. I also found that I needed labels to show the size of each cluster, got frustrated by the overly complex tagging model, and implemented more intuitive panning and zooming in the scatterplot window. A few weeks of hacking eventually got me to a system I could use for reporting.

The final story included the results of this document set analysis, reporting from other document sources, and an interview with a State Department official. This was the first time we were able to articulate a reporting methodology: I explored the topic tree from left to right, investigating each cluster and tagging to mark what I’d learned, then followed up with other sources. The aim of the reporting process was to summarize and contextualize the content of a large number of documents. This was a huge step forward for Overview, because it connected the very abstract idea of “patterns in the data” to a finished story. We thought all document set reporting problems would take this form. We were wrong.

Just as the proof-of-concept was research code, the prototype was the kind of code you write on deadline to get the story done by any means necessary. The data journalism community often writes and releases code written for a single story. This is valuable because it demonstrates a technique and might even provide building blocks for future stories, but it’s usually not a finished tool that other people can easily use.

We learned this lesson vividly when we tried to get people to install the Overview prototype. Git clone, run a couple of shell scripts to load your documents, how hard could it be? It turned out to be very hard. At NICAR 2012 I led a room full of people through installing Overview and loading up a sample file. We had every type of problem: incompatible versions of git, Ruby, and Java; operating system differences; and lots of people who had never used a command line before. Of 20 people who tried, only 3 got the system working. We were beginning to make contact with our user community.

Usability trumps algorithm

We re-wrote Overview as a web application to solve our installation woes (largely the work of Jonas Karlsson and Adam Hooper). We also dropped the scatterplot, the visualization we had started with, because log data and user interviews showed no one was using it. We went all-in on the tree and had a deployed system by the end of 2012.

Do you understand what is happening in this screenshot? Is it clear to you that the window on the lower left is a list of documents, each represented by a line of extracted keywords? It wasn’t obvious to our users either, and no one used this new system for many months.

We knew that Overview was useful, because we and others had done stories with the prototype. But we were now expecting new people to come in fresh and learn to use the system without our help. That wasn’t happening. So we did think-aloud usability testing to find out why. We immediately discovered a number of serious usability problems. People were having a hard time getting their documents into Overview. They didn’t understand the document list interface. They didn’t understand the tree.

We spent months overhauling the UI. We hired a designer and completely rebuilt the document list. And based on user feedback, we changed the clustering algorithm.

During the prototype phase we had developed a high-performance document clustering algorithm based on preferentially sampling the edges between highly similar documents and building connected components, documented in this technical report. We were very proud of it. But it tended to produce long lists of very small clusters, meaning that each folder in the tree could have hundreds of sub-folders. This was a terrible way to navigate through a tree.

So we replaced our fancy clustering with the classic k-means technique. We apply this recursively, splitting each folder into at most five sub-folders according to an adaptive algorithm. The resulting tree is not as faithful to the structure of the data as our original clustering algorithm, but that doesn’t matter. Overview’s visualization is for humans, not machines. The point is not to have a hyper-accurate representation of the data, but a representation that users are able to interpret and trust. For this reason, it was absolutely necessary to be able to explain how Overview’s clustering algorithm works to a non-technical audience.
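
A rough sketch of that recursive splitting, in Python with scikit-learn rather than Overview’s actual Scala code (the fixed thresholds here are simplifications of the adaptive algorithm):

    # Sketch: recursively split TF-IDF document vectors into at most five
    # sub-folders per folder, stopping when a folder gets small.
    # Illustrative only -- not Overview's implementation.
    from sklearn.cluster import KMeans

    def build_tree(vectors, doc_ids, max_children=5, min_size=10):
        node = {"docs": doc_ids, "children": []}
        if len(doc_ids) <= min_size:
            return node                       # small folders stay as leaves
        k = min(max_children, len(doc_ids))   # never ask for more clusters than docs
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(vectors)
        for cluster in range(k):
            idx = [i for i, label in enumerate(labels) if label == cluster]
            node["children"].append(
                build_tree(vectors[idx], [doc_ids[i] for i in idx],
                           max_children, min_size))
        return node

Each level of recursion becomes one level of the folder tree.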

What do journalists actually do with documents?

We solved our usability problems by the summer of 2013 and journalists began to use our system; we’ve had a great crop of completed stories in the last six months. And as we gained experience we finally began to understand what journalists actually do with a set of documents. We have seen four broad types of document-driven stories, and only one of them is the “summarize these documents” task we originally thought we wanted to support. In other cases the journalist is looking for something specific, or needs to classify and tag every document, or is looking to separate the junk from the gold.

Today we have a solid connection to our users and their problems.  Our users are generally not full-time data journalists and have typically never seen a command line. They get documents in every conceivable format, from paper to PDF. Even when the material is provided in electronic form it may need OCR, or the files may need to be split into their original documents.  Our users are on deadline and therefore impatient: Overview’s import must be extremely quick or reporters will give up and start reading their documents manually. And each journalist might only use Overview once a year when a document-driven story comes their way, which means the software cannot require any special training.

We learned what journalists actually wanted to do, and we implemented features to do it. We implemented fuzzy search to help find things in error-prone OCR’d material. We added an easy way to show the documents that don’t yet have tags, for those projects where you really do need to read every page. And Overview now supports multiple languages and lets you customize the clustering. We are still working on handling a wide range of import formats and scenarios, including integrated OCR.

This is what the UI looks like today.

Algorithms are not enough

Overview began when we saw that text analysis algorithms could be applied to journalism. We originally envisioned a system for stringing together algorithmic building blocks, a concept we called a visualization sketching system. That idea was totally wrong, because it was completely disconnected from real users and real work. It was a technologist’s fantasy.

Unfortunately, it appears that much of the natural language processing, machine learning, and visualization community is stuck in a world without people. The connection between the latest topic modeling paper and the needs of potential users is weak at best. Such algorithms are evaluated by abstract statistical scores such as entropy or precision-recall curves, which do not capture critical features such as whether the output makes any sense to users. Even when such topic models are built into exploratory visualization systems (like this and this) the research is typically disconnected from any domain problem. While it seems very attractive to build a general system, this approach risks ignoring all real applications and users. (And the test data is almost always academic papers or news archives, both of which are unrealistically clean.) We are seeing ever more sophisticated techniques but very little work that asks what makes one approach “better” than another, which is of course highly dependent on the application domain.

There is a growing body of work that recognizes this. There is work on designing interpretable text visualizations, research which compares document similarity algorithms to human ratings, and evolving metrics for topic quality that have been validated by user testing.  See also our discussion of topic models and XKCD. And we are beginning to see advanced visualization systems evaluated with real users on real work, like this.

It’s also important to remember that manual methods are valuable. Reporters will spend days reading and annotating thousands of pages because they are in the business of getting stories done. Machine learning might help with categorize-and-count tasks, but the computer is going to make errors that may compromise the accuracy of the story, so the journalist must review the output anyway. The best system would start with  a seamless UI for manual review and tagging, then add a machine learning boost. This is exactly what we are pursuing for Overview.

Our recommendation to technologists of all stripes is this: get out more. Don’t design for everyone but for specific users. Make the move from algorithmic research to the anything-goes world of getting the work done. Optimize your algorithms against user needs rather than abstract metrics; those needs will be squishy and hard to measure, but they’ll lead you to reality. Test with real data, not clean data. Finish your users’ projects, not your projects. Only then will you understand enough to know what algorithms you need, and how to build them into a killer app.

What do journalists do with documents? The different kinds of document-driven stories

Large sets of documents have been central to some of the biggest stories in journalism, from the Pentagon Papers to the Enron emails to the NSA files. But what, exactly, do journalists do with all these documents when they get them? In the course of developing Overview we’ve talked to a lot of journalists about their document mining problems, and we’ve discovered that there are several recurring types of document-driven story.

The smoking gun: searching for specific documents

In this type of work the journalist is trying to find a single document, or a small number of documents, that make the story. This might be the memo that proves corruption, the purchase order that shows money changed hands, or the key paragraph of an unsealed court transcript.

Jarrel Wade’s story about the Tulsa police department spending millions of dollars on squad car computers that didn’t work properly began with a tip. He knew there was an ongoing internal investigation, but little more until his FOIA request returned almost 7,000 pages of emails. His final story rested on a dozen key documents he discovered by using Overview to rapidly and systematically review every page.

This story demonstrates several recurring elements of smoking gun stories. Wade had a general idea what the story was about, which is why he asked for the documents in the first place. But he didn’t know exactly what he was looking for, which makes text search difficult. Keyword search can also fail if there are many different ways to describe what you’re looking for, or if you need to look for a word that has several meanings — imagine searching for “can” meaning container, and finding that almost every document contains “can” meaning able. Even worse, OCR errors can prevent you from finding key documents, which is why Overview supports fuzzy search.

The trend story: getting the big picture

As opposed to a smoking gun story where only specific documents matter, a trend story is about broad patterns across many documents. A comprehensive analysis of a comprehensive set of documents makes a powerful argument.

For my story about private security contractors in Iraq I wanted to go beyond the few high-profile incidents that had made headlines during the height of the war. I used Overview to analyze 4,500 pages of recently declassified documents from the U.S. Department of State in order to understand the big picture questions. What were the day-to-day activities of armed private security contractors in Iraq? What kind of oversight did they have? Did contractors frequently injure civilians, or was it rare?

Overview showed the broad themes running across many documents in this unique collection of material. Combined with searches for specific types of incidents and a random sampling technique to back my claims with numbers, I was able to tell a carefully documented big picture story about this sensitive issue.

Categorize and count: turning documents into data

Some trend stories depend on hard numbers: of 367 children in foster care, 213 were abused. 92% of the complaints concerned noise. The state legislature has never once proposed a bill to address this problem. This type of story involves categorizing every document according to some scheme. Both the categories you decide to use and the number of documents in each category can be important parts of the story.

For their comprehensive report on America’s underground market for adopted children, Ryan McNeill, Robin Respaut and Megan Twohey of Reuters analyzed more than 5,000 messages from a Yahoo discussion group spanning a five-year period. They created an infographic summarizing their discoveries: 261 different children were advertised over that time, from 34 different states. Over 70% of these children had been born abroad, in at least 23 different countries. They also documented the number of cases where children were described as having behavioral disorders or being victims of physical or sexual abuse. When combined with narratives of specific cases, these figures tell a powerful story about the nature and scope of the issue.

Overview’s tagging interface is well suited to categorize-and-count tasks. Even better, it can take much of the drudgery out of this type of work because similar documents are automatically grouped together. When you’re done you can export your tags to produce visualizations. We are planning to add machine learning techniques to Overview so that you can teach the system how you want your documents tagged.

Wheat vs. chaff: filtering out irrelevant material

Sometimes a journalist gets hold of a lot of potentially good material, but you can’t publish potential. Some fraction of the documents are interesting, but before reporters can report they have to find that fraction.

In the summer of 2012 Associated Press reporter Jack Gillum used FOIA requests to obtain over 9,000 pages of then-VP nominee Paul Ryan’s correspondence with over 200 Federal agencies. He used Overview to find the material that Ryan had written himself, as opposed to the thousands of pages of attachments and other supporting documents. By analyzing Ryan’s letters he was able to show that the Congressman was privately requesting money from many of the same government programs he had publicly criticized as wasteful.

This type of analysis is somewhere between the smoking gun and trend stories: first find the interesting subset of documents, then report on the trends across that subset. Overview was able to sort all of Ryan’s correspondence into a small number of folders because it recognized that the boilerplate language on his letterhead was shared across many documents.

What do you need to do with your documents?

These are the patterns we’ve seen, but we’ve also discovered that there are huge variations. Every story has unique aspects and challenges. Have you done an interesting type of journalistic document set analysis? Do you have a problem that you’re hoping Overview can solve? Contact us.

Advanced search: quoted phrases, boolean operators, fuzzy matching, and more

Overview now supports advanced syntax in the search field, like this:

This gives you an enormous amount of control over finding specific documents.

  • Use AND, OR and NOT to find all documents with a specific combination of words or phrases.
  • Use quotes to search for multiple word phrases like “disaster recovery” or “nacho cheese”.
  • Use ~ after any word to do a fuzzy search, like “Smith~”. This will match all words with up to two characters added, deleted, or changed. Great for searching through material with OCR errors.
  • By default, multiple words are now ANDed together if you don’t specify anything else.
  • Use “title:foo” to search for all documents with the word “foo” in the title (the title is the upload filename, or the contents of the title column if you imported from CSV)
  • You can use wildcards like * and ? or even full regular expressions using the syntax text:/regex/ or title:/regex/. Note that regular expressions cannot have spaces in them, because you are actually searching through the index of words used in the documents, not the original text.
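
Putting several of these together, a single query can be quite precise. For example (the terms here are purely illustrative):

    title:memo AND ("security contractor" OR guard~) NOT draft

This would find documents with “memo” in the title that mention either the exact phrase “security contractor” or a word within two edits of “guard”, while excluding anything that mentions “draft”.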

There are actually many more things you can do with this powerful search tool. See the ElasticSearch advanced query syntax for details.

Getting your documents into Overview — the complete guide

The first and most common question from Overview users is how do I get my documents in? The answer varies depending on the format of your material. There are three basic paths to get documents into Overview: as multiple files, from a single CSV file, and via DocumentCloud. But there are several other tricks you might need, depending on your situation.

 

1. The simple case – a folder full of documents

Overview can directly read many file types, such as PDFs, Word, PowerPoint, RTF, text files, and so on. Choose “Upload document files” and add each document using the file selection dialog box.

If you are using the Chrome browser, you can also select entire folders at once. If you are not using Chrome, you can add all files in a folder in one step by clicking on the first file, then scrolling down and shift+clicking on the last file.

There are two potential snags: if you are uploading PDF files, they must be searchable. Overview will not OCR files that are scanned images. Also, you may need to split long documents into individual pages. See below for details on both of these cases.

 

2. Journalist collaboration – DocumentCloud project import

DocumentCloud is a collaborative document hosting service operated exclusively for professional journalists. It supports upload, OCR, search, annotation, and publishing of documents, which may be public or private. Overview can import a DocumentCloud project directly, and DocumentCloud’s annotation tools are available within the Overview document viewer. This method is a good choice if you have access to DocumentCloud and the document set isn’t too large, up to a few thousand documents.

 

3. Documents as data – a CSV file

Overview can read an entire document set as a single CSV file, with one row per document containing the document text and optional information such as URL, tags, and unique ID. This might sound more complex than it is; the simple format is documented here. A CSV file is a good choice if you need to bring data in from another application, such as a social media monitoring tool. You can also use a CSV file to import tags, which can be used to compare text to data.
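
As a rough illustration, a minimal CSV might look like the lines below. The exact column names Overview expects are spelled out in the format documentation linked above; these headers are only indicative.

    id,text,url,tags
    1,"Full text of the first document goes here...",http://example.com/doc1,"contractors,iraq"
    2,"Full text of the second document goes here...",http://example.com/doc2,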

 

4. Scanned images –  OCR the documents

Not all PDFs are created equal. Some are actually images, not text, which frequently happens when the source material was paper. More about that. To see if your PDF file actually contains text, try searching within the file or cutting and pasting text. If this doesn’t work, your documents need to be converted to text, a process that goes by the delightfully anachronistic name of Optical Character Recognition (OCR). You have several options:

  • Go through DocumentCloud, which has built-in OCR
  • Use a commercial package such as ABBYY or OmniPage
  • Use an online commercial service such as onlineocr.net
  • Use open-source software such as Tesseract
  • Use the docs2csv script, below
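
If you go the open-source route, a minimal Tesseract workflow looks something like the Python sketch below (this assumes the tesseract and poppler tools are installed, and the file names are placeholders):

    # Sketch: OCR a scanned PDF with Tesseract via pytesseract and pdf2image.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path("scanned.pdf", dpi=300)   # render each page as an image
    text = "\n\n".join(pytesseract.image_to_string(page) for page in pages)

    with open("scanned.txt", "w", encoding="utf-8") as out:
        out.write(text)                                  # searchable text, ready to import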

 

5. Everything is in one huge file — split the pages

Our users often receive their documents as a small number of huge files, thousands of pages each. This is especially common with FOIA requests, or when the source material is a stack of paper. In this case (after OCR, if necessary) you will want Overview to analyze your material at the page level, not the file level. Overview can split pages automatically when importing files.
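
Overview handles the splitting for you at import time, but if you prefer to split a large file yourself beforehand, a quick sketch with the pypdf library (file names are placeholders) would be:

    # Sketch: split one huge PDF into single-page PDF files.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("foia_release.pdf")
    for n, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        with open(f"page_{n:05d}.pdf", "wb") as out:
            writer.write(out)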

 

6. Convert a disk full of random files to a CSV — run the docs2csv script

Before Overview was able to read most file types, we wrote a little script called docs2csv. This Ruby script will automatically scan a folder for document files in PDF, text, Microsoft Word, Excel, PowerPoint, HTML, JPG, and other formats. It will also scan subfolders (if given the -r flag) and even automatically OCR any PDFs which do not contain text (if given the -o flag). JPG files are always OCR’d.

The result will be a CSV file you can import directly into Overview — or feed into your favorite data tool. Note that this will not preserve the formatting of the files, only the extracted text.
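
To make that output shape concrete, here is a heavily simplified Python illustration of the same folder-to-CSV idea. It is not the actual docs2csv script, and it handles only plain .txt files rather than the many formats the real script supports:

    # Simplified illustration only -- NOT the actual docs2csv script.
    # Walk a folder of .txt files and write one CSV row per document.
    import csv
    from pathlib import Path

    with open("documents.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "text"])
        for n, path in enumerate(sorted(Path("my_documents").rglob("*.txt")), start=1):
            writer.writerow([n, path.read_text(encoding="utf-8", errors="replace")])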

 

What is xkcd all about? Text mining a web comic

I recently ran into a very cute visualization of the topics of XKCD comics. It’s made using a topic modeling algorithm where the computer automatically figures out what topics xkcd covers, and the relationships between them. I decided to compare this xkcd topic visualization to Overview, which does a similar sort of thing in a different way (here’s how Overview’s clustering works).

Stand back, I’m going to try science!

Fortunately the source text file was already in exactly the right format for import. It took less than a minute to load and cluster these 1,299 docs.

The first cluster I found was all the “hat guy” comics. Overview’s phrase detection created a first-level folder for “hat guy” and also threw “beret” in there. Nice, but there’s a lot of non-hat related stuff in that folder too. This other material splits out into its own node two levels down, and seems to be comics about “guys” or “boys” and “girls.” That’s a pretty wide topic as opposed to hat guy comics (it includes a guy-girl duo Christmas special). I removed the guy-girl folder from the hat tag, and the result is shown in green below.

It’s fun to see exactly what each folder contains, because aside from the imported text descriptions of each comic there is conveniently a URL column in the source CSV, which becomes a clickable “source” link in the UI.

Another large first level folder (143 docs) contains comics about “graphs” or “axes”, “chart,” “lines,” etc. This one is a pretty clean folder, in that almost everything in it is one of xkcd’s charts, visualizations, maps, etc. or there is some sort of labelled schematic that appears in one of the panels, like this one. Overview was even able to separate out different types of charts, such as this folder which is mostly bar charts.

Then I started looking through smaller, lower-level folders. I quickly found a newscast folder. What’s interesting about this folder is that there is no one word in common between all the newscast comics. But these comics have enough overlap through terms like “news”, “anchor”, and “press” that they get grouped together anyway. I went through each of the 15 docs (open the first doc in the folder, keep pressing next using either the arrow or the “j” key, untag when appropriate) to get an idea of how coherent or not this cluster is. 10 of the 15 are newscasts, as you can see from the orange tag highlight on the node in this image.

The screenshot also shows the programming folder to the right of the newscast folder (11 docs). Again, there is no one term that appears across all these docs. If there were, Overview would label the node with “ALL: programmer” or something. Instead we get some “programmer” but also “code” and “algorithm” and “mobile.” Again Overview has succeeded in finding a concept even though the language is disparate.

Topic quality varies throughout the tree, with some tight, interpretable topics and also some large “miscellaneous” folders. Of course you can always type a word into the search field to see exactly where documents containing a particular word ended up in the tree. I put about 30 minutes into this and tagged about 400 of the 1,300 documents. (I could finish the job by using the new show untagged feature.) So we might get a pretty complete picture of what’s available in the xkcd universe in about 2 hours total. Of course if you need high precision on the tags on individual documents we have to manually check them (select the tag, then press “j” repeatedly to scan the docs quickly). Assigning tags to folders in Overview tends to over-tag somewhat because there is often some miscellaneous stuff in a folder.

How does Overview compare to other topic modeling algorithms?

Many folks have heard of topic modeling algorithms, which are different from but related to Overview’s text analysis. Topic modeling works by automatically assigning one of a predefined number of “topics” to each word in each document, whilst simultaneously figuring out which words should belong to which topic. There are many different topic modeling algorithms, but many are based on a technique called Latent Dirichlet Allocation (LDA). You can get a feel for what LDA does by doing it yourself with pen and paper.
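
If you would rather let the computer do it, a toy LDA run takes only a few lines with scikit-learn. This is a generic illustration with placeholder text, not the xkcd analysis discussed below:

    # Toy LDA run: fit a few topics and print the top words of each.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "guy in a hat gives terrible advice",     # placeholder transcripts;
        "a chart of the probability of charts",   # substitute your real corpus here
        "hat guy strikes again with a plan",
    ]

    counts = CountVectorizer(stop_words="english")
    X = counts.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

    words = counts.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [words[j] for j in topic.argsort()[::-1][:5]]
        print(f"topic {i}: {' '.join(top)}")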

My exploration of xkcd was inspired by a recent LDA analysis of the web comic by Carson Sievert. Here’s what that looks like, as a visualization of the extracted topics and their words (click for larger):

Overview doesn’t derive “topics” directly. Instead it uses multi-level document clustering algorithms based on a standard technique called tf-idf cosine similarity.  We do this because it’s simpler to implement, much faster to run on large document sets, and — we suspect — easier to interpret because each document gets placed in exactly one folder, whereas LDA assigns multiple topics per document. Arguably what Overview does is “topic modeling” since it tries to create topic-themed folders, but that name usually refers to LDA-type algorithms and  I’ve been wondering for some time how Overview’s clustering compares.

The “topics” of an LDA analysis are really just distributions of words, where some words are very common in that topic (perhaps “fish” if the topic is “the ocean”) and others are more rare. LDA topics correspond roughly to Overview’s folders, so let’s see how they compare. I was able to find a few points of reference in the LDA visualization above. Topic 1 seems to be all charts. Topic 17 has “hat” and “guy”, though I don’t see “beret” in there. There are many uninterpretable “miscellaneous” topics and a lot of seemingly random words in the tail of each topic. However, these words might make more sense if we could see the source comics easily from the interactive. LDA has many tuning parameters and algorithmic variants, and it’s possible that it might work especially well for other document sets; it seems to do a nice job on the Sarah Palin emails.

We’ve run into the problem of diversity: Randall Munroe writes about a huge range of different things, as defined by words and phrases that only appear in one or two comics. Also, many of the comics are hard to model since they have little text or feature only relatively generic words like “guy” and “woman.” This is actually a very common situation for document sets (or other high-dimensional data) and LDA and Overview deal with this heterogeneity in different ways. LDA seems to start “modeling the noise” by adding unrelated words to the words-in-a-topic distributions, while Overview ends up generating really miscellaneous folders that don’t resolve into a clear conceptual whole until several levels down the tree, or sometimes not at all.

Ultimately I don’t think the choice of text analysis algorithm is all that important, as long as you have one that works reasonably well. Topic modeling and document clustering are mathematically related anyway. The real trick in document mining is building a system that people can actually understand, trust, and use, as a recent paper from Stanford’s visualization lab makes wonderfully clear. Flexible document import, clear visualizations, rapid tagging, integrated search, easy document viewing —  text mining is about much more than algorithms. Still, we are always exploring new types of analysis and visualization for Overview, so it’s fun to see how different techniques compare.

New: Show all untagged documents

Overview’s tags help you keep track of where you’ve been. Now there’s an easy way to see where you  haven’t been: the Show Untagged button.

Show Untagged appears in the tag window at the bottom of the screen. When you press it, you’ll get a visual display of how many documents in each folder have no tag at all applied, as above (from our xkcd analysis.) This is very useful if you need to exhaustively explore your documents, just to be certain you haven’t missed anything. Of course exhaustive analysis doesn’t mean you must actually read every document. Instead, open each folder down to whatever level of detail you need in order to decide whether the material inside is relevant or not. When you’ve made that choice, tag and move on to the next folder.

Thanks to our users who asked for this feature. If there’s something that would really speed up your work, contact us.

Step-by-step instructions for using Overview

To get started using Overview, you can watch this video or follow the steps below.

 

1. Get your documents into Overview.

  • If all of your documents are in PDF form you can simply upload them directly. Note that documents scanned from paper must first be OCR’d to turn their images into searchable text.
  • If you are a journalist, you can upload your documents to DocumentCloud, a free tool to upload, OCR, search, store, and publish documents in many formats.
  • You can also upload your documents as a CSV file, a type of spreadsheet file that you can save from Excel, export from a database system, or create manually in a text editor. Here’s lots more on loading documents via CSV.

For more, see the complete guide to getting your documents into Overview.

A useful trick for uploading many documents simultaneously: when the file dialog box opens you can select all of the documents in a folder at once by clicking on the first file, then shift-clicking on the last file (or pressing Control-A on Windows, or Command-A on Mac).

Overview keeps all uploaded documents private, unless you share them explicitly.

 

2. Explore the documents in the tree view

Overview’s main screen is divided into four parts: the folder tree, search field, tag list, and document viewer.

You can navigate through the folders in the tree with the arrow keys, or by clicking. Each folder is labelled by the keywords that best describe the documents filed under that folder. The label also tells you if MOST, SOME, or ALL of the documents in that folder contain each keyword. A folder’s sub-folders contain, collectively, all of the documents in the parent folder, broken down into increasingly narrow topics.

The document viewer shows either a particular document or a list of selected documents. Each document in the list is summarized by a list of keywords specific to that document.

If you know what you’re looking for, enter your query in the “search” box and Overview will show you where documents containing that term appear in the tree.

The tree automatically expands and zooms to follow your selections. Or you can pan it by dragging with the mouse, and zoom using the +/- buttons or the mouse wheel. Folders marked with ⊕ can be expanded to show sub-folders, while ⊖ hides sub-folders.

 

3. Tag interesting documents
As you explore the folder tree, you’ll run across individual documents or entire folders you want to remember. Enter a descriptive tag in the “new tag” field and press “tag.” If you’re currently viewing a specific document, Overview will tag just that document. If instead you’re viewing the list of documents in a folder, Overview will tag the entire folder.

Here’s the complete guide to tagging, including how to export tags to create different types of visualizations.

Tags and folders have independent lives: each document can have any number of tags applied to it, and the same tag can be applied anywhere in the tree.

Once you’ve created a tag, you can add that tag to the current document or document list at any time by pressing the + button that appears when your mouse is over the tag name. Or press – to remove the tag.

Clicking on a tag name selects that tag, highlighting the tagged documents in the tree and loading them into the document list.

 

4. Work your way through the tree
When you have a lot of documents, it pays to be systematic. We recommend working your way through the folders in the tree from left to right — biggest folders to smallest folders. Select a folder then view a few of the documents in it to see if you understand what they have in common. If specific words appear in MOST or ALL documents in a folder, that’s a sign that the folder contains a single meaningful topic. Otherwise there may be more than one important topic in the documents in that folder, so try opening child folders instead until you find a folder where all of the documents are similar. Then tag that folder with a descriptive label.

Use search to find specific documents of interest, but pay attention to which folders contain those documents. You may find other relevant documents in the same folder, even if they don’t contain your search term.

As you proceed, you may find documents that talk about similar topics in different folders. Overview doesn’t know what you want out of your documents, so it can’t always guess how they should be arranged. You can apply a tag to any combination of folders and documents to create a set that is meaningful to you.

You may also discover that the documents in a folder are irrelevant to your work, in which case you can tag them with “read” and simply move on. Part of the power of Overview is being able to decide not to look at an entire folder.

When you’ve finished this process, you’ll have a neatly categorized tree and a set of tags corresponding to all the interesting topics in your documents.

 

5. Learn more!

Overview has many more powerful features: you can automatically split long documents into individual pages, ignore meaningless words, compare data to text, and many other things. See the help for more tips and tricks, or contact us to ask about your specific needs!

 

We’re hiring a front-end developer

The Overview Project, an open-source document mining system, is looking for a front-end developer.

Journalists are increasingly confronted with huge sets of documents that they have to understand quickly. These documents come from Freedom of Information requests, leaks, or open government sites, and consist of thousands or even millions of pages of disorganized documents in any file format. Overview is an open-source tool to help investigative journalists and other curious people find the essential information in a huge document dump.

The software analyzes the full text of each document using natural language processing techniques, automatically sorts documents into topics and sub-topics, and visualizes their content. It has been used to report on emails, declassified archives, tweets, and more.  Overview includes full text search, but unlike a search engine it is designed to help you find what you don’t even know you’re looking for.

We need an additional front-end developer on the team. Overview is written in Scala on the Play framework, with a Coffeescript front end. We’re looking for:

  • Solid JavaScript engineering experience, with modern tools such as jQuery, Coffeescript, and Backbone
  • Experience with a modern MVC web app architecture, such as Rails, Django, or Play
  • It’s open source! Are you good at supporting a developer community?
  • Design and usability sense;  you’ll be making many decisions at the intersection of beauty and function.
  • An understanding of web application architecture. Stuff like AWS and Postgres.
  • Bonus geek points: Experience in visualization, natural language processing, or distributed systems

You’ll be on a small team using agile processes, which means you’ll have a great deal of influence over the product and its architecture. Perks include travel to data journalism conferences and flexible working arrangements. New York City area preferred, but will consider remote. Mostly, we’re looking for someone who cares about making it easier for investigative journalists to do their job. Open data is great, but transparency means nothing if no one is watching.

Overview is an open-source project of the Associated Press, funded by a News Challenge grant from the Knight Foundation.

Please send resumes to jonathan@overviewproject.org

How to process documents that contain more than one language

Overview supports several different languages, but you can only pick one language per document set. Fortunately, there is an easy workaround to analyze a document set that contains several different languages.

The trick is to paste a stop words list into the “words to ignore” box. Stop words are the short, common, grammatical words in a language, such as “a” and “for” in English, or “un” and “soy” in Spanish. Overview automatically ignores the stop words from whatever language you tell it to use. This is necessary; otherwise you would always get a folder labelled “MOST: the” when processing English documents. Overview only removes stop words from one language at a time, but you can get exactly the same effect by pasting in stop words for other languages.

This trick can also be used to process documents in a language that Overview doesn’t officially support yet!

Suppose you have a document set containing English and French text. You can tell Overview that the documents are in English, then paste in a French stop words list in the “words to ignore” box. Separate the words with spaces or put them on different lines. The result should look like this:
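
For example, the start of a pasted French list might read (illustrative, not exhaustive):

    le la les un une des du de et ou mais donc ni car ce cette ces qui que dont ne pas plus je tu il elle nous vous ils elles est sont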

You can find stop words lists for many languages here. Simply cut and paste the words for the languages your documents include, as many different languages as you want. (There is no need to paste in stop words for the language you have told Overview to use, as the system adds those stop words automatically.)

This isn’t a perfect technique, because some stop words in one language can be legitimate words in another language, but it will get you 95% of the way there. Most importantly, it will allow you to use Overview on multi-language documents right now, before we develop a more integrated solution. As noted above, you can also process documents in any language, not just the ones Overview supports.