Visual document mining for journalists

Getting your documents into Overview — the complete guide

The first and most common question from Overview users is: how do I get my documents in? The answer depends on the format of your material. There are three basic paths to get documents into Overview: as multiple PDFs, from a single CSV file, and via DocumentCloud. There are also several other tricks you might need, depending on your situation.

1. The simple case – a folder full of PDFs

If your documents are searchable PDFs the process is easy. Choose “Upload PDF files” and add each document using the file selection dialog box. You can add all files in a folder in one step by clicking on the first file, then scrolling down and shift+clicking on the last file. There are two potential snags: the PDFs must be searchable, and individual documents must be in individual files. If your PDFs are scanned documents or all documents are in a small number of huge PDFs, see below.

 

2. Journalist collaboration – DocumentCloud project import

DocumentCloud is a collaborative document hosting service operated exclusively for professional journalists. It supports upload, OCR, search, annotation, and publishing of documents, which may be public or private. Overview can import a DocumentCloud project directly, and DocumentCloud’s annotation tools are available directly within the Overview document viewer. This method is a good choice if you have access to DocumentCloud and the document set isn’t too large, up to a few thousand documents.

 

3. Documents as data – a CSV file

Overview can read an entire document set as a single CSV file, with one row per document containing the document text and optional information such as URL, tags, and unique ID. This might sound more complex than it is; the simple format is documented here. A CSV file is a good choice if you need to bring data in from another application, such as a social media monitoring tool. You can also use a CSV file to import tags, which can be used to compare text to data.
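To make the one-row-per-document idea concrete, here is a minimal Python sketch that writes such a file. The “text” and “tags” column names match the ones used elsewhere on this blog; the “id” and “url” header names, the sample documents, and the function name are assumptions for illustration, so check the format documentation for the exact headers Overview expects.

```python
import csv
import io

# Hypothetical sample documents. The "id" and "url" header names are assumptions;
# "text" and "tags" follow the column names described in these posts.
documents = [
    {"id": "1", "text": "Budget memo, first draft.",
     "url": "http://example.com/1", "tags": "memo,budget"},
    {"id": "2", "text": "Meeting notes with \"quoted\" text,\nand a line break.",
     "url": "http://example.com/2", "tags": ""},
]

def documents_to_csv(docs):
    """Return CSV text with a header row and one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "text", "url", "tags"])
    writer.writeheader()
    writer.writerows(docs)
    return buffer.getvalue()

csv_text = documents_to_csv(documents)
print(csv_text)
```

Using the csv module rather than joining strings by hand matters here, because document text routinely contains commas, quotes, and line breaks that must be escaped.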

 

4. Scanned images –  OCR the documents

Not all PDFs are created equal. Some are actually images, not text, which frequently happens when the source material was paper. To see if your PDF file actually contains text, try searching within the file or cutting and pasting text. If this doesn’t work, your documents need to be converted to text, a process that goes by the delightfully anachronistic name of Optical Character Recognition (OCR). You have several options:

  • Go through DocumentCloud, which has built-in OCR
  • Use a commercial package such as ABBYY or OmniPage
  • Use an online commercial service such as onlineocr.net
  • Use open-source software such as Tesseract
  • Use the docs2csv script, below

 

5. Everything is in one huge PDF — split the pages

Our users often receive their documents as a small number of huge PDF files, thousands of pages each. This is especially common with FOIA requests, or when the source material is a stack of paper. In this case (after OCR, if necessary) you will want Overview to analyze your material at the page level, not the file level. Overview can split pages automatically when importing from DocumentCloud (we are working on adding this for PDF import).

 

6. A disk full of random files — run the docs2csv script

The most complex case is folders and sub-folders full of documents in many different file formats, some of which may need OCR. Eventually we hope to have Overview handle this material directly. Meanwhile, if you are willing to spend a little time at the command line, you can use the docs2csv script. This Ruby script will automatically scan a folder for document files in PDF, plain text, Microsoft Word, Excel, PowerPoint, HTML, JPG, and other formats. It will also scan subfolders (if given the -r flag) and even automatically OCR any PDFs that do not contain text (if given the -o flag). JPG files are always OCR’d.

The result will be a CSV file you can import directly into Overview. Only the extracted text of the document will be displayed, not the original document images. However, you can enable the “source file” links in the document view by running a local HTTP server, as described in the readme. This isn’t the simplest way to get your documents loaded, but it is by far the most flexible and powerful. It relies on the Apache Tika library to read documents, so it’s easy to add even more formats.
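If you only want a feel for what a folder-to-CSV conversion involves, the following Python sketch handles the simplest possible case: plain .txt files only, with no Tika and no OCR. It is not docs2csv and does not reproduce its behavior; the function name, the “id” and “url” headers, and the use of a relative path in the url column are all assumptions made for illustration.

```python
import csv
from pathlib import Path

def text_files_to_csv(folder, csv_path, recursive=False):
    """Write one CSV row per .txt file found in folder; return the row count."""
    folder = Path(folder)
    # "**/*.txt" also matches files directly inside folder, not just subfolders.
    pattern = "**/*.txt" if recursive else "*.txt"
    paths = sorted(p for p in folder.glob(pattern) if p.is_file())
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "text", "url"])  # header names are assumptions
        for number, path in enumerate(paths, start=1):
            text = path.read_text(encoding="utf-8", errors="replace")
            # The relative path is only a placeholder for a real source link.
            writer.writerow([number, text, path.relative_to(folder).as_posix()])
    return len(paths)

# Example call with hypothetical paths:
# text_files_to_csv("my_documents", "my_documents.csv", recursive=True)
```

The recursive argument plays the same role as the script's -r flag, but everything else about the real script (formats, OCR, link handling) is out of scope for this sketch.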

 

What is xkcd all about? Text mining a web comic

I recently ran into a very cute visualization of the topics of xkcd comics. It’s made using a topic modeling algorithm where the computer automatically figures out what topics xkcd covers, and the relationships between them. I decided to compare this xkcd topic visualization to Overview, which does a similar sort of thing in a different way (here’s how Overview’s clustering works).

Stand back, I’m going to try science!

Fortunately the source text file was already in exactly the right format for import. It took less than a minute to load and cluster these 1,299 docs.

The first cluster I found was all the “hat guy” comics. Overview’s phrase detection created a first-level folder for “hat guy” and also threw “beret” in there. Nice, but there’s a lot of non-hat related stuff in that folder too. This other material splits out into its own node two levels down, and seems to be comics about “guys” or “boys” and “girls.” That’s a pretty wide topic as opposed to hat guy comics (it includes a guy-girl duo Christmas special). I removed the guy-girl folder from the hat tag, and the result is shown in green below.

It’s fun to see exactly what each folder contains, because aside from the imported text descriptions of each comic there is conveniently a URL column in the source CSV, which becomes a clickable “source” link in the UI.

Another large first level folder (143 docs) contains comics about “graphs” or “axes”, “chart,” “lines,” etc. This one is a pretty clean folder, in that almost everything in it is one of xkcd’s charts, visualizations, maps, etc. or there is some sort of labelled schematic that appears in one of the panels, like this one. Overview was even able to separate out different types of charts, such as this folder which is mostly bar charts.

Then I started looking through smaller, lower-level folders. I quickly found a newscast folder. What’s interesting about this folder is that there is no one word in common between all the newscast comics. But these comics have enough overlap through terms like “news”, “anchor”, and “press” that they get grouped together anyway. I went through each of the 15 docs (open the first doc in the folder, keep pressing next using either the arrow or the “j” key, untag when appropriate) to get an idea of how coherent this cluster is. 10 of the 15 are newscasts, as you can see from the orange tag highlight on the node in this image.

The screenshot also shows the programming folder to the right of the newscast folder (11 docs). Again, there is no one term that appears across all these docs. If there were, Overview would label the node with “ALL: programmer” or something similar. Instead we get some “programmer” but also “code” and “algorithm” and “mobile.” Again Overview has succeeded in finding a concept even though the language is disparate.

Topic quality varies throughout the tree, with some tight, interpretable topics and also some large “miscellaneous” folders. Of course you can always type a word into the search field to see exactly where documents containing a particular word ended up in the tree. I put about 30 minutes into this and I’ve tagged about 400 of the 1,300 documents. (I could finish the job by using the new show untagged feature.) So we might get a pretty complete picture of what’s available in the xkcd universe in about 2 hours total. Of course, if you need high precision on the tags on individual documents, you have to manually check them (select tag, then press “j” repeatedly to scan the docs quickly). Assigning tags to folders in Overview tends to over-tag somewhat because there is often some miscellaneous stuff in a folder.

How does Overview compare to other topic modeling algorithms?

Many folks have heard of topic modeling algorithms, which are different from but related to Overview’s text analysis. Topic modeling works by automatically assigning one of a predefined number of “topics” to each word in each document, whilst simultaneously figuring out which words should belong to which topic. There are many different topic modeling algorithms, but many are based on a technique called Latent Dirichlet Allocation (LDA). You can get a feel for what LDA does by doing it yourself with pen and paper.

My exploration of xkcd was inspired by a recent LDA analysis of the web comic by Carson Sievert. Here’s what that looks like, as a visualization of the extracted topics and their words (click for larger):

Overview doesn’t derive “topics” directly. Instead it uses multi-level document clustering algorithms based on a standard technique called tf-idf cosine similarity.  We do this because it’s simpler to implement, much faster to run on large document sets, and — we suspect — easier to interpret because each document gets placed in exactly one folder, whereas LDA assigns multiple topics per document. Arguably what Overview does is “topic modeling” since it tries to create topic-themed folders, but that name usually refers to LDA-type algorithms and  I’ve been wondering for some time how Overview’s clustering compares.
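For readers who haven’t met tf-idf cosine similarity before, here is a small self-contained Python sketch of the general technique. It is not Overview’s implementation: the whitespace tokenizer, the particular tf × log(N/df) weighting, and the three toy documents are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    total = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({term: count * math.log(total / doc_freq[term])
                        for term, count in counts.items()})
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "hat guy steals the chart",
    "hat guy wears a beret",
    "bar chart of news anchors",
]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # shares "hat" and "guy"
print(cosine_similarity(vectors[0], vectors[2]))  # shares only "chart"
```

Words that appear in few documents get high weights, so two documents count as similar when they share distinctive vocabulary. A clustering step would then group documents whose pairwise similarity is high.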

The “topics” of an LDA analysis are really just distributions of words, where some words are very common in that topic (perhaps “fish” if the topic is “the ocean”) and others are more rare. LDA topics correspond roughly to Overview’s folders, so let’s see how they compare. I was able to find a few points of reference in the LDA visualization above. Topic 1 seems to be all charts. Topic 17 has “hat” and “guy”, though I don’t see “beret” in there. There are many uninterpretable “miscellaneous” topics and a lot of seemingly random words in the tail of each topic. However, these words might make more sense if we could see the source comics easily from the interactive. LDA has many tuning parameters and algorithmic variants, and it’s possible that it might work especially well for other document sets; it seems to do a nice job on the Sarah Palin emails.

We’ve run into the problem of diversity: Randall Munroe writes about a huge range of different things, as defined by words and phrases that only appear in one or two comics. Also, many of the comics are hard to model since they have little text or feature only relatively generic words like “guy” and “woman.” This is actually a very common situation for document sets (or other high-dimensional data) and LDA and Overview deal with this heterogeneity in different ways. LDA seems to start “modeling the noise” by adding unrelated words to the words-in-a-topic distributions, while Overview ends up generating really miscellaneous folders that don’t resolve into a clear conceptual whole until several levels down the tree, or sometimes not at all.

Ultimately I don’t think the choice of text analysis algorithm is all that important, as long as you have one that works reasonably well. Topic modeling and document clustering are mathematically related anyway. The real trick in document mining is building a system that people can actually understand, trust, and use, as a recent paper from Stanford’s visualization lab makes wonderfully clear. Flexible document import, clear visualizations, rapid tagging, integrated search, easy document viewing —  text mining is about much more than algorithms. Still, we are always exploring new types of analysis and visualization for Overview, so it’s fun to see how different techniques compare.

New: Show all untagged documents

Overview’s tags help you keep track of where you’ve been. Now there’s an easy way to see where you  haven’t been: the Show Untagged button.

Show Untagged appears in the tag window at the bottom of the screen. When you press it, you’ll get a visual display of how many documents in each folder have no tag at all applied, as above (from our xkcd analysis.) This is very useful if you need to exhaustively explore your documents, just to be certain you haven’t missed anything. Of course exhaustive analysis doesn’t mean you must actually read every document. Instead, open each folder down to whatever level of detail you need in order to decide whether the material inside is relevant or not. When you’ve made that choice, tag and move on to the next folder.

Thanks to our users who asked for this feature. If there’s something that would really speed up your work, contact us.

Step-by-step instructions for using Overview

To get started using Overview, you can watch this video or follow the steps below.

 

1. Get your documents into Overview.

  • If all of your documents are in PDF form you can simply upload them directly. Note that documents scanned from paper must first be OCR’d to turn their images into searchable text.
  • If you are a journalist, you can upload your documents to DocumentCloud, a free tool to upload, OCR, search, store, and publish documents in many formats.
  • You can also upload your documents as a CSV file, a type of spreadsheet file that you can save from Excel, export from a database system, or create manually in a text editor. Here’s lots more on how to prepare a CSV file for Overview.

For more, see the complete guide to getting your documents into Overview.

A useful trick for uploading many documents at once: when the file dialog box opens you can select all of the documents in a folder by clicking on the first file, then shift-clicking on the last file (or pressing Control-A on Windows, or Command-A on Mac).

Overview keeps all uploaded documents private, unless you share them explicitly.

 

2. Explore the documents in the tree view

Overview’s main screen is divided into four parts: the folder tree, search field, tag list, and document viewer.

You can navigate through the folders in the tree with the arrow keys, or by clicking. Each folder is labelled by the keywords that best describe the documents filed under that folder. The label also tells you if MOST, SOME, or ALL of the documents in that folder contain each keyword. A folder’s sub-folders contain, collectively, all of the documents in the parent folder, broken down into increasingly narrow topics.
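One way to picture the MOST, SOME, and ALL labels is as a statement about what share of a folder’s documents contain a keyword. The Python sketch below illustrates that idea only; the cutoff between SOME and MOST (more than half) and the function name are assumptions, not Overview’s actual rules.

```python
def keyword_coverage_label(keyword, folder_documents):
    """Label a lowercase keyword by the share of documents that contain it."""
    if not folder_documents:
        return None
    hits = sum(1 for text in folder_documents if keyword in text.lower().split())
    share = hits / len(folder_documents)
    if share == 1.0:
        return "ALL"
    if share > 0.5:  # assumed cutoff, for illustration only
        return "MOST"
    if share > 0:
        return "SOME"
    return None  # keyword does not appear in this folder

folder = ["the news anchor", "press news", "anchor desk"]
for word in ["news", "press", "desk"]:
    print(word, keyword_coverage_label(word, folder))
```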

The document viewer shows either a particular document or a list of selected documents. Each document in the list is summarized by a list of keywords specific to that document.

If you know what you’re looking for, enter your query in the “search” box and Overview will show you where documents containing that term appear in the tree.

The tree automatically expands and zooms to follow your selections. Or you can pan it by dragging with the mouse, and zoom using the +/- buttons or the mouse wheel. Folders marked with ⊕ can be expanded to show sub-folders, while ⊖ hides sub-folders.

 

3. Tag interesting documents
As you explore the folder tree, you’ll run across individual documents or entire folders you want to remember. Enter a descriptive tag in the “new tag” field and press “tag.” If you’re currently viewing a specific document, Overview will tag just that document. If instead you’re viewing the list of documents in a folder, Overview will tag the entire folder.

 

Tags and folders have independent lives: each document can have any number of tags applied to it, and the same tag can be applied anywhere in the tree.

Once you’ve created a tag, you can add that tag to the current document or document list at any time by pressing the + button that appears when your mouse is over the tag name. Or press – to remove the tag.

Clicking on a tag name selects that tag, highlighting the tagged documents in the tree and loading them into the document list.

 

4. Work your way through the tree
When you have a lot of documents, it pays to be systematic. We recommend working your way through the folders in the tree from left to right — biggest folders to smallest folders. Select a folder then view a few of the documents in it to see if you understand what they have in common. If specific words appear in MOST or ALL documents in a folder, that’s a sign that the folder contains a single meaningful topic. Otherwise there may be more than one important topic in the documents in that folder, so try opening child folders instead until you find a folder where all of the documents are similar. Then tag that folder with a descriptive label.

Use search to find specific documents of interest, but pay attention to which folders contain those documents. You may find other relevant documents in the same folder, even if they don’t contain your search term.

As you proceed, you may find documents that talk about similar topics in different folders. Overview doesn’t know what you want out of your documents, so it can’t always guess how they should be arranged. You can apply a tag to any combination of folders and documents to create a set that is meaningful to you.

You may also discover that the documents in a folder are irrelevant to your work, in which case you can tag them with “read” and simply move on. Part of the power of Overview is being able to decide not to look at an entire folder.

When you’ve finished this process, you’ll have a neatly categorized tree, and a set of tags corresponding to all the interesting topics in your documents.

 

5. Learn more!

Overview has many more powerful features: you can automatically split long documents into individual pages, ignore meaningless words, compare data to text, and many other things. See the help for more tips and tricks, or contact us to ask about your specific needs!

 

We’re hiring a front-end developer

The Overview Project, an open-source document mining system, is looking for a front-end developer.

Journalists are increasingly confronted with huge sets of documents that they have to understand quickly. These documents come from Freedom of Information requests, leaks, or open government sites, and consist of thousands or even millions of pages of disorganized documents in any file format. Overview is an open-source tool to help investigative journalists and other curious people find the essential information in a huge document dump.

The software analyzes the full text of each document using natural language processing techniques, automatically sorts documents into topics and sub-topics, and visualizes their content. It has been used to report on emails, declassified archives, tweets, and more.  Overview includes full text search, but unlike a search engine it is designed to help you find what you don’t even know you’re looking for.

We need an additional front-end developer on the team. Overview is written in Scala on the Play framework, with a Coffeescript front end. We’re looking for:

  • Solid JavaScript engineering experience, with modern tools such as jQuery, Coffeescript, and Backbone
  • Experience with a modern MVC web app architecture, such as Rails, Django, or Play
  • It’s open source! Are you good at supporting a developer community?
  • Design and usability sense;  you’ll be making many decisions at the intersection of beauty and function.
  • An understanding of web application architecture. Stuff like AWS and Postgres.
  • Bonus geek points: Experience in visualization, natural language processing, or distributed systems

You’ll be on a small team using agile processes, which means you’ll have a great deal of influence over the product and its architecture. Perks include travel to data journalism conferences and flexible working arrangements. New York City area preferred, but will consider remote. Mostly, we’re looking for someone who cares about making it easier for investigative journalists to do their job. Open data is great, but transparency means nothing if no one is watching.

Overview is an open-source project of the Associated Press, funded by a News Challenge grant from the Knight Foundation.

Please send resumes to jonathan@overviewproject.org

How to process documents that contain more than one language

Overview supports several different languages, but you can only pick one language per document set. Fortunately, there is an easy workaround to analyze a document set that contains several different languages.

The trick is to paste a stop words list into the “words to ignore” box. Stop words are the short, common, grammatical words in a language such as “a” and “for” in English, or “un” and “soy” in Spanish. Overview automatically ignores the stop words from whatever language you tell it to use. This is necessary; otherwise you would always get a folder labelled “MOST: the” when processing English documents. Overview only removes stop words from one language at a time, but you can get exactly the same effect by pasting in stop words for other languages.

This trick can also be used to process documents in a language that Overview doesn’t officially support yet!

Suppose you have a document set containing English and French text. You can tell Overview that the documents are in English, then paste in a French stop words list in the “words to ignore” box. Separate the words with spaces or put them on different lines. The result should look like this:

You can find stop words lists for many languages here. Simply cut and paste the words for the languages your documents include, as many different languages as you want. (There is no need to paste in stop words for the language you have told Overview to use, as the system adds those stop words automatically.)
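If you are combining lists for several languages, a few lines of Python can merge them into one space-separated string ready to paste. This is only a convenience sketch: the function name is made up, and the two short lists are tiny illustrative samples, not complete stop word lists.

```python
def build_ignore_list(*stop_word_lists):
    """Merge stop word lists into one space-separated string, dropping duplicates."""
    seen = set()
    merged = []
    for words in stop_word_lists:
        for word in words:
            word = word.strip().lower()
            if word and word not in seen:
                seen.add(word)
                merged.append(word)
    return " ".join(merged)

# Tiny illustrative samples; real lists are much longer.
french = ["le", "la", "les", "un", "une", "et"]
spanish = ["el", "la", "los", "un", "una", "y"]
print(build_ignore_list(french, spanish))
# prints: le la les un une et el los una y
```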

This isn’t a perfect technique, because some stop words in one language can be legitimate words in another language, but it will get you 95% of the way there. Most importantly, it will allow you to use Overview on multi-language documents right now, before we develop a more integrated solution. As noted above, you can also process documents in any language, not just the ones Overview supports.

 

PDF upload: the easiest way yet to get your documents into Overview

Your big pile of documents might arrive in many different forms, from a stack of paper to an archive of random files. But PDF is a popular document file format that every reporter, researcher, or analyst has to work with sooner or later. Now you can upload PDFs directly into Overview.

Just choose “Upload PDF files” from the “Import Documents” menu, then “Add files” to open a file selection box. You’ll want to upload more than one file of course, so you can select all files in a directory by pressing Control-A (Windows) or ⌘-A (Mac). Or you can select multiple specific files using the keyboard and mouse in the usual way (if you’re not familiar with how to do that, here are instructions for Windows and Mac).

You can press “Add Files” as many times as you like. Overview will begin uploading files as soon as they are added, and then proceed to clustering and visualization when you press “Done Adding Files” to set the import options (such as language and words to ignore).

Overview treats each file as a “document,” so if you have one long PDF with many documents within it, you will want to split them first. There are several free tools to do this, both web-based and command line. We will soon add the ability to split documents into pages automatically, as is already possible when importing from DocumentCloud. Also, Overview does not (yet!) do OCR, which is the process of making a scanned image searchable. If you can search your PDFs or cut and paste text from them, they do not need OCR and Overview will be able to handle them. Otherwise, you can use commercial products such as ABBYY and OmniPage to do the OCR on your own computer before uploading.

Other ways to get your documents into Overview

You can still import documents in several other ways:

  • You can import a DocumentCloud project
  • You can upload a CSV file. CSV files are a general data transfer format that most applications can write, including SQL databases, Microsoft Excel, and social media data sources like DataSift or Radian 6.
  • We’ve also written a powerful script which will scan a folder for documents in many different formats, automatically OCR if needed, and produce a CSV file for Overview. Check out docs2csv if this fits your problem.

 

Documents not sorted the way you’d like? Try ignoring words

Overview categorizes documents by looking for those words that best separate one group of documents from all the others. If many documents contain the word “cat” but most do not, Overview will create a “cat” folder.  This works great until it doesn’t — maybe you don’t care about cats. That’s why  you can tell Overview to ignore any specific word when sorting your documents.

You can do this by typing in the words to ignore, in the import options.

There are many cases where Overview might sort documents based on words you don’t care about:

  • emails might be sorted based on who sent them, when you’re more interested in what’s in them
  • forms might be sorted based on the questions, rather than the answers
  • documents might be sorted based on administrative blather (versions, copyright, disclaimers…)
  • if you got the documents by searching for A, B, and C, they might end up just sorted into A, B, and C folders
  • maybe you just don’t care about X, but Overview made a folder for it anyway

Overview is prone to making these mistakes because it examines every single word (and every two-word phrase) when deciding how to file a document. An alternative approach is to use entity extraction, which only looks for recognizable people, places, organizations, dates, etc. But entity extraction is often unreliable and doesn’t capture all sorts of other meaningful words, like verbs. You probably want to know when many documents have the words “paid” or “killed” in them.

More fundamentally, Overview doesn’t understand your story. It really has no idea what a “meaningful” organization of the documents might be. All it can do is find patterns of language usage between documents. Not only does the computer lack basic human understanding, even a very smart computer can’t get inside your head: what is interesting depends on what you, the analyst, think is interesting.

It might seem like the answer is to tell Overview what you care about, but one of the central ideas of Overview is that you shouldn’t have to know what’s in the documents before you look at them — otherwise it’s not possible to discover the unexpected. Instead, if you think that Overview is over-emphasizing the obvious or the trivial, you can tell the computer what isn’t important.
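Conceptually, ignoring a word just means dropping it before any analysis happens, so it can never become the feature that separates one folder from another. The Python sketch below shows that idea with a naive whitespace tokenizer; the function name, the tokenizer, and the sample email are assumptions, and Overview’s own text processing may differ.

```python
def tokenize(text, ignore=()):
    """Lowercase and split text, dropping any word in the ignore collection."""
    ignored = {word.lower() for word in ignore}
    return [word for word in text.lower().split() if word not in ignored]

email = "From: alice Subject: the payment was approved"
print(tokenize(email))
print(tokenize(email, ignore=["from:", "subject:", "alice"]))
# second call prints: ['the', 'payment', 'was', 'approved']
```

With the sender and header words removed, only the content words remain to drive the sorting, which is the effect described for emails above.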

 

Comparing text to data by importing tags

Overview sorts documents into folders based on the topic of each document, as determined by analyzing every word in each document. But it can also be used to see how the document text relates to the date of publication,  document type, or any other field related to each document.

This is possible because Overview can import tags. To use this feature, you will need to get your documents into a CSV file, which is a simple rows-and-columns spreadsheet format. As usual, the text of each document goes in the “text” column. But you can also add a “tags” column, which gives the tag or tags to be initially assigned to each document, separated by commas if more than one.

To demonstrate, let’s look at a portion of the Afghanistan War Logs. The original CSV file has over 70,000 documents, each of which has many columns as described in the header row:

uid,date,type,category,tracking number,title,text,region,attack on, ...

Looking at the data, the “type” field takes on only a few different values, such as “enemy action” and “explosive hazard.” Let’s use Overview to see how the content of each report — the actual text — aligns with the report type.

To do this, I edited the first row in the CSV file to change the “type” field to a “tags” field:

uid,date,tags,category,tracking number,title,text,region,attack on, ...

Rather than trying to analyze several years’ worth of data at once, I also used a simple script to filter the rows by date, extracting the 3,078 documents from July 2009. (Overview currently has a limit of 50,000 documents per document set, and anyway it’s often useful to take specific subsets of big sets for close analysis.)
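For anyone wanting to do something similar, here is a minimal Python sketch of a filter of this kind. It is not the script I used: the function name is made up, and it assumes the “date” column holds values that begin with a year-month prefix such as “2009-07”, which you should check against your own file. It also performs the header rename from “type” to “tags” described above.

```python
import csv

def filter_month(in_path, out_path, month_prefix, date_column="date"):
    """Copy rows whose date starts with month_prefix; rename 'type' to 'tags'."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        date_index = header.index(date_column)
        # Only the header changes; the values in that column become the tags.
        writer.writerow(["tags" if name == "type" else name for name in header])
        kept = 0
        for row in reader:
            # Assumes dates look like "2009-07-01 ..." (an assumption).
            if row[date_index].startswith(month_prefix):
                writer.writerow(row)
                kept += 1
    return kept

# Example call with hypothetical file names:
# filter_month("warlogs.csv", "warlogs-2009-07.csv", "2009-07")
```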

You can get this final edited file here. When it is loaded into Overview, the incident types automatically appear as tags.

Here I’ve selected the “Explosive Hazard” tag. You can see that most of the documents with this tag appear on the right side of the tree. But Overview doesn’t look at the tags when sorting documents into folders, just the text. Therefore, there is a pattern in how the text of a report relates to its type. More precisely, there is a correlation between the “text” and “type” fields.

It’s pretty easy to understand why in this case. If you look at the folders on the right side of the tree, you’ll see they are labelled with words like “IED” and “found.” The authors of the reports used different language to describe incidents that involved an explosive device, relative to incidents that did not. Conversely, the documents tagged “Enemy Action” mostly end up on the left side of the tree. The other categories have much smaller numbers of documents and tend to appear grouped together in small folders much farther down the tree.

You can use these imported tags for several purposes:

  • to find where certain types of documents ended up in the tree
  • to determine what type of documents are in a particular folder of interest
  • to check that Overview is dividing your documents into meaningful folders
  • to see the relationships between  text and data

This last use — looking for correlations between text and data — is a powerful possibility. For example, you could feed publication year into the “tags” field to analyze how document topics changed over the decades.  Or you could use a “sex” tag to see if documents about men are different than documents about women. The possibilities are endless.

 

Overview now supports multiple languages!

By popular demand, Overview’s natural language processing now supports Spanish, French, German, and Swedish. [Update: now also Dutch, Italian, Arabic and Russian]

You can select your language as part of the document set import options, like this:

(You can also select the language when importing CSV files.)

Selecting the language changes the list of “stop words” that Overview uses during its text processing, that is, words like “the” in English that don’t tell you anything about the contents of the document. Because of this, it’s fairly easy to add new languages — contact us if you’ve got a set of documents in some language not supported here, and we can probably get a new language up and running in a day or two.

Note that this is the actual document processing, and the user interface will still be in English. But the application itself could appear in other languages too, if someone wants to translate the string file.

See also: how to import documents that contain more than one language.