Visualization to connect the dots

What’s coming next for Overview

We’ve been doing a lot of user testing recently, and for May and June we’re focusing on integrating your feedback. Lots of big stuff is coming in the next month or so.

Integrated search.

This is high on the list of user requests. Overview will shortly have a full-text search field: type in your search word or phrase, and the system will highlight all the documents in the tree that contain that term, and load those documents into the document list. If you find something interesting, you can turn your search results into a tag with one click.

New document list.

It hasn’t always been obvious that the lower left panel of the main screen is a list of documents, showing all the documents in the currently selected folder or tag. Nor has it been clear how to go through the documents in this list quickly. (At the moment, the fastest way is to use the j and k keys.) We’re going to have a much more descriptive list of documents, including title, text snippet, and suggested tags (see below) for each document in the list.

Suggested tags.

Currently, each folder and document is described by a list of characteristic words, the words that make that folder or document different from all the others. We’re expanding this concept into a list of suggested tags, with a super fast way to apply them to a document, or all documents in the list.

Tablet support — and a new look!

Any way you slice it, going through a document set involves a lot of reading. We’re optimizing the UI for reading on tablets, with a new full screen document mode, and an interface designed for touch. Plus, we’re completely redesigning the site with a clean look that’s easy on the eyes.

We’re very excited about these upcoming changes, because they will make the system much faster and easier to use. The easiest way to know when they arrive is to follow @overviewproject on Twitter.

VIDEO: Text Analysis in Transparency – a talk at Sunlight Labs

Video of talk at Sunlight Labs at the Sunlight Foundation, Washington DC.

This talk is about how text analysis and natural language processing are being used in journalism, open government, and transparency generally.

The first part of the talk is a survey of existing public projects, and the algorithms behind them, including

  • Churnalism detects plagiarism in the news (or press releases!)
  • Many Bills automatically classifies the sections of bills, to detect pork barrel projects
  • Docket Wrench analyzes the comments on proposed regulations
  • NewsDiffs watches for changes in published articles
  • FEC Standardizer automatically cleans campaign donor names
  • MemeTracker tracks political quotes across the whole web, as they mutate

And of course, there’s a brief demonstration of Overview, and a discussion of the algorithms behind it. (First time here? See how Overview works for investigative journalists)

Finally, there’s a great discussion of where data-driven transparency is going now: what should we work on next? How do we know we are working on the right data sets and the right tools? How can we evaluate the impact of transparency projects? The talk ends with a throwdown: the Transparency Grand Challenge!

How Overview can organize thousands of documents for a reporter

Before computers, all document-driven stories started with a big stack of paper. Often, the first task was to organize all that paper, by sorting individual documents into piles by type. This gives journalists a high-level idea of “what’s in there” and helps them decide what to read more closely — and just as importantly, what isn’t worth reading.

Today a computer can organize your documents for you. That stack of paper may now be a folder full of PDF files, but it still doesn’t come with any sort of built-in index or obvious categorization system. This is exactly the problem that Overview solves: It splits documents into piles based on their subjects, and then splits each pile into even more specific sub-piles, and so on. The result is a tree of folders.

Above is part of the folder tree that Overview automatically built for the 6,849 documents containing every mention of the city of “Caracas” within the diplomatic cables released by Wikileaks. (Click on the image for a larger version.) Overview labels each folder by the key words in the documents inside. The top folder here has words like PDVSA (the Venezuelan state oil company), “oil”, “billion,” “company” and “production,” so it’s mainly documents concerning the oil industry and other big business. Other top-level folders in this document set (not shown) concern embassy politics, elections, and military operations.

The top-level folder about oil splits into two sub-folders. The one on the left concerns the oil industry specifically, while the one on the right is more about banks and finance. The oil industry folder splits further into regional issues (the Petrocaribe consortium) and documents about PDVSA specifically. Each folder splits into smaller and smaller sub-folders, each of which contains a smaller number of documents on a more specific topic. To let you know when the documents in a folder are getting very specific, Overview tells you when “MOST” or “ALL” of the documents in that folder contain a particular word.

How a computer understands topics

When I show this to reporters, their first question is always, how does the computer do that? It’s more than curiosity: If you’re going to rely on a computer to organize your documents, you’re asking a machine to help you decide what you should and shouldn’t read. The integrity of the reporting process demands that we understand what our algorithms are doing.

All document categorization algorithms are based on the ability to compare two documents to tell how similar they are. A group of documents which are all very similar to one another belong in the same folder. Computers don’t understand human language, so they need a simple mechanical process which takes two documents as input — literally just the sequence of words that make up the text of each document — and generates a number which is small if the documents are very different, and large if the documents concern the same topic.

Some text analysis systems, such as Open Calais, are based on “named entity recognition,” which extracts people, places, organizations, dates, etc. from the documents. Then, we can say that two documents are similar if they talk about the same entities. This is useful, but such systems will miss important generic words like “oil” and “production.” Instead, Overview examines every word of every document. In a sense, it reads the full text, so you don’t have to.

Comparing two documents based on their full text

Suppose we have filed a FOIA request for a classified storybook for the children of CIA operatives, and after a long legal battle the government has given us copies of these three secret documents:

  • “The cat sat on the mat. Then the cat chased the rat.”
  • “The cat slept all day on the mat.”
  • “The rat ran across the floor.”

First, Overview strips capitalization, punctuation, and the grammar words such as “the,” “a,” “on,” etc. These words, also called stop words in natural language processing, aren’t useful for determining the topic of the text, because they appear in almost every document. This leaves us with:

  • “cat sat mat cat chased rat”
  • “cat slept all day mat”
  • “rat ran across floor”

You can see that most of the sense of the document is still there, despite removing lots of words. Then, Overview counts how many times each word appears in each document, producing a word frequency table, like this:

This throws out the order of the words, which means the computer can’t understand the difference between “soldiers shot civilians” and “civilians shot soldiers.” This may seem very simplistic, but surprisingly, decades of information retrieval research show that word order usually doesn’t matter when all you want to know is the topic of a document.

Then Overview compares every pair of documents to check how similar they are. It does this by counting the number of words which appear in both documents, but with a twist: If a word appears twice in one document, it’s counted twice. In other words, we multiply the frequencies of corresponding words, then add up the results. This is the final similarity score.

In this case, the two documents about the cat have a similarity of 3: “cat” appears twice in the first document and once in the second (2 × 1 = 2), plus “mat” appears once in each document (1 × 1 = 1). The document about the rat has no words in common with the document about the cat sleeping on the mat, so the similarity score for that pair is zero.
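
The counting and comparison steps described here can be sketched in a few lines of Python. This is a minimal illustration, not Overview’s actual code: the stop-word list is an assumption chosen only to reproduce the three cleaned-up documents in this example, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed stop-word list, chosen only to reproduce the cleaned-up example text.
STOP_WORDS = {"the", "a", "on", "then"}

def word_counts(text):
    """Lowercase the text, drop punctuation and stop words, and count each word."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def similarity(a, b):
    """Multiply the frequencies of corresponding words, then add up the results."""
    return sum(count * b[word] for word, count in a.items())

docs = [
    "The cat sat on the mat. Then the cat chased the rat.",
    "The cat slept all day on the mat.",
    "The rat ran across the floor.",
]
counts = [word_counts(d) for d in docs]

for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i + 1, j + 1, similarity(counts[i], counts[j]))
```

Because only the counts are kept, shuffling the words of any document leaves every score unchanged, which is the insensitivity to word order discussed earlier.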

Documents which are similar enough end up in the same folder, and the folder is labelled by the words which make those documents different from all the others. In this case, the folder is labelled by “cat” and “mat” because those words don’t appear in the remaining document about the rat.

And that’s the heart of it. This description omits a number of details for simplicity, but includes all of the things a reporter needs to know:

  • Overview uses the full text of each document.
  • It is not sensitive to word order.
  • Documents with overlapping words are placed in the same folder.

If you’d like to understand the process more deeply, here are a few more details: Overview actually processes text in two-word bigrams, not just single words, so it can detect people’s names and other short phrases. Rather than just simple term counts, it weights each word by how rare it is in the document set overall, using a classic formula called TF-IDF. And to generate the folders, given the similarity between every pair of documents, Overview uses k-means clustering, splitting folders recursively at each level of the tree.
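
To make the bigram and TF-IDF ideas concrete, here is a rough Python sketch. It uses one common textbook form of TF-IDF (term count times the logarithm of the number of documents divided by the number of documents containing the term). The exact variant Overview uses is not given in this post, so treat the formula and the function names as assumptions. The clustering step is left out.

```python
import math
from collections import Counter

def terms(words):
    """Return the single words plus every pair of adjacent words (bigrams)."""
    return words + [" ".join(pair) for pair in zip(words, words[1:])]

def tf_idf(docs):
    """docs is a list of word lists. Returns one {term: weight} dict per document."""
    term_counts = [Counter(terms(d)) for d in docs]
    n = len(docs)
    # Number of documents each term appears in (each document counts once).
    doc_freq = Counter(t for c in term_counts for t in c)
    return [
        {t: tf * math.log(n / doc_freq[t]) for t, tf in c.items()}
        for c in term_counts
    ]

docs = [
    ["cat", "sat", "mat", "cat", "chased", "rat"],
    ["cat", "slept", "all", "day", "mat"],
    ["rat", "ran", "across", "floor"],
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1])[:5]:
    print(term, round(weight, 3))
```

With this weighting, a term that appears in every document gets a weight of zero, while a term found in only one document gets the largest boost.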

Try it on your own documents

Overview is available for free at overviewproject.org. It can automatically import your projects from the popular DocumentCloud repository, which also handles document upload, OCR, and other tasks. Or, you can upload a CSV file if your text is already in spreadsheet or database format. It also works great on social media data, such as a collection of tweets or blog posts.

You can learn to use Overview by watching a short video on the help page, or viewing the webinar recorded at Poynter’s NewsU.

Dealing with massive PDFs by splitting them into pages

Our users frequently face a situation where the natural document boundaries are lost. If we’re going to be analyzing 50,000 emails, ideally each email would be stored in its own file. But very often, the source material arrives as a series of massive PDF files, each of which may be thousands of pages long. Or the documents may arrive as a big stack of paper, which becomes a single massive PDF after scanning.

It can be challenging to recover the original document boundaries within a file where everything runs together. For my story on private security contractors in Iraq, I solved this problem with a custom script that tried to detect cover pages. This was a difficult and time-consuming solution, but fortunately there is an easy trick that works well in most cases: split each long document into pages, and then have Overview sort the large number of pages, not the small number of original documents.

Starting today, Overview will do this automatically when importing from DocumentCloud, when you select the new “split each document into pages” checkbox:

AP reporter Jack Gillum was the first to suggest this trick, and used it when analyzing 9,000 pages of documents concerning then-Vice Presidential candidate Paul Ryan. In that case, there were many different kinds and lengths of documents within the huge stack of paper he received. Manual splitting was out of the question because it would have been much too time consuming, and there was no easy way to automate the task. Making Overview sort “pages” instead of “documents” was the simple solution, and it worked great.

This might seem counter-intuitive; how could it possibly work to ignore the boundaries of the original documents? But in fact “document” is a somewhat vague term. When we’re looking through a lot of material, we need to choose a “unit of analysis,” and the ideal unit might not be a “document,” especially if the documents are long. For example, we might want to analyze books at the level of chapters, or contracts at the level of paragraphs, or legislation at the level of sections. A “page” may not conform to any such natural division, but it’s a conveniently sized chunk of text to compare to other, similarly sized chunks; if you can imagine that it would be useful to break the source material into pages and then sort the pages by topic, then this splitting trick will work. It’s not perfect, but it’s simple, fast, and works on any type or length of document.

For the moment, this feature only works when importing a project from DocumentCloud. If you are using Overview without DocumentCloud, then it’s up to you to split your source material into documents in whatever way makes sense, placing one “document” in each row of the input CSV. If your source material comes in PDF form, you may want to check out DocSplit.
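
If you are doing the split yourself, the idea can be sketched in Python as follows. This assumes you already have the extracted text of each long document, and that page breaks are marked with a form-feed character, as some text-extraction tools do. Both are assumptions, and the function name is hypothetical.

```python
import csv
import io

def pages_to_csv(documents, out):
    """Write one CSV row per page under a single "text" column.

    documents: list of extracted texts in which pages are separated
    by form-feed characters (an assumption about the extraction tool).
    """
    writer = csv.writer(out)
    writer.writerow(["text"])
    for text in documents:
        for page in text.split("\f"):
            if page.strip():  # skip blank pages
                writer.writerow([page.strip()])

out = io.StringIO()
pages_to_csv(["Cover letter\fAttachment, page 1\fAttachment, page 2"], out)
print(out.getvalue())
```

When writing to a real file instead of an in-memory buffer, open it with newline="" so that the csv module controls the line endings.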

How to use Overview to analyze social media posts

Even when 10,000 people post about the same topic, they’re not saying 10,000 different things. People talking about an event will focus on different aspects of it, or have different reactions, but many people will be saying pretty much the same thing. People posting about a product might be talking about one or another of its features, the price, their experience using it, or how it compares to competitors. Overview’s document sorting system groups similar posts so you can find these conversations quickly, and figure out how many people are saying what.

This post explains how to use Overview to quickly figure out what the different “conversations” are, how many people are involved in each, and how they overlap.

Step 1: Get the data into Overview

Overview does not capture or scrape social media. Fortunately, there are lots of other monitoring tools to do that, such as Radian 6, Sysomos, and Datasift. Overview can read data exported from each of these tools as CSV files. This is a simple file format used by spreadsheet programs, so if you can open it in Excel, you should be able to import it to Overview. Or, you can create your own files in this format.

Then upload the file into Overview. Select “Import a New Document Set” as usual, then choose “CSV upload” and select your file. Overview will check the file for errors and show a preview of the file, like this:

If it all looks good, click Upload.

Step 2: Explore your documents

Overview reads the full text of each post — not just names and keywords — and groups similar documents together into different folders. Then it takes each folder, and groups similar documents into sub-folders, and so on. The folder tree shows how Overview has organized your posts.

There won’t always be strictly one type of “conversation” per folder, because conversation is a fuzzy concept, and Overview doesn’t know how you want to categorize things. But all the documents in the folder will be the same in some way — or if the folder seems to include too many different types of posts, try exploring the sub-folders, which have narrower topics.

As you select each folder, Overview shows a list of the posts in that folder in the bottom left. Each post is summarized by the most “characteristic” words in that post — not the words that are most common, but the words that make the documents in that folder different from the documents in other folders.

It’s the tags you create that define which conversations you’re interested in. These might be very broad categories such as “positive” and “negative,” or much more specific tags such as “didn’t like the color.” Generally, you’ll explore the documents top-to-bottom and left-to-right, inventing and assigning tags as you go. You can assign tags to one document at a time if you like, but usually you’ll want to assign a tag to all documents in the folder at the same time. To do this, select the folder, then move the mouse over the tag and press the + button.

Tags are non-exclusive, meaning that each document can have as many tags as you like.

Step 3: Export your data

Now that you’ve seen what’s in your data, and tagged it for future reference, you probably want to use these results in some way. You can see how many documents have each tag just by selecting that tag. Or, you can export your entire document set, including the tags you just added. The export button is on the document set list page. It creates a spreadsheet (CSV file) and gives you a choice of how to format your tags: all in one column, or each in its own column.

Each document will be one row in the spreadsheet. If you put all tags in one column they will be separated by commas, like “Sunshine, Positive.” But putting each tag in its own column makes it easy to do things like find all the documents that have a specific tag, or take column totals to get the number of documents with each tag, which is useful for making a visualization of your results.
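
If you chose the single-column format, counting documents per tag takes only a few lines of Python. The sketch below is illustrative only: the column names (“id”, “text”, “tags”) and the sample rows are made up, so check the header of your own export before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical export with all tags in one column; real column names may differ.
exported = io.StringIO(
    'id,text,tags\n'
    '1,Lovely weather today,"Sunshine, Positive"\n'
    '2,Rain again,Negative\n'
    '3,Sun is out,"Sunshine, Positive"\n'
)

tag_counts = Counter()
for row in csv.DictReader(exported):
    for tag in row["tags"].split(","):
        if tag.strip():  # ignore empty entries for untagged documents
            tag_counts[tag.strip()] += 1

for tag, count in tag_counts.most_common():
    print(tag, count)
```

The resulting counts can be pasted into a spreadsheet or charting tool to draw a simple bar chart of conversations.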

We’re hiring a front-end developer

The Overview Project, an open-source document mining system, is looking for a front-end developer.

Overview is a reading machine, a way to make sense of a large volume of documents, whether those are emails, public records, or tweets. Originally designed for investigative journalists to analyze document dumps, it is also used by social media agencies, non-profits, and others who need to understand what thousands or millions of people are saying. This web-based software reads the full text of every document, then automatically visualizes the documents as categories and sub-categories.

There is no faster way to understand what’s in those 10,000 pages you got back from a FOIA request, or those 100,000 tweets you painstakingly captured. Unlike a search engine, Overview helps you find what you don’t even know you’re looking for. Overview is a project of the Associated Press, funded by a Knight News Challenge grant. The product is in beta at www.overviewproject.org.

We need an additional front-end developer on the team. Overview is written in Scala on the Play framework, with a Coffeescript front end. We’re looking for:

  • Solid JavaScript engineering experience, with frameworks such as jQuery, Coffeescript, and Backbone
  • Experience with a modern MVC web app architecture, such as Rails, Django, or Play
  • Familiarity with open source development; both the tools (e.g. Github) and the culture
  • Design and usability sense; we have a designer, but you’ll still be making many decisions at the intersection of beauty and function.
  • An understanding of web application architecture, including server operation and SQL database design
  • Experience in visualization, natural language processing, or distributed systems a plus.

Perks include travel to data journalism conferences and flexible working arrangements. New York City area preferred, but will consider remote.

But mostly, we’re looking for someone who wants to build something they care about. Open data is great, but transparency means nothing if no one is watching. Our goal is to make it easy and cheap to sort through large volumes of unstructured text.

Send resumes to jonathan@overviewproject.org

Using Overview without DocumentCloud, and how to import text from a CSV

Overviewproject.org can now import a document set from a CSV file. This is a standard, simple format for tabular data that many programs can read and write. For example, Excel can save a spreadsheet as a CSV file. This allows use of Overview without uploading the documents to DocumentCloud, and makes it much easier to import data from sources such as Twitter.

There are two basic steps to loading a CSV file into Overview: get your data in the correct format, and upload it. You may also need to extract text from a set of PDFs.

Getting your data into the correct format

A CSV file is simply a list of “comma-separated values,” organized into rows and columns, like a spreadsheet or a table. The file starts with a list of the column names, separated by commas. This is followed by each row of data, one row per line, with the values for each column again separated by commas. Overview only requires one column, which must be named “text.” Here is an example file:

text
This is the content of the first document.
And here is the text of document the second
Document three talks about quick brown foxes.
.
.

If the text of a document spans multiple lines, or itself contains commas, then it needs to be quoted. Quotes inside a quoted document must be “escaped” by doubling them, so that each quote character is written twice. This is all standard CSV stuff, and any program or library that writes CSVs should do it automatically.

text
"This document is really long and crosses multiple lines and contains
commas, which is why it is quoted."
"This is the second document. I'd like to say ""Hi!"" to everyone
to show how to put quotes inside a quoted document. The text of
this document can cross as many lines as needed, or even contain
blank lines like this:

The second document ends with this final quote."
The third document fits all on one line so no quotes needed.
"The fourth document has a comma in it, so it's quoted too."
.
.
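
As one illustration of a library handling the quoting for you, here is a short sketch using Python’s standard csv module. The document texts are made-up sample data. The only thing taken from the format described here is the single “text” column.

```python
import csv
import io

documents = [
    "This document is really long and crosses multiple lines\n"
    "and contains commas, which is why it is quoted.",
    'I\'d like to say "Hi!" to everyone.',
    "The third document fits all on one line so no quotes needed.",
]

out = io.StringIO()
writer = csv.writer(out)   # default settings quote a field only where required
writer.writerow(["text"])  # the one column Overview requires
for doc in documents:
    writer.writerow([doc])

print(out.getvalue())
```

The first two documents come out wrapped in quotes, with the inner quotes doubled, while the third is written as-is. To write to a file on disk, pass a file opened with newline="" in place of the in-memory buffer.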

And that’s it. Overview will display the text in the viewer pane when you click on each document. If you want to display something else for the document, you can add a “url” column which tells Overview to load a particular web page when you view that document. For security reasons, this must be an https URL. Here’s an example using tweets:

text,url
New deploy today -- cleaner clustering, better handling of larger document sets. Anyone got a pile of PDFs they want to look at? Try it!,https://twitter.com/overviewproject/status/281075194557259777
"""“I’m not going to sit out on the newsroom floor and sort pages into stacks of documents"" ~@jackgillum on need for document mining software.",https://twitter.com/overviewproject/status/264450385928929280
.
.

It’s also possible to add a unique ID column, simply named “id”. Overview will read this value and associate it with the document, and it is how documents will be referenced when you export tags (coming soon).
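
A file with all three columns can be produced along the following lines. This is a sketch under assumptions: the helper name and the “tweet-1” ID are invented for illustration, and the https check simply mirrors the requirement stated above rather than anything Overview itself exposes.

```python
import csv
import io

def write_rows(rows, out):
    """Write dicts with "id", "text" and "url" keys as CSV.

    Raises ValueError if a url is not an https URL.
    """
    writer = csv.DictWriter(out, fieldnames=["id", "text", "url"])
    writer.writeheader()
    for row in rows:
        if not row["url"].startswith("https://"):
            raise ValueError("url must be https: " + row["url"])
        writer.writerow(row)

out = io.StringIO()
write_rows(
    [
        {
            "id": "tweet-1",  # hypothetical ID, any unique value will do
            "text": "Anyone got a pile of PDFs they want to look at? Try it!",
            "url": "https://twitter.com/overviewproject/status/281075194557259777",
        }
    ],
    out,
)
print(out.getvalue())
```

Checking the URLs before upload catches plain http links early, instead of discovering them when a document fails to display.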

Uploading your CSV file to Overview

First, select the upload option from the main document set list page:

Then choose a file. Overview will show a preview and do some basic checks to ensure that the format is OK. It should look like this:

You can also tell Overview what character encoding the file uses. Try changing this if you see funny square characters in the preview, or accents aren’t displaying right. Then hit upload, and away we go. You can use Overview as usual on the document set.

Creating a CSV for Overview from a collection of PDFs

Overview does not currently support viewing a collection of PDFs without DocumentCloud in an integrated way. However, there is a workaround, based on a tool from the prototype version. You will need some familiarity with the command line to do this. First install Git and download the prototype, then use the loadpdf script to extract the text from a folder full of PDF files and create a CSV suitable for uploading into Overview. This process is described in the documentation for the prototype.

Unfortunately you will not be able to view the original PDFs within Overview without putting them on a web server somewhere and then modifying the URL column to point to the location of each document. We’re working on it.

Document mining shows Paul Ryan relying on the programs he criticizes

One of the jobs of a journalist is to check the record. When Congressman Paul Ryan became a vice-presidential candidate, Associated Press reporter Jack Gillum decided to examine the candidate through his own words. Hundreds of Freedom of Information requests and 9,000 pages later, Gillum wrote a story showing that Ryan has asked for money from many of the same Federal programs he has criticized as wasteful, including stimulus money and funding for alternative fuels.

This would have been much more difficult without special software for journalism. In this case Gillum relied on two tools: DocumentCloud to upload, OCR, and search the documents, and Overview to automatically sort the documents into topics and visualize the contents. Both projects are previous Knight News Challenge winners.

But first Gillum had to get the documents. As a member of Congress, Ryan isn’t subject to the Freedom of Information Act. Instead, Gillum went to every federal agency — whose files are covered under FOIA — for copies of letters or emails that might identify Ryan’s favored causes, names of any constituents who sought favors, and more.

Bit by bit, the documents arrived on paper. The stack grew over weeks, eventually piling up two feet high on Gillum’s desk. Then he scanned the pages and loaded them into the AP’s internal installation of DocumentCloud. The software converts the scanned pages to searchable text, but there were still 9,000 pages of material.

That’s where Overview came in. Developed in house at the Associated Press, this open-source visualization tool processes the full text of each document and clusters similar documents together, producing a visualization that graphically shows the contents of the complete document set.

“I used Overview to take these 9,000 pages of documents, and knowing there was probably going to be a lot of garbage or extra attachments, to separate the chaff from the wheat,” said Gillum. Much of Ryan’s correspondence is standard congressional work, communicating with constituents about their particular problems and issues. “I could figure out where are the letters from voters, and to put these documents in groups. So if someone’s complaining about the FCC, and there are 200 pages about that, we can put that aside.”

DocumentCloud supports key word search, but search won’t always tell the full story. First, much of the material was of such low quality, such as copies of faxed letters, that the OCR process that converts a scanned image into searchable text often produced incorrect results. This means that a literal search will miss documents. Second, searching will not help you find stories that you don’t know you are looking for, a problem that gets worse as the number of documents grows. You need something like a table of contents to avoid that problem, which is what Overview provides.

In this case, Overview was able to group letters signed by Ryan, by recognizing certain standardized language in the header and footer, even when that text was sometimes garbled by the OCR process. “It found a cluster of the documents that Ryan had written over his signature,” said Gillum.

Tools like DocumentCloud and Overview are rapidly becoming essential as reporters are forced to deal with ever increasing amounts of information. It is not unusual for a single request for government files to produce thousands or even tens of thousands of pages of material, far too much to read exhaustively.

“Using these sorts of tools is essential as we go forward, looking at big document sets, to provide readers with some insight into how government works,” said Gillum.

“I’m not going to sit out on the newsroom floor and sort pages into stacks of documents,” he said.

VIDEO: document mining with Overview

With the release of the new, web-only version of Overview that runs in your browser, we thought it was time to make a little video showing how to use it.

If that doesn’t answer your questions, see also the help page, and the FAQ.

How to Use Overview to Explore A Document Set

It takes just a few minutes to start exploring your documents in Overview. Overview depends on DocumentCloud to store, OCR, and publish documents, so you will need a DocumentCloud account (here’s how to get one.)

1. Batch upload your documents to a DocumentCloud project
Log in to your DocumentCloud account. Create a project to store all of your files, using the “New Project” button. Then select “New Documents.” Now here’s the trick to batch uploads: when the file dialog box opens, you can select all of the documents in a folder simultaneously by clicking on the first, then shift-clicking on the last (or pressing Control-A on Windows, or Command-A on Mac). You can keep the documents private if you like.

2. Log into Overview and import your project
Go to overviewproject.org and log in, or create an account. Select “import your project from DocumentCloud” and enter your DocumentCloud username and password when prompted. Your DocumentCloud projects will appear. Select the project that you want to explore, and get a coffee while Overview imports and analyzes it.

3. Explore the documents in the tree view

Overview’s main screen is divided into four parts: the topic tree, the tag list, the document list, and the document viewer.

The topic tree view displays your documents sorted into the topics and sub-topics that Overview has automatically created for your documents. The big node at the top contains all documents. It splits into several smaller nodes below, each of which contains documents on similar topics. The nodes are different sizes, because sometimes Overview finds many documents on a similar topic, while in other cases a document is so unique that Overview puts it into a node all by itself.

You can pan the tree left and right by dragging with the mouse, or moving the scroll bar. You can zoom into the tree by using the mouse wheel, two fingers on the trackpad, or dragging one end of the scroll bar. Nodes which have a small ⊕ in the center can be expanded to show children, while ⊖ hides children.

Each node is labelled by the top keywords from the documents in that node. These words tell you the topic of the node. The children of a node contain, collectively, all of the documents in the parent, but broken down into more specialized topics.

When you select a node, the documents in it appear in the document list. Each document in the list is represented by a list of keywords specific to that document. Clicking on a document on the list loads it in the document viewer.

4. Tag interesting documents
As you explore the topic tree, you’ll run across individual documents or entire nodes you want to remember. Enter a descriptive tag in the “new tag” field and press “tag.” The currently selected documents will be tagged, and a little tag color swatch will appear next to them in the document list.

Once you’ve created a tag, you can add the currently selected documents to it at any time by pressing the + button that appears when your mouse is over the tag name. (To tag an entire node at once, select the node and then press the +/- button.) Or press – to remove the tag.

Clicking on a tag name selects that tag, highlighting the tagged documents in the tree and loading them into the document list.

5. Work your way through the tree
When you have a lot of documents, it pays to be systematic. We recommend working your way through the nodes in the tree from left to right — biggest topics to smallest topics. Select a node, then view a few of the documents in it to see if you understand what they have in common. If there seems to be more than one important topic in the documents in that node, try opening up the child nodes instead, until you find a node where all of the documents are more or less the same. Then, tag that node with a descriptive label.

As you proceed, you may find documents in the same topic in different nodes. Overview doesn’t know what story you are working on, so it can’t always guess how the documents should be arranged. You can apply a tag to any combination of nodes and documents to create a set that is meaningful to you.

You may also discover that the documents in a node are irrelevant to your story, in which case you can tag them with “read” and simply move on. Part of the power of Overview is being able to decide not to look at an entire topic.

When you’ve finished this process, you’ll have a neatly categorized tree, and a set of tags corresponding to all the interesting topics in your documents.

6. Ask for help!
Questions? Bugs? Something you’d like to see in a future version? Contact us!