Getting your documents into Overview — the complete guide

The first and most common question from Overview users is: how do I get my documents in? The answer depends on the format of your material. There are three basic paths to get documents into Overview: as multiple files, from a single CSV file, and via DocumentCloud. There are also several other tricks you might need, depending on your situation.

 

1. The simple case – a folder full of documents

Overview can directly read many file types, such as PDFs, Word, PowerPoint, RTF, text files, and so on. Choose “Upload document files” and add each document using the file selection dialog box.

If you are using the Chrome browser, you can also select entire folders at once. If you are not using Chrome, you can add all files in a folder in one step by clicking on the first file, then scrolling down and shift+clicking on the last file.

There are two potential snags. If you are uploading PDF files, they must be searchable: Overview will not OCR files that are scanned images. Also, you may need to split long documents into individual pages. See below for details on both of these cases.

 

2. Journalist collaboration – DocumentCloud project import

DocumentCloud is a collaborative document hosting service operated exclusively for professional journalists. It supports upload, OCR, search, annotation, and publishing of documents, which may be public or private. Overview can import a DocumentCloud project directly, and DocumentCloud’s annotation tools are available directly within the Overview document viewer. This method is a good choice if you have access to DocumentCloud and the document set isn’t too large (up to a few thousand documents).

 

3. Documents as data – a CSV file

Overview can read an entire document set as a single CSV file, with one row per document containing the document text and optional information such as URL, tags, and unique ID. This might sound more complex than it is; the simple format is documented here. A CSV file is a good choice if you need to bring data in from another application, such as a social media monitoring tool. You can also use a CSV file to import tags, which can be used to compare text to data.
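
As a rough illustration of the one-row-per-document layout, the Python sketch below builds a small CSV in memory. The column names (id, text, url, tags) and the comma-separated tag convention are assumptions made for this example, not confirmed Overview headers, so check the format documentation before relying on them.

```python
import csv
import io

# Hypothetical documents; the field names are assumptions for illustration,
# not Overview's confirmed column headers.
documents = [
    {"id": "1", "text": "First memo.\nIt spans two lines.",
     "url": "http://example.com/1", "tags": "memo,2013"},
    {"id": "2", "text": 'A tweet with "quotes", and a comma.',
     "url": "", "tags": "social"},
]

def documents_to_csv(docs):
    """Return CSV text with a header row and one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "text", "url", "tags"])
    writer.writeheader()
    # The csv module quotes fields containing commas, quotes, or newlines,
    # so multi-line document text still occupies a single logical row.
    writer.writerows(docs)
    return buffer.getvalue()

csv_text = documents_to_csv(documents)
```

Using a CSV library rather than joining strings by hand matters here, because document text routinely contains commas, quotation marks, and line breaks.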

 

4. Scanned images – OCR the documents

Not all PDFs are created equal. Some are actually images, not text, which frequently happens when the source material was paper. To see if your PDF file actually contains text, try searching within the file or copying and pasting text from it. If this doesn’t work, your documents need to be converted to text, a process that goes by the delightfully anachronistic name of Optical Character Recognition (OCR). You have several options:

  • Go through DocumentCloud, which has built-in OCR
  • Use a commercial package such as ABBYY or OmniPage
  • Use an online commercial service such as onlineocr.net
  • Use open-source software such as Tesseract
  • Use the docs2csv script, below

 

5. Everything is in one huge file — split the pages

Our users often receive their documents as a small number of huge files, thousands of pages each. This is especially common with FOIA requests, or when the source material is a stack of paper. In this case (after OCR, if necessary) you will want Overview to analyze your material at the page level, not the file level. Overview can split pages automatically when importing files.
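
Overview does this split for you on import, but the idea is easy to sketch if you are preparing text yourself. The Python sketch below assumes you already have the extracted text of one large file, with pages separated by form feed characters (the page-break marker that tools such as pdftotext emit). The function name and the ID scheme are hypothetical.

```python
def split_into_pages(file_name, full_text):
    """Turn one long extracted text into one record per non-empty page."""
    pages = []
    for page_number, page_text in enumerate(full_text.split("\f"), start=1):
        page_text = page_text.strip()
        if not page_text:
            # Skip blank pages and the empty chunk after a trailing form feed.
            continue
        pages.append({
            "id": f"{file_name}-p{page_number}",  # hypothetical ID scheme
            "text": page_text,
        })
    return pages

sample = "Cover letter\fBudget table for 2012\f\fClosing remarks\f"
pages = split_into_pages("foia-response.pdf", sample)
```

Page numbers are kept from the original file, so a blank third page in the sample means the records are numbered 1, 2, and 4.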

 

6. Convert a disk full of random files to a CSV — run the docs2csv script

Before Overview was able to read most file types, we wrote a little script called docs2csv. This Ruby script automatically scans a folder for document files in PDF, plain text, Microsoft Word, Excel, PowerPoint, HTML, JPG, and other formats. It will also scan subfolders (if given the -r flag) and even automatically OCR any PDFs that do not contain text (if given the -o flag). JPG files are always OCR’d.

The result will be a CSV file you can import directly into Overview — or feed into your favorite data tool. Note that this will not preserve the formatting of the files, only the extracted text.
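
The core idea, walking a folder and emitting one CSV row per file, can be sketched in a few lines of Python. This is not the docs2csv script itself: it only handles plain .txt files, performs no OCR, and the function name and column names are assumptions for illustration.

```python
import csv
from pathlib import Path

def text_files_to_csv(folder, csv_path, recursive=False):
    """Write one CSV row per .txt file found in folder; return the row count."""
    folder = Path(folder)
    # "**/" makes the glob descend into subfolders as well.
    pattern = "**/*.txt" if recursive else "*.txt"
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "title", "text"])  # assumed column names
        for count, path in enumerate(sorted(folder.glob(pattern)), start=1):
            # Replace undecodable bytes rather than failing on one bad file.
            text = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow([count, str(path.relative_to(folder)), text])
    return count
```

Calling text_files_to_csv("my_docs", "my_docs.csv", recursive=True) roughly mirrors the subfolder scanning that the -r flag enables, with the file's relative path standing in as a title.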

 
