The first and most common question from Overview users is “how do I get my documents in?” The answer depends on the format of your material. There are three basic paths to get documents into Overview: as multiple PDFs, from a single CSV file, and via DocumentCloud. But there are several other tricks you might need, depending on your situation.
1. The simple case – a folder full of PDFs
If your documents are searchable PDFs, the process is easy. Choose “Upload PDF files” and add each document using the file selection dialog box. You can add all files in a folder in one step by clicking on the first file, then scrolling down and shift+clicking on the last file. There are two potential snags: the PDFs must be searchable, and each document must be in its own file. If your PDFs are scanned documents, see section 4 below; if all your documents are in a small number of huge PDFs, see section 5.
2. Journalist collaboration – DocumentCloud project import
DocumentCloud is a collaborative document hosting service operated exclusively for professional journalists. It supports upload, OCR, search, annotation, and publishing of documents, which may be public or private. Overview can import a DocumentCloud project directly, and DocumentCloud’s annotation tools are available directly within the Overview document viewer. This method is a good choice if you have access to DocumentCloud and the document set isn’t too large, up to a few thousand documents.
3. Documents as data – a CSV file
Overview can read an entire document set as a single CSV file, with one row per document containing the document text and optional information such as URL, tags, and unique ID. This might sound more complex than it is; the simple format is documented here. A CSV file is a good choice if you need to bring data in from another application, such as a social media monitoring tool. You can also use a CSV file to import tags, which can be used to compare text to data.
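If your documents already live in another program, a few lines of Python are usually enough to produce such a file. The sketch below is only an illustration: the column names (id, text, url, tags) and the comma-separated tag cell are assumptions, so check them against the format documentation before importing. The function name write_overview_csv is made up for this example.

```python
import csv


def write_overview_csv(docs, out_path):
    """Write one CSV row per document.

    docs is an iterable of dicts with a required 'text' key and optional
    'id', 'url' and 'tags' (a list of strings) keys.
    NOTE: the column names below are assumed, not taken from the
    official format description.
    """
    fields = ["id", "text", "url", "tags"]
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for number, doc in enumerate(docs, start=1):
            writer.writerow({
                # Fall back to a running number when no ID is supplied.
                "id": doc.get("id", number),
                "text": doc["text"],
                "url": doc.get("url", ""),
                # Assumption: several tags share one cell, comma-separated.
                "tags": ",".join(doc.get("tags", [])),
            })
```

The csv module takes care of quoting, so document text containing commas, quotation marks, or line breaks still ends up in a single cell. A call might look like write_overview_csv([{"text": "First document", "tags": ["memo"]}], "docs.csv").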
4. Scanned images – OCR the documents
Not all PDFs are created equal. Some are actually images, not text, which frequently happens when the source material was paper. To see if your PDF file actually contains text, try searching within the file or cutting and pasting text. If this doesn’t work, your documents need to be converted to text, a process that goes by the delightfully anachronistic name of Optical Character Recognition (OCR). You have several options:
- Go through DocumentCloud, which has built-in OCR
- Use a commercial package such as ABBYY FineReader or OmniPage
- Use an online commercial service such as onlineocr.net
- Use open-source software such as Tesseract
- Use the docs2csv script, below
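If you have many files, trying a search in each one by hand gets tedious. The "does this PDF contain text?" check can be roughly automated. The sketch below assumes the third-party pypdf package is installed for the extraction step; the 20-character threshold and the "more than half the pages" rule are arbitrary guesses, not anything Overview itself uses, and both function names are invented for this example.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of every page in a PDF.

    Requires the third-party pypdf package, imported here so that the
    heuristic below can be used on its own.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def needs_ocr(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is image-only.

    Returns True when more than half of the pages yield fewer than
    min_chars_per_page characters of text (an arbitrary threshold).
    """
    if not page_texts:
        return True
    nearly_empty = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    return nearly_empty > len(page_texts) / 2
```

Calling needs_ocr(extract_page_texts("scan.pdf")) gives a quick yes or no for one file. Treat the answer as a hint only: a scanned document with a typed cover sheet, or a text PDF full of blank pages, can fool it.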
5. Everything is in one huge PDF – split the pages
Our users often receive their documents as a small number of huge PDF files, thousands of pages each. This is especially common with FOIA requests, or when the source material is a stack of paper. In this case (after OCR, if necessary) you will want Overview to analyze your material at the page level, not the file level. Overview can split pages automatically when importing from DocumentCloud. We are working on adding this for PDF import.
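Until then, one workaround is to split a large PDF into single-page files yourself and upload those through the PDF path. The following is a minimal sketch, not part of Overview: it assumes the third-party pypdf package, and the function names and the output naming scheme (release_p0007.pdf and so on) are made up for this example.

```python
from pathlib import Path


def page_filename(pdf_path, page_number, total_pages):
    """Build a zero-padded name such as 'release_p0007.pdf' so that the
    pages sort in their original order."""
    stem = Path(pdf_path).stem
    width = len(str(total_pages))
    return f"{stem}_p{page_number:0{width}d}.pdf"


def split_pdf(pdf_path, out_dir):
    """Write every page of pdf_path as its own PDF inside out_dir.

    Requires the third-party pypdf package. Returns the list of files
    written.
    """
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(pdf_path))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    total = len(reader.pages)
    written = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        target = out_dir / page_filename(pdf_path, number, total)
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Splitting only rearranges pages; it does not add a text layer, so scanned files still need OCR before or after this step.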
6. A disk full of random files – run the docs2csv script
The most complex case is folders and sub-folders full of documents in many different file formats, some of which may need OCR. Eventually we hope to have Overview handle this material directly. Meanwhile, if you are willing to spend a little time at the command line you can use the docs2csv script. This Ruby script will automatically scan a folder for document files in PDF, txt, Microsoft Word, Excel, PowerPoint, HTML, JPG, and other formats. It will also scan subfolders (if given the -r flag) and even automatically OCR any PDFs which do not contain text (if given the -o flag). JPG files are always OCR’d.
The result will be a CSV file you can import directly into Overview. Only the extracted text of each document will be displayed, not the original document images. However, you can enable the “source file” links in the document view by running a local http server, as described in the readme. This isn’t the simplest way to get your documents loaded, but it is by far the most flexible and powerful. It relies on the Apache Tika library to read documents, so it’s easy to add even more formats.