How about… Tabula?

We have all been there: trying to copy and paste tables out of PDF files. It’s stressful, exhausting and in many cases mission impossible. There are different ways to get the data out of a PDF file, though, and one of them is the simple yet powerful open-source tool Tabula.

How to get started

Tabula - no installation needed

You can download Tabula as a desktop application and start using it immediately; no installation is necessary. An important note in advance: Tabula only works with text-based PDFs. Scanned documents or files created from images will not work (unless you apply OCR while scanning or as an intermediate step).

Your first step is to upload the file you want to extract data from. Tabula can seem a bit slow at times. Speed depends on the size of your file, but even so it is generally not the fastest tool in the shed. When the upload is complete, you can browse through the document and either use the “autodetect tables” feature or manually highlight the relevant elements. The automatic detection didn’t work well in our test runs, as it often failed to cover the whole table or all of the tables, so you are better off selecting the desired data manually.

Selecting the appropriate tables from the PDF

The preview of the extracted data gives you the opportunity to fix errors, revise the selected cells and add or remove selections. Once you’re done with the selection, there are two ways to extract the data: the Stream method and the Lattice method. These are two different algorithms that reconstruct the table structure from the data you want to extract. The Lattice method uses the ruling lines on the page, whereas the Stream method relies on the positions of the actual words. Usually, Tabula runs both algorithms automatically and chooses the one that fits better. Sometimes the numbers are still not displayed correctly, though, so it is advisable to switch between the two methods and see which one works better.
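
To make the Stream idea more tangible, here is a minimal toy sketch in Python. It is not Tabula's actual implementation; the `Word` type, the tolerances and the sample coordinates are all assumptions made up for illustration. It only shows how a grid can be rebuilt from word positions alone, without looking at any lines on the page.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word on the page (hypothetical units)
    y: float  # vertical position of the word, growing downwards


def cluster(values, tolerance):
    """Group sorted values whose neighbours lie within `tolerance`."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tolerance:
            groups[-1].append(v)
        else:
            groups.append([v])
    # Represent each group by its smallest value.
    return [g[0] for g in groups]


def nearest(anchors, value):
    """Index of the anchor closest to `value`."""
    return min(range(len(anchors)), key=lambda i: abs(anchors[i] - value))


def words_to_table(words, x_tol=5.0, y_tol=2.0):
    """Rebuild a grid of cells purely from word positions."""
    cols = cluster([w.x for w in words], x_tol)
    rows = cluster([w.y for w in words], y_tol)
    table = [["" for _ in cols] for _ in rows]
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        r, c = nearest(rows, w.y), nearest(cols, w.x)
        table[r][c] = (table[r][c] + " " + w.text).strip()
    return table


# Made-up sample data: two columns, three rows, slightly misaligned words.
words = [
    Word("Country", 10, 100), Word("Population", 120, 100),
    Word("Austria", 10, 115), Word("8.9", 121, 115),
    Word("Belgium", 11, 130), Word("11.5", 120, 130),
]
table = words_to_table(words)
for row in table:
    print(row)
```

A lattice-style approach would instead look for the drawn lines and use their intersections as cell borders, which is why it tends to suit tables with a visible grid, while a position-based approach like the one sketched here suits tables that are laid out with white space only.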


Clearing up

Happy with what you see? Then you can start exporting. Tabula converts the data into your preferred format, such as CSV, TSV, JSON or Script, and drops the result in your download folder. If you’re into data scraping, you will most likely appreciate the more sophisticated formats such as Script or JSON. For everyone else who prefers what-you-see-is-what-you-get, CSV will do best. Just open the file in Excel and you will immediately see the tables as shown in the preview. In most of our test runs the extracted data was quite accurate and needed only minimal adjustment, such as removing empty columns or fixing other minor errors within the tables. Still, a second look is definitely advisable.

Et voilà - the (almost) clean data from the PDF.

What we think about it

Tabula is a neat little tool. It’s easy to use and does what it’s supposed to do. It might not be perfect and could offer a few more options (like eliminating the need for intermediate OCR steps with other tools), but it’s definitely better than copying and pasting tables from PDFs manually.

What we like…
  • Relatively easy to handle. You do not have to know how to code in order to work with the data (but you definitely need to know how to use Excel at least).
  • Selecting the data manually is quite easy.
  • You can always go one step back and revise the selected data.

…and dislike
  • Sometimes Tabula is a bit slow. The developers recommend using tabula-extractor instead, which is a lot harder to use for beginners, especially for those without any knowledge of coding.
  • Tabula cannot extract data from scanned PDF documents.

