How about… Tabula?
We have all been there: trying to copy and paste tables out of PDF files. It’s stressful, exhausting and in many cases a mission impossible. There are different ways to get the data out of a PDF file, though. One of them is the simple yet powerful open-source tool Tabula.
How to get started
You can download Tabula as a desktop application and start using it right away; no installation is necessary. One important note in advance: Tabula only works with text-based PDFs. Scanned documents or files created from images will not work unless you run OCR while scanning or as an intermediate step.
Your first step is to upload the file you want to extract data from. Tabula can feel a bit slow at this stage. Speed depends on the size of your file, but in general it is not the fastest tool in the shed. Once the upload is complete, you can browse through the document and either use the “autodetect tables” feature or manually highlight the relevant elements. In our test runs the automated detection did not work so well, as it failed to cover the whole table or all of the tables, so you are better off selecting the desired data manually.
Happy with what you see? Then you can start exporting. Tabula converts the data into your preferred format, such as CSV, JSON, TSV or Script, and drops the result in your download folder. If you’re into data scraping, you will most likely appreciate the more sophisticated formats such as Script or JSON. For everyone else who prefers what-you-see-is-what-you-get, CSV will do best. Just open the file in Excel and you will immediately see the tables as shown in the preview. In most of our test runs the exported data was quite accurate and needed only minimal adjustment, such as removing empty columns or fixing other minor errors within the tables. Still, a second look is definitely advisable.
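If you would rather do that cleanup in code than in Excel, here is a minimal Python sketch that removes columns which are empty in every row of an exported CSV. It uses only the standard library. The function names and the sample data are made up for illustration; they are not part of Tabula.

```python
import csv
import io


def drop_empty_columns(rows):
    """Return rows without the columns that are blank in every row."""
    if not rows:
        return []
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [row + [""] * (width - len(row)) for row in rows]
    # Keep a column only if at least one cell holds non-whitespace text.
    keep = [i for i in range(width) if any(r[i].strip() for r in padded)]
    return [[r[i] for i in keep] for r in padded]


def clean_csv_text(text):
    """Parse CSV text, drop the empty columns and return CSV text again."""
    rows = list(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(drop_empty_columns(rows))
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical export with one blank column in the middle.
    sample = "Country,,Population\nAustria,,9000000\nGermany,,83000000\n"
    print(clean_csv_text(sample))
```

To use it on a real export, read the file from your download folder as text and pass the contents to clean_csv_text. The sketch only handles blank columns; other errors in the tables still need a manual check.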
What we think about it
Tabula is a neat little tool. It’s easy to use and does what it’s supposed to do. It might not be perfect and could offer a few more options (such as built-in OCR, which would eliminate the intermediate step with other tools), but it’s definitely better than copying and pasting tables from PDFs manually.
|What we like…||…and dislike|