In today’s work environment, PDF documents are widely used for exchanging business information, both internally and with trading partners. You have surely seen plenty of PDFs in the form of invoices, purchase orders, shipping notes, price lists and so on. Despite serving as a digital replacement for paper, PDF documents present a challenge for automated manipulation of the data they store. That data is about as accessible to a program as data written on a piece of paper, since many PDFs are designed to convey information to us humans, not to computers. Such PDFs can contain unstructured information that does not have a pre-defined data model or is not organized in a pre-defined manner. They are typically text-heavy and may contain a mix of figures, dates and numbers.
With the majority of available tools, you very often have to process the entire PDF document, with no option to limit the data extraction to the specific section where the most valuable data lies. Some PDF table extraction tools do offer that option. Sadly, even if you are lucky enough to have a table structure in your PDF, it doesn’t mean that you will be able to seamlessly extract data from it.
For example, let’s take a look at the following text-based PDF with some fake content. It has clearly visible and distinct (although borderless) rows and columns:
At a quick glance you could miss one important pattern: the text at the intersection of some rows and columns is stacked and shifted, so it can hardly be recognized as an additional feature of the same data row.
Although any data that does not fit nicely into a column or a row is widely considered unstructured, we can identify this particular real-world phenomenon as semi-structured data.
That does not make it any easier for an out-of-box extraction algorithm to parse data from such a table. While those tools may produce reasonably good results in general, this particular case requires extra development effort to fit our requirements. As you move forward with this tutorial, you’ll find a non-trivial solution to this challenge.
Scope of this tutorial
In this tutorial you will learn how to:
- Use out-of-box solutions to extract tables from a PDF
- Get raw text from a PDF while preserving the original document layout
- Perform text manipulations with numpy and pandas
More generally, you will get a sense of how to deal with context-specific data structures in a range of data extraction tasks.
Out-of-box solutions for table extraction
To back up the above statements, we’ll try to parse our semi-structured data with ready-made Python modules designed specifically to extract tables from PDFs. Among the most popular out-of-box options are camelot-py and tabula-py. Both have proved effective in many complicated contexts. Let’s see how they meet our challenge:
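For reference, here is a minimal sketch of how the two libraries might be called on a borderless table. The file name `sample.pdf`, the page number and the helper names are assumptions made for illustration, and the imports sit inside the functions so the snippet loads even when only one of the libraries is installed.

```python
def extract_with_camelot(pdf_path, pages="1"):
    """Return the tables camelot-py detects as a list of DataFrames."""
    import camelot  # provided by the camelot-py package

    # flavor="stream" relies on whitespace between cells, which suits
    # tables that have no ruling lines.
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="stream")
    return [table.df for table in tables]


def extract_with_tabula(pdf_path, pages=1):
    """Return the tables tabula-py detects as a list of DataFrames."""
    import tabula  # provided by the tabula-py package (needs Java)

    # stream=True is tabula's counterpart for borderless tables.
    return tabula.read_pdf(pdf_path, pages=pages, stream=True)


# Hypothetical usage, assuming a local file called sample.pdf:
# for frame in extract_with_camelot("sample.pdf"):
#     print(frame)
# for frame in extract_with_tabula("sample.pdf"):
#     print(frame)
```

Both helpers return plain pandas DataFrames, so their results can be compared side by side.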
It seems our initial choice has turned into a miserable failure! While tabula-py appears to be slightly better at detecting the grid layout of our table, it still leaves a lot of extra work to split the text in the second column, not to mention that it has completely dropped the last ‘hanging’ row of the original table.
As for the output of camelot-py, all the relevant information about the relative position of text in the columns has clearly vanished.
From this moment on, we’ll proceed with building our custom parsing algorithm.
Processing raw text from PDF with an authentic layout
To begin with, we need a basis for our custom algorithm to work on. This should be a string input that fully represents the layout of the original document. Again, you have quite a few options (think of Python modules) to choose from. In this tutorial, we’ll experiment with pdfminer and pdftotext. Here is the output of their work:
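The raw text itself can be obtained roughly as follows. This is a minimal sketch and not necessarily the exact calls behind the outputs discussed here: the helper names are made up for illustration, and how well the spacing survives depends on the document and on each library's layout settings.

```python
def text_with_pdfminer(pdf_path):
    """Extract the text of a PDF with pdfminer.six."""
    from pdfminer.high_level import extract_text  # pdfminer.six package

    return extract_text(pdf_path)


def text_with_pdftotext(pdf_path):
    """Extract the text of a PDF with the pdftotext module (poppler)."""
    import pdftotext

    with open(pdf_path, "rb") as handle:
        pages = pdftotext.PDF(handle)
        # The PDF object behaves like a sequence of page strings.
        return "\n\n".join(pages)


# Hypothetical usage, assuming a local file called sample.pdf:
# print(text_with_pdfminer("sample.pdf"))
# print(text_with_pdftotext("sample.pdf"))
```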
It looks like both modules have produced quite satisfactory results, except that pdftotext has stripped some whitespace between what we can take to be the 5th and 6th columns.
Most importantly, in both cases we were able to preserve the visible intervals between vertically aligned blocks of text (hello, monospace font!).
Amazing! Our next step will be to move the stacked text blocks in the ‘hanging’ (even) rows to the right and up (⤴), onto the same level as the corresponding odd rows where they really belong. Along the way, we’ll separate the first row as the header of our table to deal with it individually:
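One way this step could look with numpy and pandas is sketched below. The sample text is made up, and the code assumes that columns are separated by at least one character position that is blank on every line, and that data rows strictly alternate between a main row and a hanging row. Real documents will likely need extra handling.

```python
import numpy as np
import pandas as pd

# Made-up sample in the spirit of the table above: every second data row
# carries stacked text that belongs to the row before it.
RAW = """\
Item   Qty/Unit   Price
A001   10         5.00
       pcs
B002   3          7.50
       box
"""


def split_fixed_width(text):
    """Split monospace text into cells at character columns blank on all lines."""
    lines = text.splitlines()
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    grid = np.array([list(line) for line in padded])
    blank = (grid == " ").all(axis=0)
    # A cell starts wherever a non-blank column follows a blank one.
    starts = [i for i in range(width) if not blank[i] and (i == 0 or blank[i - 1])]
    ends = starts[1:] + [width]
    rows = [[line[s:e].strip() for s, e in zip(starts, ends)] for line in padded]
    return pd.DataFrame(rows)


def unstack_hanging_rows(table):
    """Separate the header and lift each hanging row beside its parent column."""
    header, body = table.iloc[0], table.iloc[1:]
    main = body.iloc[0::2].reset_index(drop=True)     # 1st, 3rd, ... data rows
    hanging = body.iloc[1::2].reset_index(drop=True)  # stacked text below them
    pieces, labels = [], []
    for col in main.columns:
        pieces.append(main[col])
        labels.append(col)
        if (hanging[col] != "").any():
            # Place the stacked text right after its parent, under the same label.
            pieces.append(hanging[col])
            labels.append(col)
    data = pd.concat(pieces, axis=1)
    data.columns = labels
    return header, data


header, data = unstack_hanging_rows(split_fixed_width(RAW))
print(list(header))
print(data)
```

With this sample, the second column is duplicated, so the resulting labels are 0, 1, 1, 2. That repeated label is the kind of duplicated header the next section deals with.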
Matching columns to original headers
Have you noticed how naturally and consistently our table got duplicated headers for the additional columns, such as 3–3 and 5–5?
To map all the columns back to the original headers of the table, we’ll perform the following renaming trick: we’ll give every duplicated entry a new unique postfix relative to its parent (leftmost) element. For example, the sequence [0, 1, 2, 3, 3, 4 …] will become [0, 1, 2, 3, 3_1, 4 …].
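The renaming trick can be sketched in a few lines of plain Python. The function name `add_postfixes` is made up for illustration, and the labels are converted to strings so that the postfixed and plain entries share one type.

```python
from collections import Counter


def add_postfixes(labels):
    """Give repeated labels a numbered postfix; the leftmost copy stays as-is."""
    seen = Counter()
    renamed = []
    for label in labels:
        count = seen[label]
        renamed.append(str(label) if count == 0 else f"{label}_{count}")
        seen[label] += 1
    return renamed


print(add_postfixes([0, 1, 2, 3, 3, 4]))  # ['0', '1', '2', '3', '3_1', '4']
```

The result can be assigned straight back to the `columns` attribute of a pandas DataFrame.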
We’ll perform a somewhat similar transformation on the original headers of the table (which we separated from the data in the previous step). You have probably already noticed one particular pattern there: all the columns with stacked text have ‘/’ in their titles, and the part after the slash is the name of the column we got from unstacking the data. To match columns with their titles, we need some kind of lookup table for the headers to refer to and perform the final touch:
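Such a lookup table could be built as in the sketch below. It assumes the column labels follow the postfix scheme described above (`3`, `3_1`, …) and that every title containing ‘/’ corresponds to exactly one unstacked column. The header titles and the function name are made up for illustration.

```python
def build_header_lookup(headers):
    """Map postfixed column labels to the names found in the original titles."""
    lookup = {}
    for position, title in enumerate(headers):
        # "Qty/Unit" yields "Qty" for the parent column and "Unit" for
        # the column that came from unstacking.
        parts = [part.strip() for part in title.split("/")]
        for offset, name in enumerate(parts):
            key = str(position) if offset == 0 else f"{position}_{offset}"
            lookup[key] = name
    return lookup


lookup = build_header_lookup(["Item", "Qty/Unit", "Price"])
labels = ["0", "1", "1_1", "2"]
print([lookup[label] for label in labels])  # ['Item', 'Qty', 'Unit', 'Price']
```

With pandas, the same dictionary can be passed to `DataFrame.rename(columns=lookup)` once the data columns carry the postfixed labels.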
The method described above should be considered an ad-hoc solution for extracting data from text-based PDFs when you know exactly where the table is. To parse semi-structured data from PDFs on a large scale, you may want to adopt a different solution.
See the full code on GitHub.