PDF Extraction & Sentiment Analysis

Extracting unstructured data from PDF files for sentiment analysis with the help of Natural Language Processing (NLP).

Project

One day, we were contacted by a representative of a Texas university who was working on a research project. He pointed us to a website containing transcripts of US Congress sessions from 1985 to the present day. All the files on that website were scanned copies of the original documents in PDF format, containing speeches by various congressmen on different topics, which meant we had to deal with unstructured data. The main goal of the research was to determine the sentiment of every speech published on that website, and to identify the speakers, their political parties, locations, years in the Senate and so on.

The second goal (which we actually came up with while working towards the first) was to develop a user-friendly dashboard where the client could view the analysis results, along with a module for analyzing similar files.

While developing a dashboard is not that big of a deal, natural language analysis cannot be considered a routine job. This is where experienced developers turn to NLP (Natural Language Processing), an area of artificial intelligence concerned with processing and analyzing large amounts of natural language data.

Challenges


  1. The data we had to work with was unstructured, so a script couldn’t “read” it without first converting it with PDF-to-text libraries.
  2. Once we have plain text, it’s time to connect the NLP system. A detailed description of what NLP is, along with a thorough comparison of the most popular technologies and toolkits, is covered elsewhere, so I’ll get straight to the point: we selected the NLTK library for this specific task.
  3. The last, but most important, part is to put all the pieces of functionality together and develop a robust system that works like a Swiss watch. The system workflow can be described in a few simple steps:
  • It receives a PDF file as input.
  • It converts the file into “txt” format (libraries used: pdftotext/xpdf/pdf-miner/pypdf2).
  • It divides the text into speaker names and speeches. It does this with the help of regular expressions whose delimiters come from a preliminary analysis of the text formatting. We analyzed a few documents manually to determine the formatting differences (how a speaker’s name is written compared to the speech itself, and how comments and other text differ from both).
  • Then it analyzes each text array with NLTK (Vader Lexicon Sentiment Analyzer) to determine the overall sentiment of the text. To get more accurate results, we decided to analyze text segments of different lengths.
  • Lastly, it saves the results into the database.
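
To make the workflow above more concrete, here is a minimal Python sketch of the conversion, splitting and scoring steps. It is an illustration under stated assumptions, not the project's actual code: the speaker pattern ("Mr. SMITH." at the start of a line) is a hypothetical delimiter, the function names are made up for this example, and "different lengths" is interpreted here as scoring both the whole speech and its individual sentences. The `pdftotext` call only extracts an existing text layer, so it assumes the scans already carry one.

```python
import re
import subprocess
from typing import Callable, Dict, List, Tuple

# Hypothetical delimiter: an honorific plus an upper-case surname and a period
# at the start of a line, e.g. "Mr. SMITH. ". The real patterns came from a
# manual inspection of the documents and are not shown in this write-up.
SPEAKER_RE = re.compile(r"^(?:Mr|Mrs|Ms)\. ([A-Z][A-Z'\-]+)\. ", re.MULTILINE)


def pdf_to_text(pdf_path: str) -> str:
    """Extract the text layer with the pdftotext CLI ("-" sends it to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def split_speeches(text: str) -> List[Tuple[str, str]]:
    """Return (speaker, speech) pairs, cutting the text at every speaker marker."""
    matches = list(SPEAKER_RE.finditer(text))
    pairs = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        speech = " ".join(text[match.end():end].split())  # collapse line breaks
        pairs.append((match.group(1), speech))
    return pairs


def score_speech(speech: str, scorer: Callable[[str], float]) -> Dict[str, float]:
    """Score the speech as a whole and as the mean over its sentences."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", speech) if s]
    per_sentence = [scorer(s) for s in sentences]
    mean = sum(per_sentence) / len(per_sentence) if per_sentence else 0.0
    return {"whole": scorer(speech), "sentence_mean": mean}


def vader_scorer() -> Callable[[str], float]:
    """Build a scorer backed by NLTK's VADER (needs nltk and its vader_lexicon data)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]
```

With NLTK and its `vader_lexicon` resource installed, something like `[(name, score_speech(speech, vader_scorer())) for name, speech in split_speeches(pdf_to_text(path))]` would chain the steps. Passing the scorer in as a function keeps the splitting and aggregation logic easy to check without NLTK.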

2 developers worked on the project

34k data files

1 month to complete

Solution

A dashboard (built on Laravel) that lets the user process arbitrary PDF files through the NLP system to extract various data points, with a section for reviewing the analysis results:

  1. The homepage. It shows a table with several columns:
    • Link to the original PDF file (which can be viewed or downloaded).
    • Speaker’s name.
    • Speaker’s political party.
    • Sentiment score of the speech.
    • Link to the detailed results overview (contains the pieces of the paragraphs that were processed through NLTK and the sentiment score breakdown).
  2. The page with the list of speakers. Clicking a name leads to a page with detailed information about that speaker (a short biography, political party, years in the Senate and so on).
  3. The page with the file upload form. It allows uploading new files for further processing through NLTK.
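
The homepage columns listed above map naturally onto stored results joined with speaker records. The sketch below shows one way this could look, using Python's built-in `sqlite3` purely for illustration. The actual dashboard runs on Laravel, and its database engine and schema are not described here, so every table, column and function name in this example is an assumption.

```python
import sqlite3

# Hypothetical schema; table and column names are illustrative only.
SCHEMA = """
CREATE TABLE speakers (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    party TEXT
);
CREATE TABLE speeches (
    id INTEGER PRIMARY KEY,
    speaker_id INTEGER NOT NULL REFERENCES speakers(id),
    pdf_url TEXT NOT NULL,
    sentiment REAL NOT NULL
);
"""


def save_result(conn, name, party, pdf_url, sentiment):
    """Insert the speaker if new, then store one scored speech."""
    conn.execute(
        "INSERT OR IGNORE INTO speakers (name, party) VALUES (?, ?)",
        (name, party),
    )
    (speaker_id,) = conn.execute(
        "SELECT id FROM speakers WHERE name = ?", (name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO speeches (speaker_id, pdf_url, sentiment) VALUES (?, ?, ?)",
        (speaker_id, pdf_url, sentiment),
    )


def homepage_rows(conn):
    """Return (pdf_url, speaker, party, sentiment) tuples for the overview table."""
    return conn.execute(
        "SELECT s.pdf_url, p.name, p.party, s.sentiment "
        "FROM speeches AS s JOIN speakers AS p ON p.id = s.speaker_id "
        "ORDER BY s.id"
    ).fetchall()


# Placeholder values, not real data from the project.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
save_result(conn, "SMITH", "Example Party", "files/example-1.pdf", 0.42)
save_result(conn, "SMITH", "Example Party", "files/example-2.pdf", -0.10)
rows = homepage_rows(conn)
```

Keeping speakers in their own table means the biography, party and years in office shown on the speaker pages are stored once, while each processed speech only references the speaker it belongs to.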

This was not an ordinary project, as it required an advanced level of expertise and a non-standard approach. Nevertheless, our team of experienced developers brought it to completion and met the high expectations of the researcher, who needed this system to conduct his research.
