We successfully enriched a database of 3 million email addresses in a short time frame, reaching up to 92% data accuracy.
Data Augmentation for a Marketing Campaign
This massive data enrichment job involved multi-threaded Google Search scraping and proxy rotation to find the owners of about 3M email addresses.
Those who run online marketing campaigns know: the bigger the database of potential clients, the greater the chances of selling the lead. And the more accurate and complete the contact list, the greater the chances that the target person actually opens your email.
Data enrichment is therefore indispensable here, and experienced marketers regularly turn to data mining firms like ours.
The goal of this project was to enrich a database of 3 million email addresses with their owners' information. Data augmentation is one of the most common tasks in data processing for marketing and advertising: email addresses without owner details are not enough to create a high-quality marketing newsletter.
The fuller the owner information used in an email, the better the chances of landing in the inbox instead of being filtered out as spam. If you have the email owner's first name, last name, and job title, you can create a personalized email, which greatly increases the chances that the target person will be intrigued enough to open and read it.
We used Google to find and pull additional info for data augmentation. When searching Google for a corporate email address, in the majority of cases the owner's details appear in one of the first three organic snippets, and that is where we grabbed the data from. If personal details were missing from the first three results, we visited the linked webpages and scanned them for personal details there. To skip irrelevant snippets, we developed an intelligent filtering system based on blacklisted and whitelisted resources.
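The blacklist/whitelist filtering idea can be sketched roughly as follows. This is a hypothetical illustration, not our production code: the domain lists, the `(url, text)` snippet shape, and the ranking rule are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical domain lists -- the real filter set is much larger.
BLACKLIST = {"pastebin.com"}                    # known-noise sources
WHITELIST = {"linkedin.com", "crunchbase.com"}  # trusted people/company data

def rank_snippets(snippets):
    """Drop blacklisted results and move whitelisted ones to the front.

    `snippets` is a list of (url, text) tuples, e.g. parsed from the
    first organic results of a search page.
    """
    kept = []
    for url, text in snippets:
        domain = urlparse(url).netloc.lower().removeprefix("www.")
        if domain in BLACKLIST:
            continue                      # skip known-irrelevant sources
        priority = 0 if domain in WHITELIST else 1
        kept.append((priority, url, text))
    kept.sort(key=lambda item: item[0])   # whitelisted sources first
    return [(url, text) for _, url, text in kept]
```

In practice a filter like this runs over each page of search results before any owner details are extracted, so irrelevant snippets never reach the parsing stage.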
Another issue is that Google does its best to fight scraping and blocks abusive IP addresses after a certain number of requests. To resolve this, we used our proprietary system that collects free proxies from the Internet (which in most cases saves our clients a significant amount of money, as they don't have to pay for proxy services). Using proxies, we make Google see the queries as coming from multiple locations, and when a proxy is banned, it is replaced with another one to keep the scraping process running continuously.
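The rotate-and-replace behaviour can be sketched as a small round-robin proxy pool. This is an assumed design for illustration only, not the proprietary system described above; the class and method names are invented for the example.

```python
class ProxyPool:
    """Minimal sketch of a rotating proxy pool: hand out proxies in
    round-robin order and replace any that get banned."""

    def __init__(self, proxies):
        self._alive = list(proxies)   # proxies currently believed usable
        self._cursor = 0

    def get(self):
        """Return the next proxy in rotation."""
        if not self._alive:
            raise RuntimeError("no live proxies left; refill the pool")
        proxy = self._alive[self._cursor % len(self._alive)]
        self._cursor += 1
        return proxy

    def report_ban(self, proxy, replacement=None):
        """Drop a banned proxy; optionally swap in a fresh one
        (e.g. newly scraped from a free-proxy list)."""
        if proxy in self._alive:
            self._alive.remove(proxy)
        if replacement is not None:
            self._alive.append(replacement)
```

Each scraping worker would call `get()` before a request and `report_ban()` whenever Google starts rejecting that IP, so the pool heals itself as free proxies come and go.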
Because proxy servers require delays between requests, we sped the process up by running the searches in simultaneous threads (up to 800).
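The threading idea looks roughly like this: each worker honours a polite per-request delay, but many workers run at once, so overall throughput stays high. The `lookup` function here is a stand-in for the real search-and-parse routine, and the delay value is an assumption for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

PER_REQUEST_DELAY = 0.01  # assumed delay between requests on one proxy

def lookup(email):
    """Stand-in for the real routine: search, filter snippets,
    extract owner details."""
    time.sleep(PER_REQUEST_DELAY)        # simulate the polite delay
    return email, f"details for {email}"

def enrich(emails, workers=8):
    # The project ran up to 800 threads; 8 is enough to show the idea.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lookup, emails))
```

With 8 workers, 20 lookups that would take 0.2 s sequentially finish in roughly a quarter of that time; scaled to hundreds of threads, the same pattern keeps millions of delayed lookups moving.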
PDF Extraction & Sentiment Analysis
Extracting unstructured data from PDF files for further sentiment analysis, performed with the help of Natural Language Processing (NLP).
Tracker and Traffic Redirector
A high-load, easily scalable traffic management system capable of processing hundreds of millions of requests per day while effectively distinguishing bot requests from human ones.