Contact Information Scraper & NLP Analysis
Working closely with multiple US educational institutions, we were once asked to find and collect the email addresses, names, and job titles of people working at US universities. Given a list of university websites up front, we had to visit each one and scrape every email address that could be found.
Determine the university department to link the email owner to.
Besides names and emails, the client also wanted to know the department in which each scraped person worked. So we came up with the following rather simple plan:
1. To identify potential department links, all links scraped from a page are checked for keywords in the URL text or tag attributes.
2. Popular and widely used keywords, such as news, blog, and calendar, are excluded.
3. If a keyword is found in the URL, the link is cut right after the URL part that contains the keyword. For instance, for the ‘engineering’ keyword, from two links that have been found:
we will pull
4. If a potential department link is on a subdomain of the university website, only the subdomain is treated as the department. For instance, take http://art.qwe.edu/news/campus-art-gallery.
In this case, the university link is http://qwe.edu, and http://art.qwe.edu is defined as the department link.
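The four steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the keyword sets, the function names, and the token-based matching are all assumptions, and only the qwe.edu addresses come from the example above.

```python
import re
from urllib.parse import urlsplit

# Hypothetical keyword sets; the real lists used in the project are not given.
DEPARTMENT_KEYWORDS = {"engineering", "art", "physics"}
EXCLUDED_KEYWORDS = {"news", "blog", "calendar"}


def _host(url):
    """Lower-cased host name without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def _has_keyword(text):
    """True if a URL part contains a department keyword and no excluded one."""
    tokens = set(re.split(r"[^a-z0-9]+", text.lower()))
    if tokens & EXCLUDED_KEYWORDS:
        return False
    return bool(tokens & DEPARTMENT_KEYWORDS)


def department_link(link, university_url):
    """Return a candidate department URL for `link`, or None."""
    uni_host = _host(university_url)
    host = _host(link)
    parts = urlsplit(link)

    is_subdomain = host.endswith("." + uni_host)
    if host != uni_host and not is_subdomain:
        return None  # the link leads outside the university site

    # Step 4: a keyword in the subdomain makes the subdomain the department.
    if is_subdomain and _has_keyword(host[: -len(uni_host) - 1]):
        return f"{parts.scheme}://{parts.netloc}"

    # Steps 1-3: cut the path right after the first part with a keyword.
    segments = [s for s in parts.path.split("/") if s]
    for i, segment in enumerate(segments):
        if _has_keyword(segment):
            kept = "/".join(segments[: i + 1])
            return f"{parts.scheme}://{parts.netloc}/{kept}"
    return None
```

Under these assumptions, `department_link("http://art.qwe.edu/news/campus-art-gallery", "http://qwe.edu")` yields `http://art.qwe.edu`, while a hypothetical link such as `http://qwe.edu/academics/engineering/staff` is cut down to `http://qwe.edu/academics/engineering`.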
Determine the name of the email owner and their job title.
Searching a page for emails is not a complicated task: it is done by scanning the tags thoroughly and processing the text with regular expressions (the scripts are written in Python using the Scrapy framework).
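As a rough illustration of that step, the sketch below pulls addresses from both `mailto:` links and the visible text of a page. It uses only the standard library rather than Scrapy, and the regular expressions and the `find_emails` name are assumptions, deliberately simpler than what a production scraper would need.

```python
import re
from html import unescape

# Simplified patterns (assumptions); real-world email syntax is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
MAILTO_RE = re.compile(r"""href\s*=\s*["']mailto:([^"'?]+)""", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def find_emails(html):
    """Return unique, lower-cased emails found in tag attributes and text."""
    found = []

    # Addresses hidden in tag attributes: <a href="mailto:...">
    for raw in MAILTO_RE.findall(html):
        found.extend(EMAIL_RE.findall(unescape(raw)))

    # Addresses in the visible text, with tags stripped and entities decoded
    text = unescape(TAG_RE.sub(" ", html))
    found.extend(EMAIL_RE.findall(text))

    # De-duplicate while keeping the order of first appearance
    unique = {}
    for email in found:
        unique.setdefault(email.lower(), None)
    return list(unique)
```

In a Scrapy spider, a function like this could be called on `response.text` inside the parse callback.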
After that, we added methods that clean the extracted emails of encoded characters (URL encoding) and convert obfuscated emails into the usual format, for instance emails built with JS String.fromCharCode or emails with @ replaced by -at-, (at), _at, and similar, which is done to hide emails from scrapers.
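A clean-up pass of this kind might look like the sketch below. The function names and patterns are assumptions; only the obfuscation styles named above (URL encoding, `String.fromCharCode`, and -at-, (at), _at replacements) are covered.

```python
import re
from urllib.parse import unquote

# Textual stand-ins for "@" mentioned above: -at-, (at), _at / _at_
AT_RE = re.compile(r"\s*(?:-at-|\(at\)|_at_?)\s*", re.IGNORECASE)
CHARCODE_RE = re.compile(r"String\.fromCharCode\(([\d\s,]+)\)")


def decode_char_codes(text):
    """Replace String.fromCharCode(106, 64, ...) calls with their characters."""
    def replace(match):
        codes = [c for c in match.group(1).split(",") if c.strip()]
        return "".join(chr(int(code)) for code in codes)
    return CHARCODE_RE.sub(replace, text)


def normalize_email(raw):
    """Turn an encoded or obfuscated email string into the usual format."""
    text = unquote(raw)              # jane%40qwe.edu -> jane@qwe.edu
    text = decode_char_codes(text)   # JS char codes -> plain characters
    if "@" not in text:              # only de-obfuscate when "@" is missing
        text = AT_RE.sub("@", text, count=1)
    return text.strip()
```

The `"@" not in text` guard keeps ordinary addresses such as `pat_atkins@qwe.edu` intact. An obfuscated address whose local part itself contains `_at` would still be split in the wrong place, so a real implementation would need a stricter rule.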
Now, once we have detected an email address on the page, how do we extract the owner's name and job title? A human can easily find these details somewhere near the email address, among the surrounding lines. To teach a program to do the same, we use Natural Language Processing techniques.
Two handy libraries, NLTK and Stanford NER, are designed specifically for working with text. The text next to each email is analyzed, split into logical parts, and then searched for the required data points using the methods these libraries provide.
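To make the idea concrete, here is a minimal sketch of how the text around an email could be split into parts and passed to a named-entity recognizer. It is not the project's implementation: the function names, the two-line context window, and the rule that the first remaining line after the name is the job title are all assumptions. The NLTK helper relies on NLTK's standard tokenizer, tagger, and chunker, which need their data packages downloaded first.

```python
def context_lines(text, email, radius=2):
    """Return the non-empty lines within `radius` lines of the email."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        if email in line:
            return lines[max(0, i - radius): i + radius + 1]
    return []


def nltk_person_names(text):
    """Person names found by NLTK's default named-entity chunker."""
    import nltk  # imported lazily; requires the NLTK data packages
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    return [
        " ".join(token for token, _tag in subtree.leaves())
        for subtree in tree
        if hasattr(subtree, "label") and subtree.label() == "PERSON"
    ]


def describe_owner(text, email, find_names=nltk_person_names):
    """Guess the name and job title of the email owner from nearby text."""
    name = title = None
    for line in context_lines(text, email):
        cleaned = line.replace(email, "").strip(" ,;:-|")
        if not cleaned:
            continue
        if name is None:
            names = find_names(cleaned)
            if names:
                name = names[0]
                # Text left on the same line is taken as the title (assumption)
                rest = cleaned.replace(name, "").strip(" ,;:-|")
                title = rest or None
        elif title is None:
            # Assumption: the first other line after the name is the title
            title = cleaned
    return {"email": email, "name": name, "title": title}
```

Because `find_names` is a plain callable, another recognizer (for example, one backed by Stanford NER) could be swapped in without changing the rest of the code.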
The scraper successfully processed around 6K educational websites belonging to US universities and collected about 3 million emails and 200K departments, linked to each other, so the data enrichment task was completed.