How to Scrape 100 Million Emails from Instagram

Scraping emails is a quick, easy way to generate lists of potential leads and customers. Profiles may also contain other valuable information that you can use for business purposes.

Let’s see how to scrape over 100 million emails from Instagram.

What Is Scraping, and Which Social Networks Can We Scrape?

Any web resource can have value. Information is valuable.

“He who owns information owns the world” is a simple truth. Without the right information, it’s difficult to become a successful business person (the same can be said of politicians and other professionals).

When it comes to social media scraping, the opportunities are wide-ranging. Nowadays, it’s hard to find anyone who is not on social media. By creating accounts on these platforms, users give away their personal details and contact information, and they usually allow their images to be publicly visible. By tracking these, along with their likes, favourites and followed accounts, you have a window into their world.

All known social media platforms are being scraped by individuals and companies in search of information they can use to better reach their business goals.

Why Businesses Scrape Instagram

In this article, we will talk about Instagram scraping and the goals behind it. As of January 1, Instagram is estimated to have around 1 billion active users, including around 25 million business accounts. More than a third of Instagram users have made purchases using their mobile devices, which makes them a staggering 70% more likely to do so than people who don’t use Instagram.

The information available depends on which type of account is studied. Business accounts, for example, contain an email address (supplied by default) and a telephone number (optional), while personal accounts don’t contain this contact information.

You can collect email addresses and phone numbers for marketing purposes, to get sales leads, or to build a pool of people interested in your competitors’ brands. Using this information, companies often use “cold calling” or a “warm approach” to reach out to potential customers.

By getting profiles from the recommendation feed it’s easy to reach your target audience with greater accuracy. Another way to use Instagram data is to create content based on different topics, as it’s possible to scrape posts based on hashtags. You can play with the words and word order in hashtags to cover researched topics in detail.

How to Scrape Instagram: Using External Software, Hiring a Professional or Using an API Extension

There are several ways to scrape Instagram. You can use Chrome extensions, purchase software, or turn to web scraping experts, depending on the scale of data you need.

Technical Side: How to Execute Scraping

During scraping, you can target profiles, hashtags, and places, and Instagram will return the top 100 posts. There is even a nice internal API endpoint that can be used to get the results in JSON format:
The context query parameter serves as a filter and can contain a place, a user or a hashtag. The only limitation is that the endpoint returns no more than 100 results. If you require more results, you will need to enter a more refined filter.

When you open any public Instagram page that contains posts (e.g. a profile, hashtag or place), Instagram returns an HTML page with the first few posts preloaded (probably using React server-side rendering). Then, as you scroll down the page, Instagram continues loading more posts using XHR requests to Instagram’s GraphQL endpoint. The endpoint is protected with a token, so it can’t easily be accessed directly, and the page has to be scrolled instead. However, the infinite scrolling can be automated efficiently using headless Chrome with Puppeteer.

A typical method to retrieve data from Instagram is to search for hashtags, get the posts published with these hashtags, and gather post IDs and usernames. Usernames can be retrieved with the API.

The final step is the retrieval of emails and contact phone numbers. Based on this, we can see that the scraping process resembles a tree, where from small branches we’re able to move up to bigger branches and gather more information. Gathering data in this way creates a snowball effect.

When we scrape Instagram, we always start with hashtags, acquiring them either from the client or generating them ourselves and combining the words. This allows us to get posts and users that are using these hashtags. From these posts, we gather user IDs and usernames.
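
The step of generating hashtags by combining words can be illustrated with a minimal sketch. It assumes the seed words come from the client; the function name hashtag_candidates and the max_words limit are hypothetical and not part of the process described above. It only builds candidate strings in every word order and does not contact Instagram.

```python
from itertools import permutations


def hashtag_candidates(words, max_words=2):
    """Build hashtag strings from every ordering of up to max_words seed words."""
    # Normalise the seed words: trim, lowercase, drop inner spaces.
    cleaned = [w.strip().lower().replace(" ", "") for w in words if w.strip()]
    candidates = []
    seen = set()
    for size in range(1, max_words + 1):
        # permutations() yields every word order, e.g. coffee+shop and shop+coffee.
        for combo in permutations(cleaned, size):
            tag = "#" + "".join(combo)
            if tag not in seen:  # skip duplicates caused by repeated seed words
                seen.add(tag)
                candidates.append(tag)
    return candidates


if __name__ == "__main__":
    for tag in hashtag_candidates(["coffee", "shop"]):
        print(tag)
    # prints: #coffee, #shop, #coffeeshop, #shopcoffee (one per line)
```

The list grows quickly: with n seed words and max_words=2 it contains n + n(n-1) entries, so three words already give nine candidates.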

If the task is to focus on a narrower target group, for example, only women, or users in a specific age range, you will find that it isn’t possible to filter by age and gender. This is because these fields are not mandatory when users register on Instagram.

A hypothetical solution is to use ML/AI with face recognition techniques to distinguish men from women. However, this solution is not 100% accurate.

Scraping Technical Details

Hashtags can be parsed with just one parser, as the Instagram feed’s endpoint is protected with a token. Pagination is based on the site’s infinite scroll. Loading of the comment section is also triggered by an XHR request to Instagram’s GraphQL endpoint.

There are two ways you can get to the information: with a login or without one. Without logging in, you can scrape hashtags and usernames. For this, only proxies and server resources are necessary.

To access more valuable data like emails and phone numbers from business accounts, you will need to be logged in. We do this through the API, using purchased user accounts. We buy 100 user accounts for roughly $55. The accounts are verified through two-factor verification (using emails, not phones).

You would need a simple admin panel to organise the process, where you can insert either usernames or hashtags. We built ours using Laravel.

Costs of Scraping 100 Million Emails: Money and Time

What are the limitations and difficulties while scraping Instagram?

There are speed limitations involved when logging in as each user, so you need high-quality proxies. Twelve proxies cost us $45 per month, and they allow us to maintain 100-150 authorized users.

In terms of time, the execution speed changes. We always start with the testing mode.

We don’t run full speed, as we want to monitor how the process evolves and apply the necessary technical settings. This stage can be a slow process.

Next, we scale up and accelerate after tweaking the technical settings. During this stage, we scrape larger amounts of data at a higher speed. Unfortunately, we then reach a point where it becomes harder to follow the tree structure flow, and the speed declines.

For example, once 40% of Instagram has been scraped, the following 35% will be significantly slower, and the final 25% will probably never get done.

Proxies

The number of proxies depends on the number of requests that Instagram allows. From time to time, Instagram’s API can ban an account (either temporarily or permanently). The request limit is hard to determine. Several times we were able to scrape in 300 parallel threads, but this can change rapidly.

Using 180 accounts, we were able to parse 1.5 million accounts per day. We are now in the third stage and parse 200 thousand per day, even though there were no serious changes. We are aware that Instagram constantly improves and upgrades its anti-scraping measures for personal data. Instagram rarely announces these improvements and upgrades.
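
To put these rates in perspective, here is a short worked calculation. It uses only the figures quoted above (180 accounts, 1.5 million and 200 thousand parsed accounts per day) and the 100 million target from the title. It assumes that the same 180 accounts are still in use in the third stage, which the text does not state explicitly; the function names are illustrative.

```python
import math

PEAK_PER_DAY = 1_500_000         # accounts parsed per day at peak (from the text)
STAGE_THREE_PER_DAY = 200_000    # accounts parsed per day in the third stage
SCRAPER_ACCOUNTS = 180           # assumed unchanged between the two stages
TARGET = 100_000_000             # headline target


def per_account_rate(total_per_day, accounts):
    """Average number of profiles parsed per scraper account per day."""
    return total_per_day / accounts


def days_to_target(target, per_day):
    """Whole days needed to reach the target at a constant daily rate."""
    return math.ceil(target / per_day)


if __name__ == "__main__":
    print(f"{per_account_rate(PEAK_PER_DAY, SCRAPER_ACCOUNTS):.0f}")         # 8333
    print(f"{per_account_rate(STAGE_THREE_PER_DAY, SCRAPER_ACCOUNTS):.0f}")  # 1111
    print(days_to_target(TARGET, PEAK_PER_DAY))                              # 67
    print(days_to_target(TARGET, STAGE_THREE_PER_DAY))                       # 500
```

Under these assumptions, the drop from the peak rate to the third-stage rate is a factor of 7.5, which stretches a run of roughly two months to about 500 days.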

Expenses

To get 4 million accounts per month, it costs us around $500 for scraper accounts, servers, proxies and software. That comes out to 1 cent per 8 emails.
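
The arithmetic behind these figures can be checked with a few lines. The inputs are the numbers quoted above; the share of accounts that yield an email is not stated in the text and is only inferred here from the "1 cent per 8 emails" figure, so treat it as an assumption. The function names are illustrative.

```python
MONTHLY_COST_USD = 500           # scraper accounts, servers, proxies, software
ACCOUNTS_PER_MONTH = 4_000_000   # accounts collected per month
EMAILS_PER_CENT = 8              # "1 cent per 8 emails"
TARGET_EMAILS = 100_000_000      # headline target


def accounts_per_cent(monthly_cost_usd, accounts_per_month):
    """How many scraped accounts one cent pays for."""
    return accounts_per_month / (monthly_cost_usd * 100)


def implied_emails_per_month(monthly_cost_usd, emails_per_cent):
    """Emails per month implied by the quoted price per email."""
    return monthly_cost_usd * 100 * emails_per_cent


def cost_for_emails_usd(emails, emails_per_cent):
    """Total cost in dollars of a given number of emails at the quoted price."""
    return emails / emails_per_cent / 100


if __name__ == "__main__":
    print(accounts_per_cent(MONTHLY_COST_USD, ACCOUNTS_PER_MONTH))         # 80.0
    emails = implied_emails_per_month(MONTHLY_COST_USD, EMAILS_PER_CENT)
    print(emails)                                                          # 400000
    print(emails / ACCOUNTS_PER_MONTH)                                     # 0.1
    print(cost_for_emails_usd(TARGET_EMAILS, EMAILS_PER_CENT))             # 125000.0
```

One cent pays for 80 scraped accounts, so the quoted price of 1 cent per 8 emails is consistent with roughly one account in ten yielding an email address. At that price, 100 million emails would cost about $125,000.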

Why is Scraping Instagram Not Easy?

Instagram can ban accounts at any time. A ban can be temporary or permanent. In the case of a permanent ban, you will need to start the whole process over again. And this is just one problem. Another issue is that the feed for any given hashtag keeps growing as new posts are published.

The current API can also be removed at any time and replaced by two new ones. What happens in this case? After one update, almost all libraries disappeared from GitHub except the one written in C#. You’re at the mercy of Instagram and the changes it can make at any time. A simple, refined process is best for scraping Instagram.

Wrapping Up

Instagram’s 1+ billion users are active, based across the world, and buy from businesses worldwide. When you scrape Instagram, you’re able to use this data to target businesses and consumers with great precision.

And at such a low cost, scraping Instagram makes sense for businesses.
Using a professional team that has experience in the scraping process helps save you time, money and internal resources.