The Client Story
Two years ago, GroupBWT engaged in a long-term collaborative project with a major legal player in the US market. The project's aim was to collect data from Walmart and Amazon for extensive lists of selected keywords and products.
Technologies used: Laravel, Scrapy (Python), Puppeteer, MySQL, MS SQL, RabbitMQ
| Industry | Retail and e-commerce |
|---|---|
| Cooperation | Since 2018 |
| Location | USA |
We streamlined the process, allowing data scraping to be executed in 100-150 streams simultaneously, and ended up extracting data for up to 4 million products from Walmart.
We set up synchronization with the external Azure SQL database and launched the sync; overall, 20 million reviews were successfully collected from Amazon.
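As a rough picture of the synchronization step, the sketch below bulk-inserts scraped reviews into an Azure SQL database with pyodbc. The server, credentials, table, and columns are placeholders, not the client's actual schema.

```python
import pyodbc  # ODBC client for Azure SQL / SQL Server

# Sample batch of scraped reviews (placeholder data).
batch = [
    ("B00EXAMPLE", 5, "Great product"),
    ("B00EXAMPLE", 2, "Arrived damaged"),
]

# All connection details below are hypothetical.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=example.database.windows.net;"
    "DATABASE=reviews;UID=sync_user;PWD=secret"
)
cursor = conn.cursor()
cursor.fast_executemany = True  # speeds up large executemany batches
cursor.executemany(
    "INSERT INTO amazon_reviews (asin, rating, body) VALUES (?, ?, ?)",
    batch,
)
conn.commit()
conn.close()
```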
Brands strive to protect themselves online by keeping their sales under control.
Intense competition has led to the adoption of numerous data-driven technological solutions. Over the past decade, brands have undergone a digital transformation: going omnichannel, closing brick-and-mortar stores, and shifting their sales to online channels. Without an online presence, a brand or company may find itself at a disadvantage. However, the growth of online presence has also led to violations of fair competition. The client teamed up with us to get a solution to combat unauthorized sellers, control and grow online sales, achieve MAP compliance, eliminate channel conflicts, and protect brand value and customer experience.
We devised multiple scrapers and built an admin panel for interacting with the client.
This allowed us to exchange data more efficiently. Scraping was triggered by keywords the client uploaded into the admin panel; our job was to scrape the sellers and products related to those keywords. That allowed us to scrape 4 million products from Walmart and 20 million reviews from Amazon. Scraping giant platforms like Walmart and Amazon is a tough nut to crack, not only because of the sheer number of products and pages, but also because such websites adopt strict measures to limit scraping. It is not always clear whether a process is delivering results, as product and catalogue pages differ in structure and can confuse the scraper logic.
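To illustrate the trigger flow, here is a minimal sketch of a worker that consumes keywords from RabbitMQ (the broker used in the project) and launches a crawl for each one. The queue name, message format, and spider names are assumptions for the example, not the project's actual configuration.

```python
import json
import subprocess

import pika  # RabbitMQ client library

QUEUE = "scrape.keywords"  # hypothetical queue fed by the admin panel

def handle_keyword(ch, method, properties, body):
    """Consume one keyword message and launch a Scrapy crawl for it."""
    payload = json.loads(body)
    keyword = payload["keyword"]
    marketplace = payload.get("marketplace", "walmart")
    # Run the spider in a separate process so one failing keyword
    # cannot bring down the consumer (spider names are hypothetical).
    subprocess.run(
        ["scrapy", "crawl", f"{marketplace}_products", "-a", f"keyword={keyword}"],
        check=False,
    )
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)
channel.basic_qos(prefetch_count=1)  # hand each worker one keyword at a time
channel.basic_consume(queue=QUEUE, on_message_callback=handle_keyword)
channel.start_consuming()
```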
The challenge was not just to build a crawler, but to build one that would run smoothly despite the vast amount and variety of input data it would be exposed to. The crawler needed to be highly resilient; we achieved this by combining request-scheduling techniques with IP rotation to avoid identifiable bot behavior patterns. Listed below are some of the precautionary measures we followed throughout the process (a code sketch follows the list):
• IP randomization
• IP addresses within reasonable proximity to the store
• Keeping the chosen IPs for the duration of a scraping session
• Rotating the proxy pool every 24 hours
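A Scrapy downloader middleware along these lines can implement the first three measures. It is a sketch under assumptions: the `PROXIES` setting is a hypothetical list of geo-filtered proxy URLs, and refreshing that pool every 24 hours is handled outside this class.

```python
import random

class SessionStickyProxyMiddleware:
    """Picks a random proxy per scraping session and keeps it
    for the whole session to mimic a consistent visitor."""

    def __init__(self, proxies):
        self.proxies = proxies
        self.session_proxies = {}  # session id -> chosen proxy

    @classmethod
    def from_crawler(cls, crawler):
        # PROXIES is a hypothetical setting: proxy URLs already
        # filtered to reasonable proximity to the target store.
        return cls(crawler.settings.getlist("PROXIES"))

    def process_request(self, request, spider):
        session = request.meta.get("session_id", "default")
        if session not in self.session_proxies:
            self.session_proxies[session] = random.choice(self.proxies)
        request.meta["proxy"] = self.session_proxies[session]
```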
Walmart applies the AJAX technique to its pagination button, so we made the algorithm treat the completion of the loading process as the cue to continue.
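The project used Puppeteer for this step; the sketch below shows the same idea with pyppeteer, its Python port. The button and product-tile selectors are placeholders, and the script treats the appearance of new tiles as the signal that the AJAX load has finished (it assumes a load-more style grid where tiles accumulate).

```python
import asyncio

from pyppeteer import launch  # Python port of Puppeteer

NEXT_BUTTON = "button.paginator-btn-next"  # hypothetical selector
PRODUCT_TILE = "div.search-result-gridview-item"  # hypothetical selector

async def paginate(url, max_pages=1000):
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(url, waitUntil="networkidle2")
    for _ in range(max_pages):
        before = len(await page.querySelectorAll(PRODUCT_TILE))
        try:
            await page.click(NEXT_BUTTON)
        except Exception:
            break  # pagination button gone: no more pages
        # Wait for the AJAX load to complete: new tiles appearing
        # is the cue to continue to the next iteration.
        await page.waitForFunction(
            f'document.querySelectorAll("{PRODUCT_TILE}").length > {before}',
            timeout=15000,
        )
    await browser.close()

asyncio.run(paginate("https://www.walmart.com/search/?query=example"))
```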
We streamlined the process, allowing data scraping to be executed in 100-150 streams simultaneously. This allowed us to collect 20 million customer reviews from Amazon over the course of the project. For Walmart, pagination was repeated more than 1,000 times for each provided keyword, and we ended up extracting data for up to 4 million products from the website.
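In Scrapy terms, that level of parallelism is largely a matter of configuration. The values below are illustrative, not the project's real settings:

```python
# Scrapy settings sketch: cap scraping at roughly 100-150
# simultaneous streams. Exact numbers are assumptions.
CONCURRENT_REQUESTS = 150             # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 100  # per-site cap
DOWNLOAD_DELAY = 0.1                  # small delay to smooth request bursts
RETRY_TIMES = 3                       # retry transient failures
AUTOTHROTTLE_ENABLED = True           # back off automatically under load
```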
Our client has successfully launched an eControl service for its clients, and it is currently helping dozens of US brands stay protected online.
The data pipeline is used to enable the company's legal investigations of unfair retail sales practices. As a result, the client has been using the collected data to counter unfair competition on behalf of big brands and to prevent brand erosion caused by price dumping. Working with us ensured they received a stable flow of fresh, quality data on the provided keywords, products, and suppliers. Our cooperation is still ongoing.
Ready to discuss your idea?
Our team of experts will find and implement the best web scraping solution for your business. Drop us a line, and we will get back to you within 12 hours.