PHP scraper vs Python scraper. Episode 1
The title of the article is not entirely precise: it does not contain charts, timings, or other measured parameters of two scrapers solving the same task. It is rather a tutorial and a review of the tools the two languages provide (spoiler: the author is on Python's side).
So, in the red corner: Python 3 (current version 3.6.2) + the Scrapy framework + MySQLdb; and in the blue corner: PHP (the version is not critical) + cURL + Symfony DomCrawler (to parse HTML) + ReactPHP (for multi-processing) + Doctrine DBAL. A separate note on Scrapy: the framework started somewhere around 2009 (or maybe a little earlier) and is built for a single purpose: scraping. As of now, its Git repository has about 20K stars, 5K forks, 150 open pull requests, and roughly 250 contributors, so we can say it is not losing ground and is only becoming more popular.
We will start by looking at the technologies for a single-threaded scraper that scrapes a single page; then we will move on to multiple pages and multiple threads.
For the PHP scraper we are going to use:
- PHP + cURL
- DomCrawler (composer require symfony/dom-crawler)
- ReactPHP (composer require react/react)
- Doctrine DBAL (composer require doctrine/dbal)
For Python scraper we are going to use:
- python3 (3.6.2)
- Scrapy (pip install scrapy)
- MySQLdb (pip install mysqlclient)
In the PHP case the project starts as a single file holding all the spider logic. Later, ideally, the code should be split into a class for connecting to the database, a class for loading pages, and so on. With Scrapy that free-form start is not possible: a project has to be initialized first, for instance with scrapy startproject tutorial, which creates a minimal project structure that looks like the following:
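Roughly this (the exact file list varies by Scrapy version; the comments are added here for orientation):

```
tutorial/
    scrapy.cfg            # deploy/run configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # your spiders live here
            __init__.py
```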
Entry Point and HTTP request
PHP scraper. We use cURL: initialize a session, set the options and the page URL, and get the result from curl_exec(); a few lines of code and nothing complicated.
In Scrapy this is even easier. A spider is a descendant of the scrapy.Spider class and has a start_urls attribute, a list of URLs the spider should scrape. Add the page URL there and that's it: on start the scraper takes care of requesting the page, handles redirects if needed, and passes the response object to the corresponding method of the spider class. In the terminal it looks like this:
php parser.php
scrapy crawl example
Processing response body and parsing DOM
Here is where Scrapy starts to look much nicer. The response body is an object of the Response class, which has three main methods: css(), xpath(), and re(). Each accepts a string: a CSS selector for the first, an XPath expression for the second, and a regular expression for the third. The first two return a SelectorList object, which can take part in a filter chain. Example:
a_tags = response.xpath('//a')  # select all <a> tags on the page
facebook_tags = a_tags.xpath('self::a[contains(@href, "facebook.com")]')  # grab FB links from the already selected tags
twitter_tags = a_tags.xpath('self::a[contains(@href, "twitter.com")]')  # the same for Twitter
With PHP the same result is achieved with Symfony DomCrawler by passing the response body received from cURL into its constructor. DomCrawler can also do a lot of neat tricks, described in detail in its documentation: various methods such as siblings(), first(), reduce(), and others. At first glance these convenience methods look like an unbeatable advantage, but on closer inspection they are no more powerful than the initial selector (for instance, the XPath axis following-sibling::), have direct counterparts (extract_first()), or boil down to an array_map() call. Python, for its part, has plenty of built-in ways to work with arrays and lists, including map().
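To illustrate those built-ins, here is how DomCrawler-style helpers map onto plain Python on a list of extracted href strings (the list itself is an invented example):

```python
# Plain-Python counterparts of DomCrawler-style helpers;
# the href list is invented for illustration.
hrefs = ["https://facebook.com/acme", "https://twitter.com/acme", "/about"]

first = hrefs[0]                                       # first()
lengths = list(map(len, hrefs))                        # array_map() analogue
absolute = [h for h in hrefs if h.startswith("http")]  # reduce()-style filtering
```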
So although we cannot say that Scrapy can extract anything from the DOM that DomCrawler cannot (or vice versa), XPath filtering feels more pleasant and convenient with Python and Scrapy.