Scrapy is great for scraping static web pages with Python, but when it comes to dynamic pages it can only do so much. That's where Selenium usually comes in, but as good as Selenium is, Scrapy still beats it in terms of speed.
The web these days is full of dynamic JS-based pages and AJAX. For exactly that scenario, the folks over at scrapy-plugins created scrapy-splash. Scrapy-Splash is a plugin that connects Scrapy with Splash (a lightweight, scriptable browser as a service with an HTTP API). In a nutshell, Splash fetches the page, executes its JavaScript, and renders the result. It then returns a render.html response, static HTML that can be scraped easily.
0 - Setting up the machine
A. Before we begin, you need to install Docker. You can follow the official instructions for your operating system.
B. After installing Docker, navigate to your project folder, activate your virtualenv, and install the scrapy-splash plugin:
pip3 install scrapy-splash
C. Pull the Splash Docker image and run it:
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
1 - Configuration
A. Add the Splash server address to settings.py of your Scrapy project like this:
SPLASH_URL = 'http://localhost:8050'
If you are running Docker on your local machine, you can simply use http://localhost:<port>, but if you are running it on a remote machine you need to specify its IP address, like http://192.168.59.103:<port>
B. Enable the Splash middleware by adding it to DOWNLOADER_MIDDLEWARES in your settings.py file and changing HttpCompressionMiddleware priority:
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
C. Enable SplashDeduplicateArgsMiddleware by adding it to SPIDER_MIDDLEWARES in your settings.py:
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
D. Set a custom DUPEFILTER_CLASS:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
2 - Scraping with Splash
Before you use scrapy-splash you need to import it in your spider. You can do that by adding this line:
from scrapy_splash import SplashRequest
From now on, instead of using scrapy.Request you can simply use SplashRequest to get the response from Splash instead of directly from the server.
Bonus - Using Scrapy-Splash in Shell
It's all well and good, but actual spider building does not happen in Vim or Sublime; it takes place in the shell.
So how to use Splash in the shell?
Good Question.
Instead of invoking the shell with:
scrapy shell
>>> fetch('http://domain.com/page-with-javascript.html')
or with this:
scrapy shell http://domain.com/page-with-javascript.html
You invoke the shell with this:
scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'
Let me explain:
localhost:<port> is where your Splash service is running
url is the URL you want to crawl
render.html is one of the possible HTTP API endpoints; it returns the rendered HTML page in this case
timeout is the time in seconds before the request times out
wait is the time in seconds to wait for JavaScript to execute before reading/saving the HTML
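Typing that URL by hand gets tedious, so you may prefer to build it programmatically. Below is a small helper sketch using only the standard library; the function name and parameter defaults are my own, but the endpoint and query parameters are the ones described above.

```python
from urllib.parse import urlencode

# Base render.html endpoint of a local Splash instance.
SPLASH_RENDER = "http://localhost:8050/render.html"

def splash_shell_url(target, wait=0.5, timeout=10):
    """Build a render.html URL suitable for `scrapy shell '<url>'`."""
    params = {"url": target, "timeout": timeout, "wait": wait}
    # urlencode percent-escapes the target URL for safe embedding.
    return f"{SPLASH_RENDER}?{urlencode(params)}"

print(splash_shell_url("http://domain.com/page-with-javascript.html"))
```

Paste the printed URL (quoted) after scrapy shell, and the response you inspect will be the rendered page rather than the raw server response.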
If I’ve missed something, made a horrible mistake, or if you have any questions regarding this article, feel free to ping me on Twitter. I’m @aaqaishtyaq.