Scrapy is great for scraping static web pages with Python, but when it comes to dynamic web pages Scrapy can't work wonders. That's where Selenium comes in, but as good as Selenium is, it gets beaten by Scrapy in terms of speed.
The web nowadays is all about dynamic, JS-based pages and AJAX. For this very scenario the folks at scrapy-plugins created scrapy-splash. Scrapy-Splash is a plugin that connects Scrapy with Splash (a lightweight, scriptable browser as a service with an HTTP API). In a nutshell, what Splash does is intercept the response received from the server and render it. It then returns the rendered HTML, which is static and can be easily scraped.
0 - Setting up the machine
A. Before we begin, you need to install Docker. You can follow the official instructions for your operating system.
B. After installing Docker, navigate to your project folder, activate your virtualenv, and install the scrapy-splash plugin.
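The plugin is published on PyPI as scrapy-splash, so the install is a single pip command:

```shell
# inside your activated virtualenv
pip install scrapy-splash
```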
C. Pull the Splash Docker Image and run it
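The Splash image is published on Docker Hub as scrapinghub/splash; a minimal pull-and-run, mapping Splash's default HTTP port 8050, looks like this:

```shell
# pull the official Splash image
docker pull scrapinghub/splash

# run Splash, exposing its HTTP API on port 8050
docker run -p 8050:8050 scrapinghub/splash
```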
1 - Configuration
A. Add the Splash server address to settings.py of your Scrapy project like this:
SPLASH_URL = 'http://localhost:8050'
If you are running Docker on your local machine, you can simply use http://localhost:<port>, but if you are running it on a remote machine you need to specify its IP address, like this: http://192.168.59.103:<port>
B. Enable the Splash middleware by adding it to DOWNLOADER_MIDDLEWARES in your settings.py file and changing HttpCompressionMiddleware priority:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
C. Enable SplashDeduplicateArgsMiddleware by adding it to SPIDER_MIDDLEWARES in your settings.py:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
D. Set a custom DUPEFILTER_CLASS:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
2 - Scraping with Splash
Before you use scrapy-splash you need to import it in your spider. You can do that by adding this line:
from scrapy_splash import SplashRequest
From now on, instead of using scrapy.Request you can simply use SplashRequest to get the response from Splash instead of directly from the server.
Bonus - Using Scrapy-Splash in Shell
It's all well and good, but actual spider building does not happen in Vim or Sublime; it takes place in the shell.
So how to use Splash in the shell?
Good Question.
Instead of invoking the shell and fetching with:
>>> fetch('http://example.com')
or with this:
scrapy shell 'http://example.com'
You invoke the shell with this:
scrapy shell 'http://localhost:8050/render.html?url=http://example.com&timeout=10&wait=0.5'
Let me explain:
- localhost:<port> is where your Splash service is running
- url is the URL you want to crawl
- render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page
- timeout is the time in seconds for the timeout
- wait is the time in seconds to wait for JavaScript to execute before reading/saving the HTML
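Putting those pieces together, the render URL can also be built programmatically. This sketch uses only the standard library and assumes Splash is running locally on the default port 8050:

```python
from urllib.parse import urlencode

# assumed local Splash instance on the default port
splash_base = "http://localhost:8050/render.html"

# url, timeout, and wait map directly onto the render.html parameters above
params = {"url": "http://example.com", "timeout": 10, "wait": 0.5}
render_url = f"{splash_base}?{urlencode(params)}"

print(render_url)
```

Passing `render_url` to `scrapy shell` then drops you into a shell with the rendered page as the response.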
If I’ve missed something, made a horrible mistake, or if you have any questions regarding this article, feel free to ping me on Twitter. I’m @aaqaishtyaq.