Scrapy is great for scraping static web pages with Python, but when it comes to dynamic web pages Scrapy can't do wonders, and that's where Selenium comes in. As good as Selenium is, though, it gets beaten by Scrapy in terms of speed. The web nowadays is all about dynamic, JS-based pages and AJAX, so for this very scenario the guys over at scrapy-plugins created scrapy-splash. Scrapy-Splash is a plugin that connects Scrapy with Splash (a lightweight, scriptable browser as a service with an HTTP API). In a nutshell, Splash intercepts the response received from the server and renders it, then returns a `render.html` which is static and can easily be scraped.
0 - Setting up the machine
A. Before we begin, you need to install Docker. You can follow the official instructions for your operating system.
B. After installing Docker, navigate to your project folder, activate your virtualenv, and install the scrapy-splash plugin.
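With the virtualenv active, it's a single pip install (the package is published on PyPI as scrapy-splash):

```bash
pip install scrapy-splash
```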
C. Pull the Splash Docker image and run it.
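The image is published as `scrapinghub/splash`, and Splash listens on port 8050 by default:

```bash
docker pull scrapinghub/splash
# expose Splash's default HTTP port on the host
docker run -p 8050:8050 scrapinghub/splash
```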
1 - Configuration
A. Add the Splash server address to the `settings.py` of your Scrapy project.
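Assuming Splash is running on its default port, 8050, the setting looks like this:

```python
# settings.py — point Scrapy at the running Splash instance
SPLASH_URL = 'http://localhost:8050'
```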
If you are running Docker on your local machine then you can simply use http://localhost:<port>, but if you are running it on a remote machine you need to specify its IP address, like http://192.168.59.103:<port>.
B. Enable the Splash middleware by adding it to `DOWNLOADER_MIDDLEWARES` in your `settings.py` file, and change the `HttpCompressionMiddleware` priority.
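The middleware entries and priorities below follow the scrapy-splash README; `HttpCompressionMiddleware` is moved to 810 so that decompression happens at the right point relative to the Splash middlewares:

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    # re-prioritised so compression handling plays nicely with Splash responses
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
```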
C. Enable `SplashDeduplicateArgsMiddleware` by adding it to `SPIDER_MIDDLEWARES` in your `settings.py`.
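Again following the scrapy-splash README, with priority 100:

```python
# settings.py
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
```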
D. Set a custom `DUPEFILTER_CLASS`, so that request fingerprints take Splash arguments into account.
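scrapy-splash ships a Splash-aware dupe filter for exactly this:

```python
# settings.py
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```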
2 - Scraping with Splash
Before you use scrapy-splash you need to import it in your spider. You can do that by adding this line:
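```python
from scrapy_splash import SplashRequest
```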
From now on, instead of using `scrapy.Request`, you can simply use `SplashRequest` to get the response from Splash instead of directly from the server.
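To make that concrete, here is a minimal sketch of a spider using `SplashRequest` (the spider name, target URL, and CSS selectors are illustrative placeholders, not from this article):

```python
import scrapy
from scrapy_splash import SplashRequest


class JSQuotesSpider(scrapy.Spider):
    name = 'js_quotes'  # hypothetical spider name

    def start_requests(self):
        # endpoint='render.html' asks Splash for the rendered page;
        # args are forwarded to the Splash HTTP API ('wait' gives JS time to run)
        yield SplashRequest(
            'http://quotes.toscrape.com/js',
            callback=self.parse,
            endpoint='render.html',
            args={'wait': 2},
        )

    def parse(self, response):
        # by the time we get here, response.text is static, rendered HTML
        for text in response.css('div.quote span.text::text').getall():
            yield {'quote': text}
```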
Bonus - Using Scrapy-Splash in Shell
It's all well and good, but actual spider building does not happen in vim or sublime; it takes place in the shell.
So how do you use Splash in the shell?
Good question.
Instead of invoking the shell and then fetching a page inside it, like this:

```
>>> fetch('http://domain.com/page-with-javascript.html')  # placeholder URL
```
or with this:
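```bash
# same placeholder URL as above
scrapy shell 'http://domain.com/page-with-javascript.html'
```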
you invoke the shell like this:
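```bash
# 8050 is Splash's default port; url, timeout and wait are query
# parameters passed to Splash's render.html endpoint
scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'
```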
Let me explain:

- `localhost:port` is where your Splash service is running
- `url` is the URL you want to crawl
- `render.html` is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page
- `timeout` is the timeout for the render, in seconds
- `wait` is the time, in seconds, to wait for JavaScript to execute before reading/saving the HTML
If I’ve missed something, made a horrible mistake, or if you have any questions regarding this article, feel free to ping me on Twitter. I’m @aaqaishtyaq.