Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire website for you
  • Designed with extensibility in mind: it provides several mechanisms to plug in new code without touching the framework core (see the pipeline sketch after this list)
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.
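
As one example of that extensibility, here is a minimal sketch of a custom item pipeline; the module and class names are illustrative, not part of Scrapy itself. A pipeline only needs to implement process_item(), which receives every item the spiders scrape:

# pipelines.py (illustrative module name)
from scrapy.exceptions import DropItem


class RequireTextPipeline:
    """Drop any item that is missing a 'text' field (a hypothetical rule)."""

    def process_item(self, item, spider):
        if not item.get('text'):
            # Raising DropItem removes the item from further processing.
            raise DropItem('missing text field')
        return item

It would then be enabled in the project's settings.py, for example with ITEM_PIPELINES = {'myproject.pipelines.RequireTextPipeline': 300}, where myproject is a placeholder for your project package and the number sets the pipeline's position in the chain.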

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
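
Either way, you can verify the installation by printing the installed version; with -v, Scrapy also reports the versions of key dependencies such as Twisted and lxml:

scrapy version -v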

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract the text and author of every quote on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "next page" link, if there is one, and parse it too.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
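
If you prefer to start the spider from your own Python script instead of the scrapy command, a minimal sketch using CrawlerProcess looks like the following; it assumes a recent Scrapy version (2.1+) where the FEEDS setting is available, and that QuotesSpider is importable from quotes_spider.py:

from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider  # the spider defined above

process = CrawlerProcess(settings={
    # Export scraped items to quotes.json, as the -o flag did above.
    'FEEDS': {'quotes.json': {'format': 'json'}},
})
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes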


Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, the Spiders, the Scheduler, and the Downloader; downloader and spider middlewares provide hooks into the data flow between them. That data flow is described in detail in the official documentation.
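
To make those hook points concrete, here is a minimal sketch of a downloader middleware, custom code that sits in the data flow between the Engine and the Downloader; the module name, class name, and header value are illustrative:

# middlewares.py (illustrative module name)
class CustomHeaderMiddleware:
    """Stamp every outgoing request before the Downloader fetches it."""

    def process_request(self, request, spider):
        # Returning None lets the request continue through the middleware chain.
        request.headers.setdefault('X-Example-Header', 'demo')
        return None

It would be enabled through the DOWNLOADER_MIDDLEWARES setting, e.g. {'myproject.middlewares.CustomHeaderMiddleware': 543}, where the number controls its position in the chain.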


Online resources:

  • Official website: https://scrapy.org
  • Documentation: https://docs.scrapy.org
  • Source code: https://github.com/scrapy/scrapy

15666 questions
31 votes, 1 answer
python No module named service_identity
I tried to update scrapy and when I tried to check the version I got the following error C:\Windows\system32>scrapy version -v :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named…
asked by Marco Dinatsoli

31 votes, 4 answers
How to access scrapy settings from item Pipeline
How do I access the scrapy settings in settings.py from the item pipeline? The documentation mentions it can be accessed through the crawler in extensions, but I don't see how to access the crawler in the pipelines.
asked by avaleske

30 votes, 2 answers
CrawlerProcess vs CrawlerRunner
Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script: using CrawlerProcess or using CrawlerRunner. What is the difference between the two? When should I use "process" and when "runner"?
asked by alecxe

30 votes, 3 answers
unknown command: crawl error
I am a newbie to python. I am running python 2.7.3 version 32 bit on 64 bit OS. (I tried 64 bit but it didn't work out). I followed the tutorial and installed scrapy on my machine. I have created one project, demoz. But when I enter scrapy crawl…
asked by Nits

29 votes, 3 answers
Send Post Request in Scrapy
I am trying to crawl the latest reviews from the Google Play store, and to get that I need to make a POST request. With Postman it works and I get the desired response, but a POST request in the terminal gives me a server error. For ex: this page…
asked by Amit Tripathi

28 votes, 3 answers
Scrapy: how to disable or change log?
I've followed the official tutorial of Scrapy, it's wonderful! I'd like to remove all of the DEBUG messages from console output. Is there a way? 2013-06-08 14:51:48+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6029 2013-06-08 14:51:48+0000…
asked by realtebo

28 votes, 7 answers
scraping the file with html saved in local system
For example, I had a site "www.example.com". I want to scrape the html of this site by saving it to my local system, so for testing I saved that page on my desktop as example.html. Now I have written the spider code for this as below: class…
asked by Shiva Krishna Bavandla

27 votes, 6 answers
Scrapy - Reactor not Restartable
with: from twisted.internet import reactor from scrapy.crawler import CrawlerProcess I've always run this process successfully: process = CrawlerProcess(get_project_settings()) process.crawl(*args) # the script will block here until the crawling is…
asked by 8-Bit Borges

27 votes, 2 answers
Is it possible to pass a variable from start_requests() to parse() for each individual request?
I'm using a loop to generate my requests inside start_requests() and I'd like to pass the index to parse() so it can store it in the item. However, when I use self.i the output has the max value of i (from the last loop turn) for every item. I can use…
asked by ChiseledAbs

27 votes, 4 answers
Passing an argument to a callback function
def parse(self, response): for sel in response.xpath('//tbody/tr'): item = HeroItem() item['hclass'] = response.request.url.split("/")[8].split('-')[-1] item['server'] = response.request.url.split('/')[2].split('.')[0] …
asked by vic

27 votes, 7 answers
Missing scheme in request URL
I've been stuck on this bug for a while; the error message is as follows: File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\http\request\__init__.py", line 61, in _set_url raise ValueError('Missing scheme in…
asked by Toby

27 votes, 2 answers
How to disable or change the path of ghostdriver.log?
The question is straightforward, but some context may help. I'm trying to deploy scrapy while using selenium and phantomjs as downloader, but the problem is that it keeps saying permission denied when trying to deploy. So I want to change the path of…
asked by Sam Stoelinga

27 votes, 8 answers
suppress Scrapy Item printed in logs after pipeline
I have a scrapy project where the item that ultimately enters my pipeline is relatively large and stores lots of metadata and content. Everything is working properly in my spider and pipelines. The logs, however, are printing out the entire scrapy…
asked by dino

27 votes, 3 answers
scrapy - parsing items that are paginated
I have a url of the form: example.com/foo/bar/page_1.html. There are a total of 53 pages, each one of them has ~20 rows. I basically want to get all the rows from all the pages, i.e. ~53*20 items. I have working code in my parse method that parses…
asked by AlexBrand

26 votes, 1 answer
ScrapyRT vs Scrapyd
We've been using the Scrapyd service for a while up until now. It provides a nice wrapper around a scrapy project and its spiders, letting you control the spiders via an HTTP API: Scrapyd is a service for running Scrapy spiders. It allows you to deploy…
asked by alecxe