Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire website for you
  • Designed with extensibility in mind: it provides several mechanisms to plug in new code without touching the framework core (see the pipeline sketch after this list)
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.
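
As one example of that extensibility, here is a minimal sketch of a custom item pipeline; the module and class names are illustrative, not part of Scrapy itself. A pipeline only needs to implement process_item(), which receives every item the spiders scrape:

# pipelines.py (illustrative module name)
from scrapy.exceptions import DropItem


class RequireTextPipeline:
    """Drop any item that is missing a 'text' field (a hypothetical rule)."""

    def process_item(self, item, spider):
        if not item.get('text'):
            # Raising DropItem removes the item from further processing.
            raise DropItem('missing text field')
        return item

It would then be enabled in the project's settings.py, for example with ITEM_PIPELINES = {'myproject.pipelines.RequireTextPipeline': 300}, where myproject is a placeholder for your project package and the number sets the pipeline's position in the chain.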

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
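
Either way, you can verify the installation by printing the installed version; with -v, Scrapy also reports the versions of key dependencies such as Twisted and lxml:

scrapy version -v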

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract the text and author of every quote on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "next page" link, if there is one, and parse it too.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
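
If you prefer to start the spider from your own Python script instead of the scrapy command, a minimal sketch using CrawlerProcess looks like the following; it assumes a recent Scrapy version (2.1+) where the FEEDS setting is available, and that QuotesSpider is importable from quotes_spider.py:

from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider  # the spider defined above

process = CrawlerProcess(settings={
    # Export scraped items to quotes.json, as the -o flag did above.
    'FEEDS': {'quotes.json': {'format': 'json'}},
})
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes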


Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, the Spiders, the Scheduler, and the Downloader; downloader and spider middlewares provide hooks into the data flow between them. That data flow is described in detail in the official documentation.
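
To make those hook points concrete, here is a minimal sketch of a downloader middleware, custom code that sits in the data flow between the Engine and the Downloader; the module name, class name, and header value are illustrative:

# middlewares.py (illustrative module name)
class CustomHeaderMiddleware:
    """Stamp every outgoing request before the Downloader fetches it."""

    def process_request(self, request, spider):
        # Returning None lets the request continue through the middleware chain.
        request.headers.setdefault('X-Example-Header', 'demo')
        return None

It would be enabled through the DOWNLOADER_MIDDLEWARES setting, e.g. {'myproject.middlewares.CustomHeaderMiddleware': 543}, where the number controls its position in the chain.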


Online resources:

  • Official website: https://scrapy.org
  • Documentation: https://docs.scrapy.org
  • Source code: https://github.com/scrapy/scrapy

15666 questions
31 votes, 1 answer
python No module named service_identity
I tried to update scrapy and when I tried to check the version I got the following error C:\Windows\system32>scrapy version -v :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named…
asked by Marco Dinatsoli

31 votes, 4 answers
How to access scrapy settings from item Pipeline
How do I access the scrapy settings in settings.py from the item pipeline? The documentation mentions it can be accessed through the crawler in extensions, but I don't see how to access the crawler in the pipelines.
asked by avaleske

30 votes, 2 answers
CrawlerProcess vs CrawlerRunner
Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script: using CrawlerProcess or using CrawlerRunner. What is the difference between the two? When should I use "process" and when "runner"?
asked by alecxe

30 votes, 3 answers
unknown command: crawl error
I am a newbie to python. I am running python 2.7.3 version 32 bit on 64 bit OS. (I tried 64 bit but it didn't work out). I followed the tutorial and installed scrapy on my machine. I have created one project, demoz. But when I enter scrapy crawl…
asked by Nits

29 votes, 3 answers
Send Post Request in Scrapy
I am trying to crawl the latest reviews from the Google Play store, and to get that I need to make a POST request. With Postman it works and I get the desired response, but a POST request in the terminal gives me a server error. For ex: this page…
asked by Amit Tripathi

28 votes, 3 answers
Scrapy: how to disable or change log?
I've followed the official tutorial of Scrapy, it's wonderful! I'd like to remove all of the DEBUG messages from console output. Is there a way? 2013-06-08 14:51:48+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6029 2013-06-08 14:51:48+0000…
asked by realtebo

28 votes, 7 answers
scraping the file with html saved in local system
For example, I had a site "www.example.com". I want to scrape the html of this site by saving it to my local system, so for testing I saved that page on my desktop as example.html. Now I have written the spider code for this as below: class…
asked by Shiva Krishna Bavandla

27 votes, 6 answers
Scrapy - Reactor not Restartable
with: from twisted.internet import reactor from scrapy.crawler import CrawlerProcess I've always run this process successfully: process = CrawlerProcess(get_project_settings()) process.crawl(*args) # the script will block here until the crawling is…
asked by 8-Bit Borges

27 votes, 2 answers
Is it possible to pass a variable from start_requests() to parse() for each individual request?
I'm using a loop to generate my requests inside start_requests() and I'd like to pass the index to parse() so it can store it in the item. However, when I use self.i the output has the max value of i (from the last loop turn) for every item. I can use…
asked by ChiseledAbs

27 votes, 4 answers
Passing an argument to a callback function
def parse(self, response): for sel in response.xpath('//tbody/tr'): item = HeroItem() item['hclass'] = response.request.url.split("/")[8].split('-')[-1] item['server'] = response.request.url.split('/')[2].split('.')[0] …
asked by vic

27 votes, 7 answers
Missing scheme in request URL
I've been stuck on this bug for a while; the error message is as follows: File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\http\request\__init__.py", line 61, in _set_url raise ValueError('Missing scheme in…
asked by Toby

27 votes, 2 answers
How to disable or change the path of ghostdriver.log?
The question is straightforward, but some context may help. I'm trying to deploy scrapy while using selenium and phantomjs as downloader, but the problem is that it keeps saying permission denied when trying to deploy. So I want to change the path of…
asked by Sam Stoelinga

27 votes, 8 answers
suppress Scrapy Item printed in logs after pipeline
I have a scrapy project where the item that ultimately enters my pipeline is relatively large and stores lots of metadata and content. Everything is working properly in my spider and pipelines. The logs, however, are printing out the entire scrapy…
asked by dino

27 votes, 3 answers
scrapy - parsing items that are paginated
I have a url of the form: example.com/foo/bar/page_1.html. There are a total of 53 pages, each one of them has ~20 rows. I basically want to get all the rows from all the pages, i.e. ~53*20 items. I have working code in my parse method that parses…
asked by AlexBrand

26 votes, 1 answer
ScrapyRT vs Scrapyd
We've been using the Scrapyd service for a while up until now. It provides a nice wrapper around a scrapy project and its spiders, letting you control the spiders via an HTTP API: Scrapyd is a service for running Scrapy spiders. It allows you to deploy…
asked by alecxe