Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules for extracting the data from web pages; Scrapy crawls the entire website for you (see the CrawlSpider sketch after this list)
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD
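
To make the rule-based crawling concrete, here is a minimal CrawlSpider sketch; the spider name, site, and selectors are illustrative assumptions for this example (books.toscrape.com is a public scraping sandbox), not something taken from the tag description:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BookSpider(CrawlSpider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    rules = (
        # Follow pagination links and run parse_page on each followed page.
        # (With CrawlSpider, the start page itself is only used for link
        # extraction unless parse_start_url is overridden.)
        Rule(LinkExtractor(restrict_css='li.next'),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Yield one item per book title found on the page.
        for title in response.css('article.product_pod h3 a::attr(title)').getall():
            yield {'title': title}

The Rule objects are the only crawl logic you write; Scrapy's scheduler handles request ordering and duplicate filtering for you.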

History:

Scrapy was born at Mydeco, a London-based web-aggregation and e-commerce company, where it was developed and maintained by employees of Mydeco and Insophia, a web-consulting company based in Montevideo, Uruguay. The first public release came in August 2008 under the BSD license; in 2011, Zyte (then called Scrapinghub) became the new official maintainer, and the milestone 1.0 release followed in June 2015.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
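
Either way, you can verify the installation by asking Scrapy to print its version:

scrapy version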

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract the text and author of every quote on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "next page" link, if any, and parse it the same way.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
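
The -o flag serializes every item the spider yields; since parse yields dicts with text and author keys, quotes.json ends up as a JSON array shaped like this (values elided):

[
  {"text": "…", "author": "…"},
  …
]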


Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader; the data flow between them is described in detail in the official documentation at https://docs.scrapy.org/en/latest/topics/architecture.html.
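
As a small illustration of how new code plugs into that flow, here is a sketch of a custom downloader middleware that sits between the Engine and the Downloader; the class name, header, and settings path are assumptions made up for this example:

class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Called by the Engine for each request on its way to the Downloader.
        request.headers.setdefault('X-Example', 'demo')
        return None  # returning None lets the request continue as normal

Enabling it is a one-line settings change; the number orders it among the other middlewares ('myproject' stands in for your project's package):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomHeaderMiddleware': 543,
}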


Online resources:

  • Official website: https://scrapy.org
  • Documentation: https://docs.scrapy.org
  • Source code: https://github.com/scrapy/scrapy

15666 questions
Scrapy Python Set up User Agent (38 votes, 3 answers)
I tried to override the user-agent of my crawlspider by adding an extra line to the project configuration file. Here is the code: [settings] default = myproject.settings USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML,…
Asked by B.Mr.W.

scrapy text encoding (36 votes, 7 answers)
Here is my spider from scrapy.contrib.spiders import CrawlSpider,Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.selector import HtmlXPathSelector from vrisko.items import VriskoItem class…
Asked by mindcast

Running Scrapy spiders in a Celery task (36 votes, 2 answers)
I have a Django site where a scrape happens when a user requests it, and my code kicks off a Scrapy spider standalone script in a new process. Naturally, this isn't working with an increase of users. Something like this: class…
Asked by stryderjzw

Scrapy: Follow link to get additional Item data? (35 votes, 3 answers)
I don't have a specific code issue I'm just not sure how to approach the following problem logistically with the Scrapy framework: The structure of the data I want to scrape is typically a table row for each item. Straightforward enough,…
Asked by dru

Force my scrapy spider to stop crawling (35 votes, 4 answers)
is there a chance to stop crawling when specific if condition is true (like scrap_item_id == predefine_value ). My problem is similar to Scrapy - how to identify already scraped urls but I want to 'force' my scrapy spider to stop crawling after…
Asked by no1

Access Django models with scrapy: defining path to Django project (35 votes, 2 answers)
I'm very new to Python and Django. I'm currently exploring using Scrapy to scrape sites and save data to the Django database. My goal is to run a spider based on domain given by a user. I've written a spider that extracts the data I need and store…
Asked by Splurk

How to give URL to scrapy for crawling? (34 votes, 6 answers)
I want to use scrapy for crawling web pages. Is there a way to pass the start URL from the terminal itself? It is given in the documentation that either the name of the spider or the URL can be given, but when i given the url it throws an…
Asked by G Gill

Scrapy css selector: get text of all inner tags (34 votes, 2 answers)
I have a tag and I want to get all the text inside available. I am doing this: response.css('mytag::text') But it is only getting the text of the current tag, I also want to get the text from all the inner tags. I know I could do something…
Asked by Jgaldos

Crawling with an authenticated session in Scrapy (33 votes, 4 answers)
In my previous question, I wasn't very specific over my problem (scraping with an authenticated session with Scrapy), in the hopes of being able to deduce the solution from a more general answer. I should probably rather have used the word…
Asked by Herman Schaaf

Access django models inside of Scrapy (33 votes, 8 answers)
Is it possible to access my django models inside of a Scrapy pipeline, so that I can save my scraped data straight to my model? I've seen this, but I don't really get how to set it up?
Asked by imns

Scrapy throws ImportError: cannot import name xmlrpc_client (33 votes, 6 answers)
After install Scrapy via pip, and having Python 2.7.10: scrapy Traceback (most recent call last): File "/usr/local/bin/scrapy", line 7, in from scrapy.cmdline import execute File "/Library/Python/2.7/site-packages/scrapy/__init__.py", line…
Asked by ilopezluna

Run a Scrapy spider in a Celery Task (33 votes, 4 answers)
This is not working anymore, scrapy's API has changed. Now the documentation feature a way to "Run Scrapy from a script" but I get the ReactorNotRestartable error. My task: from celery import Task from twisted.internet import reactor from…
Asked by Juan Riaza

Best way for a beginner to learn screen scraping by Python (32 votes, 6 answers)
This might be one of those questions that are difficult to answer, but here goes: I don't consider my self programmer - but I would like to :-) I've learned R, because I was sick and tired of spss, and because a friend introduced me to the language…
Asked by Andreas

ReactorNotRestartable error in while loop with scrapy (32 votes, 7 answers)
I get twisted.internet.error.ReactorNotRestartable error when I execute following code: from time import sleep from scrapy import signals from scrapy.crawler import CrawlerProcess from scrapy.utils.project import get_project_settings from…
Asked by k_wit

Passing arguments to process.crawl in Scrapy python (32 votes, 3 answers)
I would like to get the same result as this command line : scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json My script is as follows : import scrapy from linkedin_anonymous_spider import LinkedInAnonymousSpider from…
Asked by yusuf