Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules for extracting the data from web pages; Scrapy crawls the entire website for you (see the CrawlSpider sketch after this list)
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD
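
To make the rule-based crawling concrete, here is a minimal CrawlSpider sketch; the spider name, site, and selectors are illustrative assumptions for this example (books.toscrape.com is a public scraping sandbox), not something taken from the tag description:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BookSpider(CrawlSpider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    rules = (
        # Follow pagination links and run parse_page on each followed page.
        # (With CrawlSpider, the start page itself is only used for link
        # extraction unless parse_start_url is overridden.)
        Rule(LinkExtractor(restrict_css='li.next'),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Yield one item per book title found on the page.
        for title in response.css('article.product_pod h3 a::attr(title)').getall():
            yield {'title': title}

The Rule objects are the only crawl logic you write; Scrapy's scheduler handles request ordering and duplicate filtering for you.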

History:

Scrapy was born at Mydeco, a London-based web-aggregation and e-commerce company, where it was developed and maintained by employees of Mydeco and Insophia, a web-consulting company based in Montevideo, Uruguay. The first public release came in August 2008 under the BSD license; in 2011, Zyte (then called Scrapinghub) became the new official maintainer, and the milestone 1.0 release followed in June 2015.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
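
Either way, you can verify the installation by asking Scrapy to print its version:

scrapy version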

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract the text and author of every quote on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "next page" link, if any, and parse it the same way.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
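
The -o flag serializes every item the spider yields; since parse yields dicts with text and author keys, quotes.json ends up as a JSON array shaped like this (values elided):

[
  {"text": "…", "author": "…"},
  …
]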


Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader; the data flow between them is described in detail in the official documentation at https://docs.scrapy.org/en/latest/topics/architecture.html.
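
As a small illustration of how new code plugs into that flow, here is a sketch of a custom downloader middleware that sits between the Engine and the Downloader; the class name, header, and settings path are assumptions made up for this example:

class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Called by the Engine for each request on its way to the Downloader.
        request.headers.setdefault('X-Example', 'demo')
        return None  # returning None lets the request continue as normal

Enabling it is a one-line settings change; the number orders it among the other middlewares ('myproject' stands in for your project's package):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomHeaderMiddleware': 543,
}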


Online resources:

  • Official website: https://scrapy.org
  • Documentation: https://docs.scrapy.org
  • Source code: https://github.com/scrapy/scrapy

15666 questions
Scrapy Python Set up User Agent (38 votes, 3 answers)
I tried to override the user-agent of my crawlspider by adding an extra line to the project configuration file. Here is the code: [settings] default = myproject.settings USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML,…
Asked by B.Mr.W.

scrapy text encoding (36 votes, 7 answers)
Here is my spider from scrapy.contrib.spiders import CrawlSpider,Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.selector import HtmlXPathSelector from vrisko.items import VriskoItem class…
Asked by mindcast

Running Scrapy spiders in a Celery task (36 votes, 2 answers)
I have a Django site where a scrape happens when a user requests it, and my code kicks off a Scrapy spider standalone script in a new process. Naturally, this isn't working with an increase of users. Something like this: class…
Asked by stryderjzw

Scrapy: Follow link to get additional Item data? (35 votes, 3 answers)
I don't have a specific code issue I'm just not sure how to approach the following problem logistically with the Scrapy framework: The structure of the data I want to scrape is typically a table row for each item. Straightforward enough,…
Asked by dru

Force my scrapy spider to stop crawling (35 votes, 4 answers)
is there a chance to stop crawling when specific if condition is true (like scrap_item_id == predefine_value ). My problem is similar to Scrapy - how to identify already scraped urls but I want to 'force' my scrapy spider to stop crawling after…
Asked by no1

Access Django models with scrapy: defining path to Django project (35 votes, 2 answers)
I'm very new to Python and Django. I'm currently exploring using Scrapy to scrape sites and save data to the Django database. My goal is to run a spider based on domain given by a user. I've written a spider that extracts the data I need and store…
Asked by Splurk

How to give URL to scrapy for crawling? (34 votes, 6 answers)
I want to use scrapy for crawling web pages. Is there a way to pass the start URL from the terminal itself? It is given in the documentation that either the name of the spider or the URL can be given, but when i given the url it throws an…
Asked by G Gill

Scrapy css selector: get text of all inner tags (34 votes, 2 answers)
I have a tag and I want to get all the text inside available. I am doing this: response.css('mytag::text') But it is only getting the text of the current tag, I also want to get the text from all the inner tags. I know I could do something…
Asked by Jgaldos

Crawling with an authenticated session in Scrapy (33 votes, 4 answers)
In my previous question, I wasn't very specific over my problem (scraping with an authenticated session with Scrapy), in the hopes of being able to deduce the solution from a more general answer. I should probably rather have used the word…
Asked by Herman Schaaf

Access django models inside of Scrapy (33 votes, 8 answers)
Is it possible to access my django models inside of a Scrapy pipeline, so that I can save my scraped data straight to my model? I've seen this, but I don't really get how to set it up?
Asked by imns

Scrapy throws ImportError: cannot import name xmlrpc_client (33 votes, 6 answers)
After install Scrapy via pip, and having Python 2.7.10: scrapy Traceback (most recent call last): File "/usr/local/bin/scrapy", line 7, in from scrapy.cmdline import execute File "/Library/Python/2.7/site-packages/scrapy/__init__.py", line…
Asked by ilopezluna

Run a Scrapy spider in a Celery Task (33 votes, 4 answers)
This is not working anymore, scrapy's API has changed. Now the documentation feature a way to "Run Scrapy from a script" but I get the ReactorNotRestartable error. My task: from celery import Task from twisted.internet import reactor from…
Asked by Juan Riaza

Best way for a beginner to learn screen scraping by Python (32 votes, 6 answers)
This might be one of those questions that are difficult to answer, but here goes: I don't consider my self programmer - but I would like to :-) I've learned R, because I was sick and tired of spss, and because a friend introduced me to the language…
Asked by Andreas

ReactorNotRestartable error in while loop with scrapy (32 votes, 7 answers)
I get twisted.internet.error.ReactorNotRestartable error when I execute following code: from time import sleep from scrapy import signals from scrapy.crawler import CrawlerProcess from scrapy.utils.project import get_project_settings from…
Asked by k_wit

Passing arguments to process.crawl in Scrapy python (32 votes, 3 answers)
I would like to get the same result as this command line : scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json My script is as follows : import scrapy from linkedin_anonymous_spider import LinkedInAnonymousSpider from…
Asked by yusuf