Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire website for you
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD

History:

Scrapy was born at the London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release came in August 2008 under the BSD license, and the milestone 1.0 release followed in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

Alternatively, to install Scrapy using conda, run:

conda install -c conda-forge scrapy
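
Either way, you can verify the installation from the command line; this prints the installed Scrapy version:

scrapy version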

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
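
Note that -o appends to quotes.json across runs; since Scrapy 2.0 there is also -O, which overwrites the file instead:

scrapy runspider quotes_spider.py -O quotes.json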



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, the Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
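
As a rough illustration of how new code plugs into that flow, here is a minimal downloader middleware sketch (the class name and header are hypothetical); it sits between the Engine and the Downloader and sees every request and response:

class CustomHeaderMiddleware:
    """Hypothetical middleware: tags every outgoing request with a header."""

    def process_request(self, request, spider):
        # Called for each request flowing from the Engine to the Downloader.
        # Returning None lets normal processing continue.
        request.headers.setdefault('X-Crawl-Source', spider.name)

    def process_response(self, request, response, spider):
        # Called for each response flowing back towards the Engine.
        return response

It would be enabled through the DOWNLOADER_MIDDLEWARES setting, e.g. {'myproject.middlewares.CustomHeaderMiddleware': 543}.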



15,666 questions
26 votes, 3 answers

How to bypass cloudflare bot/ddos protection in Scrapy?

I used to scrape an e-commerce webpage occasionally to get product price information. I had not used the scraper built with Scrapy in a while, and when I tried to use it yesterday I ran into a problem with bot protection. It is using Cloudflare’s…
— Kulbi

26 votes, 3 answers

scrapy: convert html string to HtmlResponse object

I have a raw HTML string that I want to convert to a Scrapy HtmlResponse object so that I can use the css and xpath selectors, similar to Scrapy's response. How can I do it?
— yayu

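A common approach (a sketch, not necessarily the accepted answer) is to wrap the string in scrapy.http.HtmlResponse; the url argument is required and may be a placeholder, and the sample HTML here is made up:

from scrapy.http import HtmlResponse

raw_html = '<html><body><span class="text">hello</span></body></html>'
# Wrap the raw string so .css() and .xpath() work as on a crawled response.
response = HtmlResponse(url='http://example.com', body=raw_html, encoding='utf-8')
print(response.css('span.text::text').get())  # -> 'hello'
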
25 votes, 2 answers

How to generate the start_urls dynamically in crawling?

I am crawling a site which may contain a lot of start_urls, like http://www.a.com/list_1_2_3.htm. I want to populate start_urls matching [list_\d+_\d+_\d+\.htm] and extract items from URLs matching [node_\d+\.htm] during crawling. Can I use CrawlSpider…
— user1215269

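One way to do this (a sketch; the URL range is invented) is to override start_requests() and filter followed links with a regex:

import re
import scrapy

class ListSpider(scrapy.Spider):
    name = 'list'

    def start_requests(self):
        # Build the start URLs dynamically instead of hard-coding start_urls.
        # The range is invented; generate whichever list_*.htm pages exist.
        for i in range(1, 4):
            yield scrapy.Request(f'http://www.a.com/list_1_2_{i}.htm')

    def parse(self, response):
        # Follow only links matching node_\d+\.htm.
        for href in response.css('a::attr(href)').getall():
            if re.search(r'node_\d+\.htm', href):
                yield response.follow(href, callback=self.parse_node)

    def parse_node(self, response):
        yield {'url': response.url}
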
25 votes, 10 answers

Scrapy Crawl URLs in Order

So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below. from scrapy.spider import BaseSpider from scrapy.selector import…
— Jeff

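A sketch of one common workaround: give each request a descending priority and crawl one request at a time, so responses arrive in the order the URLs are listed (the example URLs are placeholders):

import scrapy

class OrderedSpider(scrapy.Spider):
    name = 'ordered'
    # One request in flight at a time, so priority fully controls order.
    custom_settings = {'CONCURRENT_REQUESTS': 1}
    urls = [
        'http://example.com/a',  # placeholders for the sites to visit in order
        'http://example.com/b',
        'http://example.com/c',
    ]

    def start_requests(self):
        for i, url in enumerate(self.urls):
            # Higher priority is scheduled first, so earlier URLs win.
            yield scrapy.Request(url, priority=len(self.urls) - i)

    def parse(self, response):
        yield {'url': response.url}
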
25 votes, 5 answers

pyconfig.h missing during "pip install cryptography"

I want to set up Scrapy Cluster following this link: scrapy-cluster. Everything is OK until I run this command: pip install -r requirements.txt. The requirements.txt looks…
— FancyXun

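pyconfig.h ships with the Python development headers, so the usual fix is installing them before retrying (Debian/Ubuntu package names shown; they vary by distro and Python version):

sudo apt-get install python-dev libssl-dev libffi-dev
pip install -r requirements.txt
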
25 votes, 2 answers

How to force scrapy to crawl duplicate url?

I am learning Scrapy, a web-crawling framework. By default it does not crawl duplicate URLs or URLs which Scrapy has already crawled. How do I make Scrapy crawl duplicate URLs or URLs it has already crawled? I tried to find out on the internet…
— Alok Singh Mahor

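The duplicate filter can be bypassed per request with dont_filter=True; a minimal sketch (the URL is a placeholder):

import scrapy

class RepeatSpider(scrapy.Spider):
    name = 'repeat'

    def start_requests(self):
        # dont_filter=True bypasses the scheduler's duplicate filter,
        # so the same URL is crawled every time it is requested.
        for _ in range(2):
            yield scrapy.Request('http://example.com', dont_filter=True)

    def parse(self, response):
        yield {'url': response.url}
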
25 votes, 5 answers

Scrapy - Silently drop an item

I am using Scrapy to crawl several websites, which may share redundant information. For each page I scrape, I store the URL of the page, its title, and its HTML code in MongoDB. I want to avoid duplication in the database, so I implement a pipeline…
— Balthazar Rouberol

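Dropping is done by raising DropItem from an item pipeline; note that Scrapy logs a warning for each dropped item by default, so making it truly silent typically also involves customizing the log formatter. A sketch of the pipeline side (the 'url' field is assumed):

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    """Sketch: drop items whose 'url' field was already seen."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        if item['url'] in self.seen_urls:
            # Stops the item here; later pipelines never see it.
            raise DropItem(f"Duplicate url: {item['url']}")
        self.seen_urls.add(item['url'])
        return item
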
25 votes, 4 answers

How to setup and launch a Scrapy spider programmatically (urls and settings)

I've written a working crawler using Scrapy; now I want to control it through a Django webapp, that is to say: set one or several start_urls, set one or several allowed_domains, set settings values, start the spider, stop/pause/resume a…
— arno

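Scrapy can be launched from Python with scrapy.crawler.CrawlerProcess, which accepts settings and spider arguments; a minimal sketch (it assumes the quotes_spider.py module from the example above):

from scrapy.crawler import CrawlerProcess
from quotes_spider import QuotesSpider  # the spider file from the example above

process = CrawlerProcess(settings={
    'FEEDS': {'quotes.json': {'format': 'json'}},  # feed exports (Scrapy 2.1+)
})
process.crawl(QuotesSpider)  # extra keyword arguments become spider attributes
process.start()              # blocks here until the crawl finishes
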
25 votes, 2 answers

how to implement nested item in scrapy?

I am scraping some data with complex hierarchical info and need to export the result to JSON. I defined the items as class FamilyItem(): name = Field() sons = Field() class SonsItem(): name = Field() grandsons = Field() class…
— Shadow Lau

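Item fields can hold other items or plain dicts, and these serialize to nested JSON on export; a sketch using the question's naming (sample values invented):

import scrapy

class SonItem(scrapy.Item):
    name = scrapy.Field()
    grandsons = scrapy.Field()

class FamilyItem(scrapy.Item):
    name = scrapy.Field()
    sons = scrapy.Field()

# Fields can hold lists of other items (or plain dicts):
family = FamilyItem(name='Smith')
family['sons'] = [SonItem(name='John', grandsons=[{'name': 'Jim'}])]
# Yielded from a spider, this exports as nested JSON with -o output.json.
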
24 votes, 2 answers

How do I use the Python Scrapy module to list all the URLs from my website?

I want to use the Python Scrapy module to scrape all the URLs from my website and write the list to a file. I looked in the examples but didn't see any simple example to do this.
— Adam F

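A CrawlSpider with a catch-all LinkExtractor rule is one straightforward way (the domain is a placeholder):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class UrlListSpider(CrawlSpider):
    name = 'urllist'
    allowed_domains = ['example.com']      # placeholder: your own domain
    start_urls = ['http://example.com/']
    # An unrestricted LinkExtractor follows every internal link.
    rules = (Rule(LinkExtractor(), callback='parse_url', follow=True),)

    def parse_url(self, response):
        yield {'url': response.url}

Running it with scrapy runspider urllist_spider.py -o urls.csv writes the list to a file.
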
24 votes, 2 answers

Scrapy - parse a page to extract items - then follow and store item url contents

I have a question on how to do this in Scrapy. I have a spider that crawls listing pages of items. Every time a listing page with items is found, the parse_item() callback is called to extract the item data and yield…
— StefanH

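The usual pattern is to build a partial item on the listing page and hand it to the follow-up request; cb_kwargs (Scrapy 1.7+) is the modern way to do that. A sketch with invented URLs and selectors:

import scrapy

class ListingSpider(scrapy.Spider):
    name = 'listing'
    start_urls = ['http://example.com/listing']  # placeholder

    def parse(self, response):
        for row in response.css('div.item'):     # invented selector
            item = {
                'title': row.css('a::text').get(),
                'url': row.css('a::attr(href)').get(),
            }
            # Pass the partial item on to the detail-page callback.
            yield response.follow(item['url'], callback=self.parse_detail,
                                  cb_kwargs={'item': item})

    def parse_detail(self, response, item):
        item['html'] = response.text             # store the page contents
        yield item
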
24 votes, 3 answers

ImportError: No module named win32api while using Scrapy

I am new to Scrapy. I installed Python 2.7 and everything else needed. Then I tried to build a Scrapy project following the tutorial at http://doc.scrapy.org/en/latest/intro/tutorial.html. In the crawling step, after I typed scrapy crawl…
— 李皓伟

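The win32api module comes from the pywin32 package, which is not pulled in automatically on Windows; installing it is the standard fix:

pip install pywin32
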
24 votes, 3 answers

Geopy: catch timeout error

I am using geopy to geocode some addresses and I want to catch the timeout errors and print them out so I can do some quality control on the input. I am putting the geocode request in a try/catch but it's not working. Any ideas on what I need to do?…
— MoreScratch

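geopy exposes the timeout as geopy.exc.GeocoderTimedOut, so that is the exception to catch; a sketch with Nominatim (the user_agent string is arbitrary):

from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

geolocator = Nominatim(user_agent='my-geocoder')  # any descriptive string

def geocode(address):
    try:
        return geolocator.geocode(address, timeout=5)
    except GeocoderTimedOut:
        # Record the failing input for quality control instead of crashing.
        print(f'timed out: {address}')
        return None
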
24 votes, 2 answers

Can Scrapy be replaced by pyspider?

I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered that there is another framework/system called pyspider, which, according to its GitHub page, is fresh, actively developed, and popular. pyspider's home…
— alecxe

24 votes, 2 answers

How to use CrawlSpider from scrapy to click a link with javascript onclick?

I want Scrapy to crawl pages where going on to the next link looks like this: Next. Will Scrapy be able to interpret the JavaScript code of that? With the LiveHTTPHeaders extension I found out that clicking…
— ria

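Scrapy does not execute JavaScript, so the usual workarounds are extracting the target URL from the onclick attribute with a regex, or rendering the page with a headless-browser integration. A regex sketch (the onclick format shown is hypothetical):

import re
import scrapy

class OnclickSpider(scrapy.Spider):
    name = 'onclick'
    start_urls = ['http://example.com/']  # placeholder

    def parse(self, response):
        for onclick in response.css('a::attr(onclick)').getall():
            # Hypothetical format: onclick="window.location='/page/2'"
            m = re.search(r"window\.location\s*=\s*'([^']+)'", onclick)
            if m:
                yield response.follow(m.group(1), callback=self.parse)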