Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire website for you
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD

History:

Scrapy was born at the London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release came in August 2008 under the BSD license, and the milestone 1.0 release followed in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

Alternatively, to install Scrapy using conda, run:

conda install -c conda-forge scrapy
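
Either way, you can verify the installation from the command line; this prints the installed Scrapy version:

scrapy version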

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
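
Note that -o appends to quotes.json across runs; since Scrapy 2.0 there is also -O, which overwrites the file instead:

scrapy runspider quotes_spider.py -O quotes.json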



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, the Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
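
As a rough illustration of how new code plugs into that flow, here is a minimal downloader middleware sketch (the class name and header are hypothetical); it sits between the Engine and the Downloader and sees every request and response:

class CustomHeaderMiddleware:
    """Hypothetical middleware: tags every outgoing request with a header."""

    def process_request(self, request, spider):
        # Called for each request flowing from the Engine to the Downloader.
        # Returning None lets normal processing continue.
        request.headers.setdefault('X-Crawl-Source', spider.name)

    def process_response(self, request, response, spider):
        # Called for each response flowing back towards the Engine.
        return response

It would be enabled through the DOWNLOADER_MIDDLEWARES setting, e.g. {'myproject.middlewares.CustomHeaderMiddleware': 543}.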



15,666 questions
26 votes, 3 answers

How to bypass cloudflare bot/ddos protection in Scrapy?

I used to scrape an e-commerce webpage occasionally to get product price information. I had not used the scraper built with Scrapy in a while, and when I tried to use it yesterday I ran into a problem with bot protection. It is using Cloudflare’s…
— Kulbi

26 votes, 3 answers

scrapy: convert html string to HtmlResponse object

I have a raw HTML string that I want to convert to a Scrapy HtmlResponse object so that I can use the css and xpath selectors, similar to Scrapy's response. How can I do it?
— yayu

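A common approach (a sketch, not necessarily the accepted answer) is to wrap the string in scrapy.http.HtmlResponse; the url argument is required and may be a placeholder, and the sample HTML here is made up:

from scrapy.http import HtmlResponse

raw_html = '<html><body><span class="text">hello</span></body></html>'
# Wrap the raw string so .css() and .xpath() work as on a crawled response.
response = HtmlResponse(url='http://example.com', body=raw_html, encoding='utf-8')
print(response.css('span.text::text').get())  # -> 'hello'
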
25 votes, 2 answers

How to generate the start_urls dynamically in crawling?

I am crawling a site which may contain a lot of start_urls, like http://www.a.com/list_1_2_3.htm. I want to populate start_urls matching [list_\d+_\d+_\d+\.htm] and extract items from URLs matching [node_\d+\.htm] during crawling. Can I use CrawlSpider…
— user1215269

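One way to do this (a sketch; the URL range is invented) is to override start_requests() and filter followed links with a regex:

import re
import scrapy

class ListSpider(scrapy.Spider):
    name = 'list'

    def start_requests(self):
        # Build the start URLs dynamically instead of hard-coding start_urls.
        # The range is invented; generate whichever list_*.htm pages exist.
        for i in range(1, 4):
            yield scrapy.Request(f'http://www.a.com/list_1_2_{i}.htm')

    def parse(self, response):
        # Follow only links matching node_\d+\.htm.
        for href in response.css('a::attr(href)').getall():
            if re.search(r'node_\d+\.htm', href):
                yield response.follow(href, callback=self.parse_node)

    def parse_node(self, response):
        yield {'url': response.url}
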
25 votes, 10 answers

Scrapy Crawl URLs in Order

So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below. from scrapy.spider import BaseSpider from scrapy.selector import…
— Jeff

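A sketch of one common workaround: give each request a descending priority and crawl one request at a time, so responses arrive in the order the URLs are listed (the example URLs are placeholders):

import scrapy

class OrderedSpider(scrapy.Spider):
    name = 'ordered'
    # One request in flight at a time, so priority fully controls order.
    custom_settings = {'CONCURRENT_REQUESTS': 1}
    urls = [
        'http://example.com/a',  # placeholders for the sites to visit in order
        'http://example.com/b',
        'http://example.com/c',
    ]

    def start_requests(self):
        for i, url in enumerate(self.urls):
            # Higher priority is scheduled first, so earlier URLs win.
            yield scrapy.Request(url, priority=len(self.urls) - i)

    def parse(self, response):
        yield {'url': response.url}
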
25 votes, 5 answers

pyconfig.h missing during "pip install cryptography"

I want to set up Scrapy Cluster following this link: scrapy-cluster. Everything is OK until I run this command: pip install -r requirements.txt. The requirements.txt looks…
— FancyXun

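pyconfig.h ships with the Python development headers, so the usual fix is installing them before retrying (Debian/Ubuntu package names shown; they vary by distro and Python version):

sudo apt-get install python-dev libssl-dev libffi-dev
pip install -r requirements.txt
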
25 votes, 2 answers

How to force scrapy to crawl duplicate url?

I am learning Scrapy, a web-crawling framework. By default it does not crawl duplicate URLs or URLs which Scrapy has already crawled. How do I make Scrapy crawl duplicate URLs or URLs it has already crawled? I tried to find out on the internet…
— Alok Singh Mahor

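The duplicate filter can be bypassed per request with dont_filter=True; a minimal sketch (the URL is a placeholder):

import scrapy

class RepeatSpider(scrapy.Spider):
    name = 'repeat'

    def start_requests(self):
        # dont_filter=True bypasses the scheduler's duplicate filter,
        # so the same URL is crawled every time it is requested.
        for _ in range(2):
            yield scrapy.Request('http://example.com', dont_filter=True)

    def parse(self, response):
        yield {'url': response.url}
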
25 votes, 5 answers

Scrapy - Silently drop an item

I am using Scrapy to crawl several websites, which may share redundant information. For each page I scrape, I store the URL of the page, its title, and its HTML code in MongoDB. I want to avoid duplication in the database, so I implement a pipeline…
— Balthazar Rouberol

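Dropping is done by raising DropItem from an item pipeline; note that Scrapy logs a warning for each dropped item by default, so making it truly silent typically also involves customizing the log formatter. A sketch of the pipeline side (the 'url' field is assumed):

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    """Sketch: drop items whose 'url' field was already seen."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        if item['url'] in self.seen_urls:
            # Stops the item here; later pipelines never see it.
            raise DropItem(f"Duplicate url: {item['url']}")
        self.seen_urls.add(item['url'])
        return item
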
25 votes, 4 answers

How to setup and launch a Scrapy spider programmatically (urls and settings)

I've written a working crawler using Scrapy; now I want to control it through a Django webapp, that is to say: set one or several start_urls, set one or several allowed_domains, set settings values, start the spider, stop/pause/resume a…
— arno

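Scrapy can be launched from Python with scrapy.crawler.CrawlerProcess, which accepts settings and spider arguments; a minimal sketch (it assumes the quotes_spider.py module from the example above):

from scrapy.crawler import CrawlerProcess
from quotes_spider import QuotesSpider  # the spider file from the example above

process = CrawlerProcess(settings={
    'FEEDS': {'quotes.json': {'format': 'json'}},  # feed exports (Scrapy 2.1+)
})
process.crawl(QuotesSpider)  # extra keyword arguments become spider attributes
process.start()              # blocks here until the crawl finishes
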
25 votes, 2 answers

how to implement nested item in scrapy?

I am scraping some data with complex hierarchical info and need to export the result to JSON. I defined the items as class FamilyItem(): name = Field() sons = Field() class SonsItem(): name = Field() grandsons = Field() class…
— Shadow Lau

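Item fields can hold other items or plain dicts, and these serialize to nested JSON on export; a sketch using the question's naming (sample values invented):

import scrapy

class SonItem(scrapy.Item):
    name = scrapy.Field()
    grandsons = scrapy.Field()

class FamilyItem(scrapy.Item):
    name = scrapy.Field()
    sons = scrapy.Field()

# Fields can hold lists of other items (or plain dicts):
family = FamilyItem(name='Smith')
family['sons'] = [SonItem(name='John', grandsons=[{'name': 'Jim'}])]
# Yielded from a spider, this exports as nested JSON with -o output.json.
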
24 votes, 2 answers

How do I use the Python Scrapy module to list all the URLs from my website?

I want to use the Python Scrapy module to scrape all the URLs from my website and write the list to a file. I looked in the examples but didn't see any simple example to do this.
— Adam F

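A CrawlSpider with a catch-all LinkExtractor rule is one straightforward way (the domain is a placeholder):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class UrlListSpider(CrawlSpider):
    name = 'urllist'
    allowed_domains = ['example.com']      # placeholder: your own domain
    start_urls = ['http://example.com/']
    # An unrestricted LinkExtractor follows every internal link.
    rules = (Rule(LinkExtractor(), callback='parse_url', follow=True),)

    def parse_url(self, response):
        yield {'url': response.url}

Running it with scrapy runspider urllist_spider.py -o urls.csv writes the list to a file.
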
24 votes, 2 answers

Scrapy - parse a page to extract items - then follow and store item url contents

I have a question on how to do this in Scrapy. I have a spider that crawls listing pages of items. Every time a listing page with items is found, the parse_item() callback is called to extract the item data and yield…
— StefanH

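The usual pattern is to build a partial item on the listing page and hand it to the follow-up request; cb_kwargs (Scrapy 1.7+) is the modern way to do that. A sketch with invented URLs and selectors:

import scrapy

class ListingSpider(scrapy.Spider):
    name = 'listing'
    start_urls = ['http://example.com/listing']  # placeholder

    def parse(self, response):
        for row in response.css('div.item'):     # invented selector
            item = {
                'title': row.css('a::text').get(),
                'url': row.css('a::attr(href)').get(),
            }
            # Pass the partial item on to the detail-page callback.
            yield response.follow(item['url'], callback=self.parse_detail,
                                  cb_kwargs={'item': item})

    def parse_detail(self, response, item):
        item['html'] = response.text             # store the page contents
        yield item
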
24 votes, 3 answers

ImportError: No module named win32api while using Scrapy

I am new to Scrapy. I installed Python 2.7 and everything else needed. Then I tried to build a Scrapy project following the tutorial at http://doc.scrapy.org/en/latest/intro/tutorial.html. In the crawling step, after I typed scrapy crawl…
— 李皓伟

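The win32api module comes from the pywin32 package, which is not pulled in automatically on Windows; installing it is the standard fix:

pip install pywin32
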
24 votes, 3 answers

Geopy: catch timeout error

I am using geopy to geocode some addresses and I want to catch the timeout errors and print them out so I can do some quality control on the input. I am putting the geocode request in a try/catch but it's not working. Any ideas on what I need to do?…
— MoreScratch

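geopy exposes the timeout as geopy.exc.GeocoderTimedOut, so that is the exception to catch; a sketch with Nominatim (the user_agent string is arbitrary):

from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

geolocator = Nominatim(user_agent='my-geocoder')  # any descriptive string

def geocode(address):
    try:
        return geolocator.geocode(address, timeout=5)
    except GeocoderTimedOut:
        # Record the failing input for quality control instead of crashing.
        print(f'timed out: {address}')
        return None
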
24 votes, 2 answers

Can Scrapy be replaced by pyspider?

I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered that there is another framework/system called pyspider, which, according to its GitHub page, is fresh, actively developed, and popular. pyspider's home…
— alecxe

24 votes, 2 answers

How to use CrawlSpider from scrapy to click a link with javascript onclick?

I want Scrapy to crawl pages where going on to the next link looks like this: Next. Will Scrapy be able to interpret the JavaScript code of that? With the LiveHTTPHeaders extension I found out that clicking…
— ria

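Scrapy does not execute JavaScript, so the usual workarounds are extracting the target URL from the onclick attribute with a regex, or rendering the page with a headless-browser integration. A regex sketch (the onclick format shown is hypothetical):

import re
import scrapy

class OnclickSpider(scrapy.Spider):
    name = 'onclick'
    start_urls = ['http://example.com/']  # placeholder

    def parse(self, response):
        for onclick in response.css('a::attr(onclick)').getall():
            # Hypothetical format: onclick="window.location='/page/2'"
            m = re.search(r"window\.location\s*=\s*'([^']+)'", onclick)
            if m:
                yield response.follow(m.group(1), callback=self.parse)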