Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules for extracting data from web pages; Scrapy crawls the entire website for you
  • Designed with extensibility in mind: it provides several mechanisms for plugging in new code without touching the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, Mac, and BSD
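
One of the extension mechanisms is the item pipeline: a plain Python class with a process_item(self, item, spider) method that Scrapy calls for each scraped item. The following is a minimal sketch, assuming items are plain dicts with a 'text' field; the class name CleanTextPipeline and the module path in the settings comment are hypothetical.

```python
class CleanTextPipeline:
    """Hypothetical item pipeline that tidies the 'text' field of each item."""

    def process_item(self, item, spider):
        # Scrapy calls this method once per scraped item; whatever is
        # returned is passed on to the next enabled pipeline.
        text = item.get("text")
        if isinstance(text, str):
            # Collapse runs of whitespace and strip curly quotation marks.
            item["text"] = " ".join(text.split()).strip("\u201c\u201d")
        return item


# To enable it, a project's settings.py would list the class, for example:
# ITEM_PIPELINES = {"myproject.pipelines.CleanTextPipeline": 300}
# ("myproject.pipelines" is an assumed module path.)

if __name__ == "__main__":
    pipeline = CleanTextPipeline()
    print(pipeline.process_item({"text": "  \u201cHello,   world\u201d "}, spider=None))
```

Because the pipeline is an ordinary class, it can be exercised without starting a crawl, which keeps this kind of plug-in code easy to test.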

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

We can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

Or, to install Scrapy using conda, run:

conda install -c conda-forge scrapy
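
After either command, it can be useful to confirm that the package is visible to the Python interpreter you intend to use. The snippet below is one possible check using only the standard library (Python 3.8+); installed_version is a hypothetical helper name, not part of Scrapy.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("Scrapy")
    print(f"Scrapy {version}" if version else "Scrapy is not installed")
```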

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote on the page lives in its own <div class="quote">.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "Next" link, if any, and parse that page the same way.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
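
With a .json file name, the -o option writes the scraped items as a JSON array. As a rough sketch of how that output could be consumed afterwards, the helper below loads the file and groups quote texts by author. It uses only the standard library; quotes_by_author is a hypothetical name, and the code assumes each item has the 'text' and 'author' keys yielded by the spider above.

```python
import json
from collections import defaultdict


def quotes_by_author(path):
    """Load a Scrapy JSON export and map each author to their quote texts."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # the export is a JSON array of objects

    grouped = defaultdict(list)
    for item in items:
        grouped[item["author"]].append(item["text"])
    return dict(grouped)
```

Calling quotes_by_author("quotes.json") after the crawl would return a dictionary keyed by author name.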


Architecture

Scrapy consists of multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
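
To make the data flow more concrete, here is a toy, synchronous model of it in plain Python. It does not use Scrapy and is not how Scrapy is implemented (the real engine is asynchronous and built on Twisted); the function names, the fake PAGES site, and the ("item" / "request") tuple convention are all assumptions made for illustration.

```python
from collections import deque

# A fake "website": URL -> (quotes on the page, URL of the next page or None).
PAGES = {
    "/page/1": (["quote A", "quote B"], "/page/2"),
    "/page/2": (["quote C"], None),
}


def download(url):
    """Downloader: fetch a response for a URL (here, a dictionary lookup)."""
    return PAGES[url]


def parse(response):
    """Spider: turn a response into scraped items and follow-up requests."""
    quotes, next_url = response
    for quote in quotes:
        yield ("item", quote)
    if next_url is not None:
        yield ("request", next_url)


def run_engine(start_urls):
    """Engine: move requests and responses between the other components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)         # avoid scheduling the same URL twice
    items = []
    while scheduler:
        url = scheduler.popleft()
        response = download(url)
        for kind, value in parse(response):
            if kind == "item":
                items.append(value)
            elif value not in seen:
                seen.add(value)
                scheduler.append(value)
    return items


if __name__ == "__main__":
    print(run_engine(["/page/1"]))
```

The loop in run_engine plays the Engine's coordinating role: it takes the next request from the Scheduler, hands it to the Downloader, passes the response to the Spider, and routes what the Spider yields either to the collected items or back to the Scheduler.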


15666 questions
368 votes, 3 answers

Headless Browser and scraping - solutions

I'm trying to put list of possible solutions for browser automatic tests suits and headless browser platforms capable of scraping. BROWSER TESTING / SCRAPING: Selenium - polyglot flagship in browser automation, bindings for Python, Ruby, …
Inoperable
  • 1,339
  • 5
  • 15
  • 30
236 votes, 23 answers

Cannot install Lxml on Mac OS X 10.9

I want to install Lxml so I can then install Scrapy. When I updated my Mac today it wouldn't let me reinstall lxml, I get the following error: In file included from…
David O'Regan
  • 2,594
  • 2
  • 11
  • 12
207 votes, 18 answers

"OSError: [Errno 1] Operation not permitted" when installing Scrapy in OSX 10.11 (El Capitan) (System Integrity Protection)

I'm trying to install Scrapy Python framework in OSX 10.11 (El Capitan) via pip. The installation script downloads the required modules and at some point returns the following error: OSError: [Errno 1] Operation not permitted:…
Luis U.
  • 2,452
  • 2
  • 14
  • 15
157 votes, 19 answers

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org

I'm practicing the code from 'Web Scraping with Python', and I keep having this certificate problem: from urllib.request import urlopen from bs4 import BeautifulSoup import re pages = set() def getLinks(pageUrl): global pages html =…
Catherine4j
  • 1,592
  • 2
  • 6
  • 9
153 votes, 8 answers

Can scrapy be used to scrape dynamic content from websites that are using AJAX?

I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel. Most of the issues are solvable and…
Joseph
  • 3,749
  • 9
  • 28
  • 46
143 votes, 9 answers

Difference between BeautifulSoup and Scrapy crawler?

I want to make a website that shows the comparison between amazon and e-bay product price. Which of these will work better and why? I am somewhat familiar with BeautifulSoup but not so much with Scrapy crawler.
Nishant Bhakta
  • 2,587
  • 2
  • 18
  • 23
117 votes, 5 answers

How to pass a user defined argument in scrapy spider

I am trying to pass a user defined argument to a scrapy's spider. Can anyone suggest on how to do that? I read about a parameter -a somewhere but have no idea how to use it.
L Lawliet
  • 2,445
  • 4
  • 21
  • 34
102 votes, 10 answers

How to use PyCharm to debug Scrapy projects

I am working on Scrapy 0.20 with Python 2.7. I found PyCharm has a good Python debugger. I want to test my Scrapy spiders using it. Anyone knows how to do that please? What I have tried Actually I tried to run the spider as a script. As a result, I…
William Kinaan
  • 25,507
  • 20
  • 76
  • 115
93 votes, 3 answers

pip is not able to install packages correctly: Permission denied error

I am trying to install lxml to install scrapy on my Mac (v 10.9.4) ╭─ishaantaylor@Ishaans-MacBook-Pro.local ~ ╰─➤ pip install lxml Downloading/unpacking lxml Downloading lxml-3.4.0.tar.gz (3.5MB): 3.5MB downloaded Running setup.py…
Ishaan Taylor
  • 1,459
  • 3
  • 12
  • 18
90 votes, 10 answers

How can I use different pipelines for different spiders in a single Scrapy project

I have a scrapy project which contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all the pipelines i have defined are applicable for every spider. Thanks
CodeMonkeyB
  • 2,610
  • 3
  • 18
  • 28
89 votes, 2 answers

selenium with scrapy for dynamic page

I'm trying to scrape product information from a webpage, using scrapy. My to-be-scraped webpage looks like this: starts with a product_list page with 10 products a click on "next" button loads the next 10 products (url doesn't change between the…
Z. Lin
  • 1,292
  • 2
  • 10
  • 16
78 votes, 8 answers

How to run Scrapy from within a Python script

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this: http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/ http://snipplr.com/view/67006/using-scrapy-from-a-script/ I can't…
user47954
  • 889
  • 1
  • 7
  • 4
64 votes, 10 answers

Scrapy Unit Testing

I'd like to implement some unit tests in a Scrapy (screen scraper/web crawler). Since a project is run through the "scrapy crawl" command I can run it through something like nose. Since scrapy is built on top of twisted can I use its unit testing…
ciferkey
  • 1,883
  • 1
  • 21
  • 28
62 votes, 1 answer

Using Scrapy with authenticated (logged in) user session

In the Scrapy docs, there is the following example to illustrate how to use an authenticated session in Scrapy: class LoginSpider(BaseSpider): name = 'example.com' start_urls = ['http://www.example.com/users/login.php'] def parse(self,…
Herman Schaaf
  • 39,417
  • 19
  • 92
  • 137
59 votes, 3 answers

getting Forbidden by robots.txt: scrapy

While crawling a website like https://www.netflix.com, getting Forbidden by robots.txt: <GET https://www.netflix.com/> ERROR: No response downloaded for: https://www.netflix.com/
deepak kumar
  • 593
  • 1
  • 4
  • 4