Web Scraping platform efficiency

Question

Is web scraping more efficient on Windows or Ubuntu?
For scraping quotes from the web, which would be better: Scrapy or Beautiful Soup?

score 1 · Answer 1 · answered Nov 15 '17 at 16:27

Question 1: Efficiency

This is a very broad question. Basically, efficiency depends on the following criteria:
- Computer Performance
- Network Stability
- Anti-antispider techniques
- Extraction Method
- Business purpose
Here is how each of them affects your efficiency:
- Computer Performance
  
  If you are targeting big e-commerce sites, or if the site uses a massive amount of JavaScript (like LinkedIn), then you should consider a moderate-performance instance (computer) to finish your job. Note that if your computer's memory is too small, scrapy-splash's Docker container will stop automatically and cause your spider to fail immediately. Compared to Windows Home Edition, you'd better choose Ubuntu, because Ubuntu uses less memory. Whichever operating system you choose, a server edition is generally the better choice over a home edition.
  - Circumstance 1: (Using Scrapy with Selenium WebDriver)
    - Example1: Linked-In Sales Navigator Spider
    - Example2: WeiFeng Spider with reCAPTCHA cracked
  - Circumstance 2: (Using Scrapy with the scrapy-splash JavaScript rendering service)
    - Example1: INC5000 Spider
- Network Stability
  
  Network stability matters when your instance (computer) is far away from your target site. Network speed and latency directly affect your spider and can sometimes cause disaster. Low network speed slows down your requests, while high latency can cause your spider to fail to load the target page, which leads to errors in later content extraction. Your program may hit an exception and quit immediately, and if you are not using a modern spider framework, the failed page will not be re-fetched later, so you will lose some data. Compared to running a spider on a home network, deploying it on a public cloud is a better solution.
  - Some cloud VPS providers that you can choose from: Link
  - You can use an IP location detector to find the location of your target site: Link
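As a minimal sketch of the re-fetching behaviour mentioned above, assuming you use Scrapy: its built-in retry middleware is controlled by a few settings. The values below are illustrative assumptions, not tuned recommendations.

```python
# settings.py fragment for a Scrapy project.
# All values are illustrative assumptions; tune them for your own target site.

# Keep Scrapy's built-in retry middleware switched on.
RETRY_ENABLED = True

# Retry a failed request up to 3 more times after the first attempt.
RETRY_TIMES = 3

# HTTP status codes that should trigger a retry.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Give up on a single download after 30 seconds, so a slow
# connection fails fast and gets retried instead of hanging.
DOWNLOAD_TIMEOUT = 30
```

With settings like these, a page that times out because of latency is queued again instead of being silently lost.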
- Anti-antispider techniques
  - IP Rotation:
    - Method 1: Use Scrapy with Crawlera
    - Method 2: Proxy Pool
  - UA Rotation:
    - Scrapy with UA Rotation Framework
  - Download Delay:
    - Scrapy with Download Delay
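A rough sketch of UA rotation and download delay in Scrapy could look like the following. The middleware class, the module path `myproject.middlewares`, and the User-Agent strings are all hypothetical names made up for illustration; only `process_request`, `DOWNLOAD_DELAY`, `RANDOMIZE_DOWNLOAD_DELAY`, and `DOWNLOADER_MIDDLEWARES` are Scrapy's own.

```python
import random

# Hypothetical pool of User-Agent strings; replace with real, current ones.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy keep processing the request


# settings.py fragment (illustrative values)
DOWNLOAD_DELAY = 2               # base delay, in seconds, between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the actual delay around the base value
DOWNLOADER_MIDDLEWARES = {
    # "myproject" is a placeholder for your own project package.
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
}
```

The priority of 400 is chosen so that this middleware runs before Scrapy's default User-Agent middleware, which only fills in the header when it is still missing.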
- Extraction Method
  
  This is a really broad topic. You can use fast techniques to locate elements, such as XPath, bs4, or CSS selectors, or slower techniques such as deep learning, search, or even regular expressions.
  - Beautiful Soup (aka bs4):
    - Intro-bs4
    - BeautifulSoup is not a good solution if you want to parse complicated website HTML. It does not support XPath (CSS selectors are available through select()), so you often have to work out the site's element hierarchy manually, which results in code like this:
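```
# bs_obj is assumed to be an existing BeautifulSoup object.
# Each loop searches inside the element found by the loop above it.
for level1 in bs_obj.find_all("div", {"id": "classname"}):
    for level2 in level1.find_all("div", {"class": "classname"}):
        for level3 in level2.find_all("a", {"class": "classname"}):
            # for level4 ... one more loop per nesting level
            ...
```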
  - XPath and CSS selectors:
  - Regular Expression (aka re):
    - This is a good way to extract content that matches a specific string pattern, but it is slower than the others.
    - Regular Expression Tutorial
  - Scrapy integrates XPath, CSS, and re through parsel; you can check this tutorial to learn how to use them within Scrapy.
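To make the regular-expression option concrete, here is a small sketch that uses only Python's standard `re` module. The HTML fragment, the class names, and the `extract_quotes` helper are assumptions made up for this example; real pages differ, and a pattern like this breaks as soon as the markup changes.

```python
import re

# Hypothetical HTML fragment, shaped like a typical quotes page.
SAMPLE_HTML = """
<div class="quote"><span class="text">"First quote."</span>
  <small class="author">Author One</small></div>
<div class="quote"><span class="text">"Second quote."</span>
  <small class="author">Author Two</small></div>
"""

# Non-greedy groups capture the quote text and the author name.
QUOTE_PATTERN = re.compile(
    r'<span class="text">(?P<text>.*?)</span>\s*'
    r'<small class="author">(?P<author>.*?)</small>',
    re.DOTALL,
)


def extract_quotes(html):
    """Return a list of (text, author) tuples found in the HTML string."""
    return [
        (match.group("text"), match.group("author"))
        for match in QUOTE_PATTERN.finditer(html)
    ]


quotes = extract_quotes(SAMPLE_HTML)
for text, author in quotes:
    print(f"{author}: {text}")
```

Inside a Scrapy callback, the same kind of pattern can be applied through the selector's `re()` method instead of calling `re` directly.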

Question 2: Scrapy vs bs4 for scraping quotes from web

Scrapy is a scraping framework, while bs4 is a content-extraction library, so the answer is that you can use BeautifulSoup within Scrapy.
There is also some user-friendly scraping software:
- 7 Tools for web scraping
- Even the company behind Scrapy (Scrapinghub) is developing its own open-source, HTML5-based scraper: Portia
