  1. Is web scraping more efficient on Windows or on Ubuntu?
  2. For scraping quotes from the web, which would be better: Scrapy or Beautiful Soup?
JJJ
saavi

1 Answer

Question 1: Efficiency

  • This is a very broad question. Basically, efficiency depends on the following criteria:

    • Computer Performance
    • Network Stability
    • Anti-antispider techniques
    • Extraction Method
    • Business purpose
  • Here is how they affect your efficiency:

    • Computer Performance

      If you are targeting big e-commerce sites, or sites that use a massive amount of JavaScript (like LinkedIn), you should consider at least a moderate-performance instance (computer) to finish the job. Note that if your computer has too little memory, the Docker container used by scrapy-splash can stop on its own and cause your spider to fail immediately. Compared to Windows Home Edition, Ubuntu is the better choice, because it typically uses less memory, among other things. Whichever operating system you choose, a server edition is generally a better fit than a home edition.

    • Network Stability

      Network stability matters when your instance (computer) is far away from the site you are targeting. Network speed and latency directly affect your spider, sometimes disastrously. Low network speed slows down your requests, while high latency can cause your spider to fail to load the target page. That leads to errors later, during content extraction. Your program may hit an exception and quit immediately, and if you are not using a modern spider framework, the failed page will not be re-fetched, so you will lose some data. Compared to running a spider on a home network, deploying it on a public cloud is a better solution.

      • Some cloud VPS providers you can choose from: Link
      • You can use an IP location detector to find the location of your target site: Link
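The re-fetching behaviour described above can be sketched by hand. This is only a minimal illustration, assuming a plain standard-library download; the function name `fetch_with_retries` and its parameters are hypothetical, and a framework such as Scrapy already ships its own retry middleware, so you would not normally write this yourself.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, fetch=None, retries=3, backoff=1.0, sleep=time.sleep):
    """Return the body of `url`, retrying when the network fails.

    `fetch` is any callable that takes a URL and returns bytes. It defaults
    to a plain urllib download. All names here are illustrative.
    """
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u, timeout=10).read()

    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except (urllib.error.URLError, TimeoutError, ConnectionError):
            if attempt == retries:
                # Out of attempts: re-raise so the caller can record the URL
                # and re-crawl it later instead of silently losing the page.
                raise
            # Wait longer after each failure: backoff, 2*backoff, 4*backoff, ...
            sleep(backoff * 2 ** (attempt - 1))
```

Passing the downloader in as `fetch` keeps the retry logic separate from the HTTP client, so the same loop works with whichever client you actually use.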
    • Anti-antispider techniques

    • Extraction Method

      This is really a broad topic. You can use fast techniques to locate elements, such as XPath, bs4, or CSS selectors, or you can use slower techniques such as deep learning, search, or even regular expressions.
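To make the difference between extraction methods concrete, here is a small sketch that pulls the same data out of a page in two ways: with a path-based lookup and with a regular expression. It uses only the standard library, so `xml.etree.ElementTree` stands in for a real HTML parser such as lxml or bs4 and only works because the sample markup is well formed. The page content and class names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page (real HTML is rarely this tidy).
PAGE = """<html><body>
<div class="quote"><span class="text">Stay hungry.</span><small class="author">A. Writer</small></div>
<div class="quote"><span class="text">Keep it simple.</span><small class="author">B. Coder</small></div>
</body></html>"""


def quotes_by_path(page):
    """Locate elements by structure, using ElementTree's XPath subset."""
    root = ET.fromstring(page)
    results = []
    for block in root.iterfind(".//div[@class='quote']"):
        text = block.findtext("span[@class='text']")
        author = block.findtext("small[@class='author']")
        results.append((text, author))
    return results


# The regex depends on the exact tag order and attribute quoting.
QUOTE_RE = re.compile(
    r'<span class="text">(.*?)</span><small class="author">(.*?)</small>',
    re.S,
)


def quotes_by_regex(page):
    """Locate the same data by matching the raw markup text."""
    return QUOTE_RE.findall(page)
```

Both functions return the same list of `(text, author)` pairs for this page. The path-based version keeps working if attributes are reordered or whitespace changes, while the regular expression breaks as soon as the markup drifts from the pattern.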

Question 2: Scrapy vs bs4 for scraping quotes from web

  • Scrapy is a scraping framework, while bs4 is a content-extraction library, so the two are not really alternatives: you can use BeautifulSoup inside Scrapy.
  • There is also some user-friendly scraping software.