- Is web scraping more efficient on Windows or Ubuntu?
- For scraping quotes from the web, which would be better: Scrapy or Beautiful Soup?
1 Answer
Question 1: Efficiency
This is a very broad question. Basically, efficiency depends on the following criteria:
- Computer Performance
- Network Stability
- Anti-antispider techniques
- Extraction Method
- Business purpose
Here is how each of them affects your efficiency:
Computer Performance
If you are targeting big e-commerce sites, or sites that use a massive amount of JavaScript (like LinkedIn), you should consider at least a moderate-performance instance (computer) to finish the job. Note that if your computer's memory is too small, scrapy-splash's Docker container will automatically stop and cause your spider to fail immediately. Compared to Windows Home Edition, you'd better choose Ubuntu, because Ubuntu uses less memory, etc. No matter which operating system you choose, a Server Edition is generally a better choice than a Home Edition.
Circumstance 1: (Using Scrapy with Selenium WebDriver)
Circumstance 2: (Using Scrapy with the scrapy-splash JavaScript rendering service)
Network Stability
Network stability matters when your instance (computer) is far away from your target site. Network speed and latency directly affect your spider and can sometimes cause disaster. Low network speed slows down your requests, while high latency sometimes causes your spider to fail to load the target web page. This leads to errors in later content extraction. Your program may catch an exception and quit immediately, and if you are not using a modern spider framework, the failed page will not be re-fetched later, so you will lose some data. Compared to running a spider on a home network, deploying it on a public cloud is a better solution.
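To illustrate the point about failed pages never being re-fetched, here is a minimal sketch of a retry wrapper with exponential backoff, using only the standard library. The names `fetch_with_retry`, `fetch`, `max_attempts`, and `base_delay` are hypothetical and not part of any framework; `fetch` stands for whatever function you use to download a page.

```python
import time
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying on OSError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError:
            # TimeoutError and ConnectionError are OSError subclasses.
            if attempt == max_attempts:
                raise  # give up so the caller can log or re-queue the URL
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Scrapy already retries failed requests for you (see its `RETRY_TIMES` setting), so a helper like this is mainly relevant when you write a spider without a framework.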
Anti-antispider techniques
- IP Rotation:
- UA Rotation:
- Download Delay:
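As a rough illustration of UA rotation and download delay, the sketch below shows a class written in the shape of a Scrapy downloader middleware. The class name, the `USER_AGENTS` pool, and the module path in the comments are assumptions made for this example; `DOWNLOAD_DELAY` and `DOWNLOADER_MIDDLEWARES` are existing Scrapy setting names.

```python
import random

# Hypothetical pool; replace with real, current browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


class RandomUserAgentMiddleware:
    """Downloader-middleware-style class that sets a random User-Agent."""

    def __init__(self, user_agents=USER_AGENTS, rng=random):
        self.user_agents = list(user_agents)
        self.rng = rng

    def process_request(self, request, spider):
        # Pick a different User-Agent string for each outgoing request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None  # None lets Scrapy continue handling the request


# In settings.py (values and module path are illustrative only):
# DOWNLOAD_DELAY = 2
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomUserAgentMiddleware": 400,
# }
```

IP rotation can follow the same pattern, for example by setting `request.meta["proxy"]` to a proxy URL picked from your own pool.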
Extraction Method
This is really a broad topic. You can use fast techniques to locate elements, such as XPath, bs4, or CSS selectors, or you can use slower techniques such as deep learning, search, or even regular expressions.
Beautiful Soup (aka bs4):
- Intro-bs4
Beautiful Soup is not a good solution if you want to parse complicated website HTML. It does not support XPath (it does offer CSS selectors through select()), so if you rely on find_all you have to work out the page's element hierarchy by hand and end up with code like this:
    # bs_obj is an already-parsed BeautifulSoup object
    for level1 in bs_obj.find_all("div", {"id": "classname"}):
        for level2 in level1.find_all("div", {"class": "classname"}):
            for level3 in level2.find_all("a", {"class": "classname"}):
                ...  # for level4 ........ and so on
XPath and CSS selectors:
Regular Expression (aka re):
- This is a good way to extract content that follows a specific string pattern, but it is slower than the other methods.
- Regular Expression Tutorial
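As a small illustration of pattern-based extraction for the quotes use case, here is a sketch using the standard `re` module. The markup in `SAMPLE_HTML` and the names `QUOTE_RE` and `extract_quotes` are made up for this example; a real page will have different tags and classes, and regular expressions break easily when the HTML changes.

```python
import re

# Hypothetical markup; a real page will differ.
SAMPLE_HTML = """
<div class="quote"><span class="text">First sample quote.</span>
<small class="author">Author One</small></div>
<div class="quote"><span class="text">Second sample quote.</span>
<small class="author">Author Two</small></div>
"""

# Non-greedy groups capture the quote text and the author name.
QUOTE_RE = re.compile(
    r'<span class="text">(?P<text>.*?)</span>\s*'
    r'<small class="author">(?P<author>.*?)</small>',
    re.DOTALL,
)


def extract_quotes(html):
    """Return (text, author) pairs found by QUOTE_RE."""
    return [(m.group("text"), m.group("author")) for m in QUOTE_RE.finditer(html)]


if __name__ == "__main__":
    for text, author in extract_quotes(SAMPLE_HTML):
        print(f"{author}: {text}")
```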
Scrapy integrates XPath, CSS, and re through parsel; you can check this tutorial to learn how to use them within Scrapy.
Question 2: Scrapy vs bs4 for scraping quotes from the web
Scrapy is a scraping framework while bs4 is a content-extraction library, so the answer is that you can use BeautifulSoup in Scrapy.
- There is also some user-friendly scraping software
- 7 Tools for web scraping
- Even the company behind Scrapy (Scrapinghub) is developing its own open-source, HTML5-based scraper: Portia