-3

Can anybody recommend a fast HTML parser (Python or Node.js) for a project where speed matters? I'm currently using bs4 with the lxml backend, but it doesn't seem to be the fastest option. You can see a speed comparison between bs4 and pure lxml here: https://edmundmartin.com/beautiful-soup-vs-lxml-speed/
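For reference, this is roughly what "bs4 with the lxml backend" looks like in my case (a minimal sketch; the URL and the data being extracted are placeholders, not my real project):

    import requests
    from bs4 import BeautifulSoup

    # Fetch a page and parse it with BeautifulSoup, using lxml as the backend
    html_text = requests.get("https://example.com/").text
    soup = BeautifulSoup(html_text, "lxml")

    # Typical bs4-style extraction: collect all link targets on the page
    print([a.get("href") for a in soup.find_all("a")])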

Nain
  • 3
  • 3

2 Answers

0

I highly recommend scrapy!

https://scrapy.org/

It's a Python framework built for speed. I recently made a scraper to download every page from a certain website and build a custom database from it; here are some stats from that run:

{'downloader/request_bytes': 13544866,
 'downloader/request_count': 36798,
 'downloader/request_method_count/GET': 36798,
 'downloader/response_bytes': 170688438,
 'downloader/response_count': 36798,
 'downloader/response_status_count/200': 36780,
 'downloader/response_status_count/301': 17,
 'downloader/response_status_count/302': 1,
 'dupefilter/filtered': 22358,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 6, 12, 20, 11, 41, 103069),
 'item_scraped_count': 15160,
 'log_count/DEBUG': 51960,
 'log_count/ERROR': 64,
 'log_count/INFO': 29,
 'request_depth_max': 4,
 'response_received_count': 36780,
 'scheduler/dequeued': 36796,
 'scheduler/dequeued/memory': 36796,
 'scheduler/enqueued': 36796,
 'scheduler/enqueued/memory': 36796,
 'spider_exceptions/JSONDecodeError': 64,
 'start_time': datetime.datetime(2019, 6, 12, 19, 51, 27, 87242)} 

Overall it made 36,798 requests and scraped 15,160 items into my output, and the whole run took about 20 minutes.
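For context, a minimal spider looks something like the sketch below (the site, selectors, and field names are placeholders, not the ones from my project):

    import scrapy

    class PageSpider(scrapy.Spider):
        name = "pages"
        start_urls = ["https://example.com/"]

        def parse(self, response):
            # Yield one item per page; Scrapy collects these into the output feed
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
            }
            # Follow links found on the page; the built-in dupefilter drops URLs
            # that have already been requested (see dupefilter/filtered above)
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

You can run it with something like scrapy runspider spider.py -o output.json, and Scrapy handles the concurrency, retries, and deduplication for you.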

Chris Sandvik
  • 1,147
  • 6
  • 15
0

I believe you could use lxml itself (https://lxml.de/index.html) without the bs4 wrapper and get slightly better performance than bs4 with an lxml parser backend.
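As a rough sketch of what that looks like (the URL and XPath expressions are placeholders):

    import requests
    from lxml import html

    # Parse the raw bytes directly with lxml, no BeautifulSoup wrapper
    resp = requests.get("https://example.com/")
    tree = html.fromstring(resp.content)

    # XPath queries go straight to lxml's C implementation
    titles = tree.xpath("//h1/text()")
    links = tree.xpath("//a/@href")
    print(titles[:3], links[:3])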

That said, this is largely a matter of opinion and personal preference, which I know isn't very welcome here on SO, but I think the bs4 interface is worth the performance overhead. I actually prefer the built-in Python html.parser backend (as long as you have a recent Python version) because of the way it handles ill-formatted web pages, and there are plenty of those on the web. When a page is ill-formatted, every parser gives a different output; to me html.parser produces the most useful result (it doesn't discard useful information, and it doesn't flood you with gibberish). Again, very much a matter of preference and use case.
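Since the parser backend is just an argument to BeautifulSoup, it's easy to see for yourself how each one repairs the same broken markup. A quick sketch (html5lib only works if you have it installed):

    from bs4 import BeautifulSoup

    broken = "<p>unclosed <b>tag <p>next paragraph"

    # Each backend fixes up the broken markup in its own way
    for backend in ("html.parser", "lxml", "html5lib"):
        soup = BeautifulSoup(broken, backend)
        print(backend, "->", str(soup))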

Two notes. First: from what I've seen, Stack Overflow doesn't like recommendation questions, since they are subjective and invite heated debates, so don't be surprised if your question gets flagged; people may also suggest ways to improve it. Edit: You might be able to rephrase the question as something like: "I am currently using tool x and on an input of size y it takes z time to finish. Does anyone know how I can speed it up?" Hopefully that makes it acceptable.

Second: you might want to change the word "scraper" in your title to "parser", so that people who click on it know better what to expect.

Good Luck.

Elegant Code
  • 508
  • 2
  • 14
  • @Nain You might be able to rephrase the question as something like: "I am currently using tool x and on an input of size y it takes z time to finish. Does anyone know how I can speed it up?" Hope this becomes acceptable. – Elegant Code Jun 20 '19 at 21:11