24

I've been using the Scrapy web-scraping framework pretty extensively, but I recently discovered another framework/system called pyspider which, according to its GitHub page, is fresh, actively developed, and popular.

pyspider's home page lists several features supported out of the box:

  • Powerful WebUI with script editor, task monitor, project manager and result viewer

  • Javascript pages supported!

  • Task priority, retry, periodical and recrawl by age or marks in index page (like update time)

  • Distributed architecture

Scrapy itself doesn't provide these, but they are possible with the help of portia (for the web UI), scrapyjs (for JavaScript pages) and scrapyd (for deploying and distributing through an API).

Is it true that pyspider alone can replace all of these tools? In other words, is pyspider a direct alternative to Scrapy? If not, then which use cases does it cover?

I hope I'm not crossing the "too broad" or "opinion-based" line.

alecxe
  • This is pretty close to the opinion-based line. I'm not sure if I'd consider it over it. – Amber Dec 02 '14 at 06:43
  • @Amber thanks, I was worried about it. Tried to add specifics. (at least it is more detailed and specific than [Is it worth learning Scrapy?](http://stackoverflow.com/questions/6283271/is-it-worth-learning-scrapy)). – alecxe Dec 02 '14 at 06:45
  • @Amber I guess I've got the best answer I could possibly have here. Binux is the inventor and maintainer of pyspider project. Hope this thread would be a starting point for those who would have questions about the differences between scrapy and pyspider. – alecxe Dec 02 '14 at 18:10
  • @alecxe Would love a report back on your experience with pyspider given your more extensive experience with Scrapy. – chishaku Jan 08 '15 at 11:02
  • @chishaku this is a good idea, I think I'll provide an answer some day with my own observations and feelings about it. Thank you! – alecxe Jan 08 '15 at 13:03
  • Did you try pyspider? I'm now at the same decision point you were. Can you tell me what you decided? – Yuseferi Apr 26 '17 at 07:47
  • @zhilevan unfortunately, I have not moved beyond the tutorial, but I had a pleasant experience and positive impressions. I suggest you do the same - go over both Scrapy and PySpider tutorials to have a feel of the architectures. Thanks. – alecxe Apr 26 '17 at 07:58

2 Answers

29

pyspider and Scrapy have the same purpose, web scraping, but take a different view of how to do it.

  • A spider should never stop until the WWW is dead. (Information changes and websites update their data, so a spider should have the ability and the responsibility to scrape the latest data. That's why pyspider has a URL database, a powerful scheduler, @every, age, etc.)

  • pyspider is a service more than a framework. (Components run in isolated processes; the lite "all" version runs as a service too; you don't need a Python environment, only a browser; everything about fetching or scheduling is controlled by the script via the API, not by startup parameters or global configs; resources/projects are managed by pyspider; etc.)

  • pyspider is a spider system. (Any component can be replaced, even by one developed in C/C++/Java or any other language, for better performance or larger capacity.)
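To make the "URL database" and age idea from the first point concrete, here is a minimal standard-library sketch. It is not pyspider's implementation; the class name `UrlStore`, its methods, and the table layout are all hypothetical and only illustrate the concept of remembering when a URL was last crawled and treating it as fresh for `age` seconds.

```python
import sqlite3
import time


class UrlStore:
    """Toy persistent URL database (hypothetical, for illustration only)."""

    def __init__(self, path=":memory:"):
        # Passing a file path instead of ":memory:" would survive restarts.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tasks "
            "(url TEXT PRIMARY KEY, crawled_at REAL)"
        )

    def should_crawl(self, url, age, now=None):
        """True if url was never crawled or its last crawl is at least `age` seconds old."""
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT crawled_at FROM tasks WHERE url = ?", (url,)
        ).fetchone()
        return row is None or now - row[0] >= age

    def mark_crawled(self, url, now=None):
        """Record (or overwrite) the time at which url was crawled."""
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO tasks (url, crawled_at) VALUES (?, ?)",
            (url, now),
        )
        self.db.commit()
```

A scheduler built on something like this can keep running indefinitely: each pass simply asks `should_crawl(url, age)` and skips pages that are still considered fresh.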

and, comparing specific features (pyspider vs Scrapy):

  • on_start vs start_urls
  • token bucket traffic control vs download_delay
  • return json vs class Item
  • message queue vs Pipeline
  • built-in url database vs set
  • Persistence vs In-memory
  • PyQuery + any third package you like vs built-in CSS/Xpath support
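As an illustration of the "token bucket traffic control vs download_delay" item, the following is a generic token bucket sketch in plain Python. It is an assumption-laden toy, not code taken from pyspider: the names `TokenBucket`, `rate`, `burst`, and `try_acquire` are chosen here for the example. The point is that a token bucket allows short bursts up to a limit while capping the average request rate, whereas a fixed delay spaces every request evenly.

```python
import time


class TokenBucket:
    """Generic token bucket rate limiter (hypothetical sketch)."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate        # tokens added per second
        self.burst = burst      # maximum number of tokens the bucket holds
        self.clock = clock      # injectable clock, handy for testing
        self.tokens = burst     # start with a full bucket
        self.updated = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last update, capped at burst.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available; return whether a request may go out now."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `rate=2` and `burst=3`, for example, three requests can go out back to back, after which requests are admitted at roughly two per second.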

In fact, I did not borrow much from Scrapy. pyspider is really different from Scrapy.

But why not try it yourself? pyspider is also fast, has an easy-to-use API, and you can try it without installing it.

halfer
Binux
  • @Binux: I'd like to see a new web scraping tool, excellent work. But, why not python3? Python 2 is the past, that's why I abandoned Scrapy – Jedi Dec 11 '14 at 00:34
  • @Jedi I'm more familiar with Python 2.7, and pyspider was first written 2 years ago with Python 2.7. I wanted to start from what I'm more familiar with and focus on the architecture. I will add Python 3 support before v0.5.0 – Binux Dec 11 '14 at 03:33
  • It looks like you are the author of the tool you recommend. That's fine, but can you add a full disclosure note when you do so? – halfer Jun 17 '15 at 23:42
7

Since I use both scrapy and pyspider, I would like to suggest the following:

If the website is really small/simple, try pyspider first, since it has almost everything you need:

  • Use the web UI to set up the project
  • Try the online code editor and view the parse result instantly
  • View the result easily in the browser
  • Run/pause the project
  • Set up the expiration date so it can re-process the URL

However, if you have tried pyspider and found that it doesn't fit your needs, it's time to use Scrapy:

  • migrate on_start to start_requests
  • migrate index_page to parse
  • migrate detail_page to detail_page
  • change self.crawl to response.follow

Then you are almost done. Now you can play with Scrapy's advanced features like middleware, items, pipelines, etc.

Kai Huang