Questions tagged [scrapyd]

`Scrapyd` is a daemon for managing `Scrapy` projects. The project used to be part of `scrapy` itself, but was separated out and is now a standalone project. It runs on a machine and allows you to deploy (aka. upload) your projects and control the spiders they contain using a JSON web service.

Scrapyd can manage multiple projects and each project can have multiple versions uploaded, but only the latest one will be used for launching new spiders.

349 questions
51 votes • 5 answers

Scrapy get request url in parse

How can I get the request URL in Scrapy's parse() function? I have a lot of URLs in start_urls, and some of them redirect my spider to the homepage, leaving me with an empty item. So I need something like item['start_url'] = request.url to store…
Goran • 5,427
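
A minimal sketch of the usual approach, assuming the stock RedirectMiddleware is enabled: the final URL is on `response.url`, while the originally requested URL survives on `response.request` and in the `redirect_urls` meta key. Spider name and URLs below are placeholders.

```python
import scrapy

class StartUrlSpider(scrapy.Spider):
    # Hypothetical spider illustrating the pattern.
    name = "start_url_spider"
    start_urls = ["https://example.com/a", "https://example.com/b"]

    def parse(self, response):
        # response.url is the final URL after any redirects; the stock
        # RedirectMiddleware records the original chain under the
        # 'redirect_urls' meta key.
        redirects = response.request.meta.get("redirect_urls")
        yield {
            "url": response.url,
            "start_url": redirects[0] if redirects else response.request.url,
        }
```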
26 votes • 1 answer

ScrapyRT vs Scrapyd

We've been using the Scrapyd service for a while up until now. It provides a nice wrapper around a Scrapy project and its spiders, letting you control the spiders via an HTTP API: Scrapyd is a service for running Scrapy spiders. It allows you to deploy…
alecxe • 414,977
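
For context, the Scrapyd side of that comparison boils down to its JSON API; a minimal sketch using the `requests` package (project and spider names are placeholders):

```python
import requests

# Schedule a crawl on a locally running Scrapyd instance.
resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "myspider"},
)
print(resp.json())  # {"status": "ok", "jobid": "..."} on success
```

ScrapyRT, by contrast, runs the spider synchronously and returns the scraped items in the HTTP response itself, which is the crux of the trade-off the question asks about.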
25 votes • 4 answers

How to setup and launch a Scrapy spider programmatically (urls and settings)

I've written a working crawler using Scrapy; now I want to control it through a Django webapp, that is to say: set one or several start_urls, set one or several allowed_domains, set settings values, start the spider, stop/pause/resume a…
arno • 497
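
A minimal sketch of one way to do this with `scrapy.crawler.CrawlerProcess`; the spider class and values are placeholders. Keyword arguments passed to `crawl()` become spider attributes, which covers the start_urls/allowed_domains part:

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class ControlledSpider(scrapy.Spider):
    name = "controlled"

    def parse(self, response):
        self.logger.info("Visited %s", response.url)

# Settings can be injected here instead of coming from settings.py.
process = CrawlerProcess(settings={"DOWNLOAD_DELAY": 1.0})
process.crawl(
    ControlledSpider,
    start_urls=["https://example.com"],
    allowed_domains=["example.com"],
)
process.start()  # blocks until the crawl finishes
```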
15 votes • 1 answer

Learning Python and trying to implement Scrapy: getting this error

I am going through the Scrapy tutorial http://doc.scrapy.org/en/latest/intro/tutorial.html and followed it until I ran the command scrapy crawl dmoz, which gave me output with an error: 2013-08-25 13:11:42-0700 [scrapy] INFO: Scrapy 0.18.0 started…
Asim Zaidi • 23,590
13 votes • 1 answer

Scrapy spider memory leak

My spider has a serious memory leak. After 15 minutes of running, its memory usage is 5 GB, and Scrapy reports (using prefs()) that there are 900k live Request objects and little else. What can be the reason for this high number of live Request objects? The count only goes up…
Aldarund • 14,747
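
For reference, prefs() is the live-object summary exposed in Scrapy's telnet console (localhost:6023 by default); the same trackref data is available in code. A sketch:

```python
from scrapy.utils.trackref import get_oldest, print_live_refs

print_live_refs()               # per-class live counts and oldest-object age
oldest = get_oldest("Request")  # longest-lived Request still referenced
if oldest is not None:
    print(oldest.url)           # often hints at what is holding the references
```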
13 votes • 3 answers

Running Multiple Scrapy Spiders (the easy way) Python

Scrapy is pretty cool; however, I found the documentation to be very bare-bones, and some simple questions were tough to answer. After putting together various techniques from various Stack Overflow answers, I have finally come up with an easy and not overly…
InfinteScroll • 656
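
The pattern that usually emerges from those answers is a single CrawlerProcess driving several spiders; a minimal sketch with placeholder spiders:

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class FirstSpider(scrapy.Spider):
    name = "first"
    start_urls = ["https://example.com/a"]

    def parse(self, response):
        yield {"spider": self.name, "url": response.url}

class SecondSpider(scrapy.Spider):
    name = "second"
    start_urls = ["https://example.com/b"]

    def parse(self, response):
        yield {"spider": self.name, "url": response.url}

process = CrawlerProcess()
process.crawl(FirstSpider)
process.crawl(SecondSpider)
process.start()  # both spiders run concurrently in one reactor
```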
11 votes • 2 answers

Run multiple scrapy spiders at once using scrapyd

I'm using scrapy for a project where I want to scrape a number of sites - possibly hundreds - and I have to write a specific spider for each site. I can schedule one spider in a project deployed to scrapyd using: curl…
user1009453 • 687
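
Extending that single schedule.json call to many spiders is just a loop; a sketch with placeholder names (Scrapyd queues the jobs and runs up to its configured process limit in parallel):

```python
import requests

SCRAPYD_URL = "http://localhost:6800/schedule.json"

for spider in ["site_a", "site_b", "site_c"]:  # one spider per site
    resp = requests.post(
        SCRAPYD_URL,
        data={"project": "myproject", "spider": spider},
    )
    print(spider, "->", resp.json().get("jobid"))
```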
10 votes • 2 answers

Parallelism/Performance problems with Scrapyd and single spider

Context: I am running scrapyd 1.1 + scrapy 0.24.6 with a single "selenium-scrapy hybrid" spider that crawls over many domains according to parameters. The development machine that hosts the scrapyd instance(s?) is an OS X Yosemite box with 4 cores, and this…
gerosalesc • 2,613
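
The knobs that govern this live in scrapyd.conf; a sketch with illustrative values (on a 4-core machine the defaults allow up to 16 concurrent crawl processes):

```ini
[scrapyd]
max_proc         = 0    # 0 = derive the limit from max_proc_per_cpu
max_proc_per_cpu = 4    # default; 4 cores x 4 = up to 16 processes
```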
10 votes • 1 answer

What are the advantages of using scrapyd?

The Scrapy docs say: Scrapy comes with a built-in service, called “Scrapyd”, which allows you to deploy (aka. upload) your projects and control their spiders using a JSON web service. Are there advantages to using scrapyd?
gnemoug • 347
9 votes • 1 answer

scrapyd-client command not found

I'd just installed scrapyd-client (1.1.0) in a virtualenv and ran the command 'scrapyd-deploy' successfully, but when I run 'scrapyd-client', the terminal says: command not found: scrapyd-client. According to the readme…
dropax • 115
8 votes • 3 answers

Scrapyd jobid value inside spider

Framework: Scrapy with a Scrapyd server. I have a problem getting the jobid value inside the spider. After posting data to http://localhost:6800/schedule.json, the response is status = ok, jobid = bc2096406b3011e1a2d0005056c00008, but I need to use this…
fcmax • 315
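
A sketch of the commonly cited approach, assuming the spider is launched by Scrapyd, which exports the job id to the crawl process in the SCRAPY_JOB environment variable:

```python
import os
import scrapy

class JobAwareSpider(scrapy.Spider):
    name = "job_aware"  # placeholder name

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # None when run outside Scrapyd (e.g. plain 'scrapy crawl').
        self.jobid = os.environ.get("SCRAPY_JOB")
```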
8 votes • 2 answers

Scrapy's Scrapyd too slow with scheduling spiders

I am running Scrapyd and encountered a weird issue when launching 4 spiders at the same time. 2012-02-06 15:27:17+0100 [HTTPChannel,0,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-"…
Sjaak Trekhaak • 4,596
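
A likely factor, stated as an assumption about Scrapyd versions of that era: the launcher takes at most one job from the queue per poll cycle, so four jobs scheduled together start roughly poll_interval seconds apart. The interval can be lowered in scrapyd.conf:

```ini
[scrapyd]
poll_interval = 0.5   # default is 5.0 seconds
```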
8 votes • 2 answers

Scrapyd-deploy command not found after scrapyd installation

I have created a couple of web spiders that I intend to run simultaneously with scrapyd. I first successfully installed scrapyd on Ubuntu 14.04 using the command pip install scrapyd, and when I run the command scrapyd, I get the following output…
loremIpsum1771 • 2,277
7 votes • 0 answers

scrapyd: is it possible to return ERROR status for a job

I have an application which schedules Scrapy crawl jobs via Scrapyd. Items flow nicely to the DB, and I can monitor the job status via the listjobs.json endpoint. So far so good, and I can even tell when jobs are finished. However, sometimes jobs can…
Oren Yosifon • 671
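
For reference, a minimal sketch of what listjobs.json exposes (project name is a placeholder): jobs are only bucketed as pending, running, or finished, with no success-versus-failure flag, which is exactly the gap the question describes.

```python
import requests

resp = requests.get(
    "http://localhost:6800/listjobs.json",
    params={"project": "myproject"},
)
jobs = resp.json()
for state in ("pending", "running", "finished"):
    print(state, [job["id"] for job in jobs.get(state, [])])
```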
7 votes • 1 answer

Horizontally scaling Scrapyd

What tool or set of tools would you use for horizontally scaling scrapyd, adding new machines to a scrapyd cluster dynamically and having N instances per machine if required? It is not necessary for all the instances to share a common job queue, but…
gerosalesc • 2,613