I don't see a good reason to do it that way :) You can change the `json_filename` setting by passing arguments on your scrapyd `schedule.json` request. Then you can make each spider generate slightly different files that you merge with post-processing or at query time. You can also write JSON files similar to what you have now by just setting the `FEED_URI` value (example). If you write to a single file simultaneously from multiple processes (especially when you open it in `'wb'` mode), you're asking for corrupted data.
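To see why, here's a minimal sketch (using a throwaway temp-dir path) of what happens when two writers share one file in `'wb'` mode:

```python
import os
import tempfile

# Two writers share one file in 'wb' mode: each open() truncates the file,
# and each writer keeps its own private offset, so the outputs interleave.
path = os.path.join(tempfile.gettempdir(), 'demo.json')

a = open(path, 'wb')
a.write(b'{"spider_one": ')
a.flush()                            # 15 bytes on disk, a's offset is 15

b = open(path, 'wb')                 # truncates the file under a's feet
b.write(b'{"spider_two": [1, 2]}')
b.close()                            # 22 bytes on disk

a.write(b'[]}')                      # lands at a's old offset (15)
a.close()

result = open(path, 'rb').read()
print(result)                        # b'{"spider_two": []} 2]}' - corrupt JSON
```

Neither spider's output survives intact: the tail of the first write overwrites the middle of the second.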
Edit:

After understanding a bit better what you need: in this case, scrapyd is starting multiple crawls running different spiders, where each one crawls a different website, and a consumer process is monitoring a single file continuously.
There are several solutions, including:

- Named pipes: relatively easy to implement, and OK for very small Items only (see here)
- RabbitMQ or some other queueing mechanism: a great solution, but might be a bit of an overkill
- A database, e.g. an SQLite-based solution: nice and simple, but likely requires some coding (a custom consumer)
- An `inotifywait`-based or other filesystem-monitoring solution: nice and likely easy to implement
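For reference, the SQLite route from the second bullet could be as small as this pipeline sketch; the database path and table layout are assumptions, and the consumer would simply query the table:

```python
import json
import sqlite3


class SQLitePipeline(object):
    """Hypothetical pipeline: every spider appends its items to one shared table."""

    def open_spider(self, spider):
        # '/tmp/items.db' is an assumed path; the timeout lets concurrent
        # scrapyd processes wait for the write lock instead of failing.
        self.conn = sqlite3.connect('/tmp/items.db', timeout=30)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS items (spider TEXT, item TEXT)')

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO items VALUES (?, ?)',
                          (spider.name, json.dumps(dict(item))))
        return item
```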
The last one seems like the most attractive option to me. When a `scrapy crawl` finishes (`spider_closed` signal), move, copy or create a soft link for the `FEED_URI` file to a directory that you monitor with a script like this. `mv` and `ln` are atomic Unix operations, so you should be fine. Hack the script to append each new file to the `tmp` file that you feed once to your consumer program.
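The linked script isn't reproduced here; as a rough stand-in that needs no extra tools, here's a polling sketch of the same idea (directory and file names are assumptions, and a real setup would react to `inotifywait` events instead of sleeping):

```python
import os
import shutil


def drain(watch_dir, combined_path, seen):
    """Append any not-yet-seen feed files in watch_dir to the combined file.

    Because mv/ln into watch_dir is atomic, every file listed here is
    already complete by the time we read it.
    """
    for name in sorted(os.listdir(watch_dir)):
        path = os.path.join(watch_dir, name)
        if path in seen:
            continue
        with open(combined_path, 'ab') as out, open(path, 'rb') as feed:
            shutil.copyfileobj(feed, out)
        seen.add(path)


# The consumer side could run something like:
# seen = set()
# while True:
#     drain('/data/feeds', '/data/combined.jl', seen)
#     time.sleep(2)
```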
By working this way, you use the default feed exporters to write your files. The end solution is so simple that you don't need a pipeline; a simple extension should fit the bill.

In an `extensions.py` in the same directory as `settings.py`:
```python
from scrapy import signals
from scrapy.exceptions import NotConfigured


class MoveFileOnCloseExtension(object):

    def __init__(self, feed_uri):
        self.feed_uri = feed_uri

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension object
        feed_uri = crawler.settings.get('FEED_URI')
        if not feed_uri:
            raise NotConfigured
        ext = cls(feed_uri)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        # return the extension object
        return ext

    def spider_closed(self, spider):
        # Move the file to the proper location, e.g.:
        # os.rename(self.feed_uri, ... destination path...)
        pass
```
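A concrete implementation of that move might look like this; the destination directory and the file-naming scheme are assumptions:

```python
import os
import time


def move_feed(feed_uri, spider_name, dest_dir):
    """Move a finished feed file into the watched directory.

    os.rename is atomic when source and destination are on the same
    filesystem, so the monitoring script never sees a partial file.
    """
    # FEED_URI may be given as a file:// URI; strip the scheme if present
    src = feed_uri[len('file://'):] if feed_uri.startswith('file://') else feed_uri
    dest = os.path.join(dest_dir, '%s-%d.json' % (spider_name, int(time.time())))
    os.rename(src, dest)
    return dest
```

Inside the extension, `spider_closed` would then just call `move_feed(self.feed_uri, spider.name, '/data/feeds')` (with the watched directory of your choice).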
In your `settings.py`:

```python
EXTENSIONS = {
    'myproject.extensions.MoveFileOnCloseExtension': 500,
}
```