
I have a scrapyd server with several spiders running at the same time; I start the spiders one by one using the schedule.json endpoint. All spiders write their contents to a common file using a pipeline:

import json
import os

class JsonWriterPipeline(object):

    def __init__(self, json_filename):
        # self.json_filepath = json_filepath
        self.json_filename = json_filename
        self.file = open(self.json_filename, 'wb')

    @classmethod
    def from_crawler(cls, crawler):
        save_path = '/tmp/'
        json_filename = crawler.settings.get('json_filename', 'FM_raw_export.json')
        completeName = os.path.join(save_path, json_filename)
        return cls(completeName)

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Once the spiders are running I can see that they are collecting data correctly: items are stored in the per-job XXXX.jl files and the spiders work fine. However, the crawled contents are not reflected in the common file. The spiders seem to work well, but the pipeline is not doing its job and is not collecting data into the common file.

I also noticed that only one spider writes to the file at a time.

1 Answer


I don't see any good reason to do what you do :) You can change the json_filename setting by passing arguments on your scrapyd schedule.json request. Then you can make each spider generate a slightly different file that you merge with post-processing or at query time. You can also write JSON files similar to what you have by just setting the FEED_URI value (example). If you write to a single file simultaneously from multiple processes (especially when you open it in 'wb' mode) you're asking for corrupted data.
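As a rough sketch of that first suggestion (the project name, spider names and output paths below are placeholders, and scrapyd is assumed to listen on the default localhost:6800), each crawl can be scheduled with its own per-job feed file:

import requests

# Schedule each spider as a separate scrapyd job with its own feed file.
# "myproject" and the spider names are placeholders.
for spider in ['spider1', 'spider2', 'spider3']:
    requests.post('http://localhost:6800/schedule.json', data={
        'project': 'myproject',
        'spider': spider,
        # per-job setting override: one output file per crawl
        'setting': 'FEED_URI=/tmp/%s.jl' % spider,
    })

The per-spider files can then be merged afterwards, e.g. with a simple cat /tmp/*.jl > merged.jl, before handing them to whatever consumes them.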

Edit:

After understanding a bit better what you need: in this case, scrapyd starts multiple crawls running different spiders, each one crawling a different website, while a consumer process monitors a single file continuously.

There are several solutions including:

  • named pipes

Relatively easy to implement and ok for very small Items only (see here)

  • RabbitMQ or some other queueing mechanism

Great solution, but might be a bit of overkill

  • A database e.g. SQLite based solution

Nice and simple but likely requires some coding (custom consumer)

  • A nice inotifywait-based or other filesystem monitoring solution

Nice and likely easy to implement

The last one seems like the most attractive option to me. When a scrapy crawl finishes (spider_closed signal), move, copy or soft-link the FEED_URI file to a directory that you monitor with a script like this. mv and ln are atomic unix operations, so you should be fine. Hack the script to append each new file to the tmp file that you feed to your consumer program.
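If you'd rather stay in Python than use inotifywait, a plain polling loop achieves the same effect. This is only a sketch; DROP_DIR and COMBINED are assumed names, with DROP_DIR being the directory the spider_closed handler moves finished feed files into:

import os
import shutil
import time

# Append every finished feed file that lands in DROP_DIR to the single file
# the downstream process reads. Both paths are assumptions.
DROP_DIR = '/tmp/finished_feeds'
COMBINED = '/tmp/FM_raw_export.json'

seen = set()
while True:
    for name in sorted(os.listdir(DROP_DIR)):
        path = os.path.join(DROP_DIR, name)
        if path in seen:
            continue
        with open(path, 'rb') as src, open(COMBINED, 'ab') as dst:
            shutil.copyfileobj(src, dst)  # append the whole .jl file
        seen.add(path)
    time.sleep(5)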

This way, you use the default feed exporters to write your files. The end solution is so simple that you don't need a pipeline; a simple Extension fits the bill.

In an extensions.py in the same directory as settings.py:

from scrapy import signals
from scrapy.exceptions import NotConfigured

class MoveFileOnCloseExtension(object):

    def __init__(self, feed_uri):
        self.feed_uri = feed_uri

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension object
        feed_uri = crawler.settings.get('FEED_URI')
        if not feed_uri:
            raise NotConfigured

        ext = cls(feed_uri)

        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        # return the extension object
        return ext

    def spider_closed(self, spider):
        # Move the file to the proper location
        # os.rename(self.feed_uri, ... destination path...)
        pass

In your settings.py:

EXTENSIONS = {
    'myproject.extensions.MoveFileOnCloseExtension': 500,
}
  • Thanks for your answer. I think I don't understand well how pipelines work in scrapy. From my understanding, the spiders generate items, the items of all spiders are passed to the pipeline, which is global to all of them, and it processes them. From your answer I'm guessing that each spider somehow has its own pipeline, so there are several pipelines working with the same file, and this is causing the trouble. Am I right? I'm doing it this way because there is another process that does some stuff with the contents and only accepts one file. – silvestrelosada Mar 25 '16 at 10:55
  • The phrasing isn't 100% accurate, but what you say is right. More specifically scrapyd runs multiple scrapy processes in parallel. Each scrapy process runs a spider and your pipeline. This means while doing this crawl, you will be opening the same file for write from many processes which will lead to corruption. "...only accepts one file". I'm certain there's a better way to do this. Can you run scrapyd, get e.g. 100 files out of it, [concat them to one](http://stackoverflow.com/questions/4969641/append-one-file-to-another-in-linux) and then feed that to your process? This is way more common. – neverlastn Mar 26 '16 at 12:36
  • Do you know of any option to put the results (items) of all spiders in the same file using just scrapy? Thanks – silvestrelosada Mar 28 '16 at 09:57
  • Thanks, I'll send the email; you can remove it – silvestrelosada Mar 28 '16 at 11:26
  • Thanks for the updated answer. I like what you propose; I'll go with your proposed solution or the RabbitMQ one. Rabbit has another advantage: in case I need to put the spiders on different servers for scaling purposes, it is easy to do, and that could happen in the future. But in any case I'll use the extension for simplicity or Rabbit for robustness and scalability. – silvestrelosada Mar 29 '16 at 08:07
  • Awesome! :) For best performance with RabbitMQ make sure you publish messages asynchronously ([example](http://pika.readthedocs.org/en/latest/examples/asynchronous_publisher_example.html) - if you are using the ack mechanism - `on_delivery_confirmation()` method) – neverlastn Mar 29 '16 at 09:48