
I'm looking for a simple tutorial explaining how to write items to RethinkDB from Scrapy. An equivalent for MongoDB can be found here.

crocefisso

1 Answer


Here is a translation of "Write items to MongoDB" line for line with RethinkDB.

A couple notes:

  • I'm not sure where crawler.settings are set.
  • The scrapy docs say process_item's second param item can be an object or dict, so the .insert(dict(item)) cast/conversion is probably necessary.

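# Written for the RethinkDB Python driver before 2.4; from 2.4 on, use
# `from rethinkdb import RethinkDB` and `r = RethinkDB()` instead.
import rethinkdb as r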

class RethinkDBPipeline(object):

    table_name = 'scrapy_items'

    def __init__(self, rethinkdb_uri, rethinkdb_port, rethinkdb_db):
        self.rethinkdb_uri = rethinkdb_uri
        self.rethinkdb_port = rethinkdb_port
        self.rethinkdb_db = rethinkdb_db

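    @classmethod
    def from_crawler(cls, crawler):
        # Settings come from the project's settings.py.
        return cls(
            rethinkdb_uri=crawler.settings.get('RETHINKDB_URI'),
            # __init__ requires a port; 28015 is RethinkDB's default client driver port.
            rethinkdb_port=crawler.settings.getint('RETHINKDB_PORT', 28015),
            rethinkdb_db=crawler.settings.get('RETHINKDB_DATABASE', 'items')
        )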

    def open_spider(self, spider):
        self.conn = r.connect(
            host=self.rethinkdb_uri,
            port=self.rethinkdb_port,
            db=self.rethinkdb_db)

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        r.table(self.table_name).insert(dict(item)).run(self.conn)
        return item
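
For the pipeline to run, Scrapy has to be told about it in the project's settings.py. The following is only a sketch of what that file might contain: the module path `myproject.pipelines` is a hypothetical placeholder for wherever the class actually lives, the host and database values are assumptions, and `RETHINKDB_PORT` only has an effect if `from_crawler` reads it. 28015 is RethinkDB's default client driver port, and 300 is an arbitrary pipeline order value.

```python
# settings.py (sketch) -- "myproject" is a hypothetical project name.

# Enable the pipeline; the integer sets its order relative to other
# pipelines (lower values run first).
ITEM_PIPELINES = {
    "myproject.pipelines.RethinkDBPipeline": 300,
}

# Connection settings looked up via crawler.settings in from_crawler.
RETHINKDB_URI = "localhost"    # assumed: RethinkDB running locally
RETHINKDB_PORT = 28015         # RethinkDB's default client driver port
RETHINKDB_DATABASE = "items"   # assumed database name
```

The database and the `scrapy_items` table are assumed to exist already; the pipeline as written does not create them.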
dalanmiller
  • Thank you for your code. Unfortunately, I was unable to implement it; I think I need to deepen my understanding of RethinkDB first. Crawler settings are set in settings.py. – crocefisso Apr 22 '16 at 23:27
  • @crocefisso, let me know if this eventually works. I'd love to post something showing how to set up RethinkDB with Scrapy based on this! – dalanmiller Apr 25 '16 at 17:30
  • @dalanmiller, I'm currently studying data science and learning Python, Scrapy and RethinkDB from scratch, since I have no background in computing (except as a hobby). For a class I started a project that uses RethinkDB and gets its data from Scrapy. Time was limited, so I could not implement a Scrapy pipeline storing items in RethinkDB as I wanted, and I ended up using the rethinkdb import function instead. For my next project (in a few weeks) I'm planning to try again and study the question more thoroughly. As soon as I'm back on it, I'll share my findings with you. – crocefisso Apr 25 '16 at 20:05
  • Great to hear @crocefisso! Check out http://slack.rethinkdb.com to join our Slack channel if you want some more immediate help. – dalanmiller Apr 26 '16 at 15:07
  • In addition to RETHINKDB_URI and RETHINKDB_DATABASE, I added a RETHINKDB_PORT setting to my RethinkDBPipeline and it worked great. A side note: if you are using the conda package manager, rethinkdb isn't available. I just copied rethinkdb out of the site-packages directory of a default Python (3.5) distribution into the Miniconda site-packages directory and had no issues. – zulumojo Jul 06 '16 at 17:56