
I have the following Django models. I am not sure of the best way to save these inter-related objects, scraped by the spider, to the database in Django using Scrapy pipelines. It seems the Scrapy pipeline was built to handle only one 'kind' of item.

models.py

class Parent(models.Model):
    field1 = models.CharField(max_length=100)


class ParentX(models.Model):
    field2 = models.CharField(max_length=100)
    parent = models.OneToOneField(Parent, on_delete=models.CASCADE, related_name='extra_properties')


class Child(models.Model):
    field3 = models.CharField(max_length=100)
    parent = models.ForeignKey(Parent, on_delete=models.CASCADE, related_name='childs')

items.py

# uses DjangoItem https://github.com/scrapy-plugins/scrapy-djangoitem

class ParentItem(DjangoItem):
    django_model = Parent

class ParentXItem(DjangoItem):
    django_model = ParentX

class ChildItem(DjangoItem):
    django_model = Child

spiders.py

class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["abc.com"]
    start_urls = [
        "http://www.example.com",       # this page has ids of several Parent objects whose full details are in their individual pages

    ]

    def parse(self, response):
        parent_object_ids = []  # ids of the Parent objects scraped from this page

        for parent_id in parent_object_ids:
            url = "http://www.example.com/%s" % parent_id
            yield scrapy.Request(url, callback=self.parse_detail)

    def parse_detail(self, response):
        p = ParentItem()
        px = ParentXItem()
        c1 = ChildItem()
        c2 = ChildItem()

        # populate p, px and c1, c2 with various data from the response.body

        yield p
        yield px
        yield c1
        yield c2  # ... etc. c3, c4

pipelines.py -- not sure what to do here

class ScrapytestPipeline(object):
    def process_item(self, item, spider):


        # This is where storage to the database typically happens.
        # At this point, I don't know whether the item is a ParentItem, a ParentXItem or a ChildItem.

        # Ideally, I want to first create the Parent obj, then the ParentX obj (and point p.extra_properties = px), and then the child objects:
        # c1.parent = p, c2.parent = p

        # But I am not sure how to have the pipeline do this sequentially when the items can arrive in any order
dowjones123

2 Answers


If you want to do it sequentially, I suppose that if you store one item inside the other and unpack it in the pipeline, it might work.

I think it is easier to relate the objects before saving them to the database.

In spiders.py, when you "populate p, px and c1, c2 with various data from the response.body", you can also populate a "false" primary key constructed from the data of the object.

Then you can save the data, and update it in the model if it has already been scraped, in a single pipeline:

class ItemPersistencePipeline(object):
    def process_item(self, item, spider):
        try:
            item_model = item_to_model(item)
        except TypeError:
            return item
        model, created = get_or_create(item_model)
        try:
            update_model(model, item_model)
        except Exception as e:
            return e
        return item

of course the methods:

def item_to_model(item):
    model_class = getattr(item, 'django_model', None)
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")
    return item.instance

def get_or_create(model):
    model_class = type(model)
    created = False
    try:
        # We have no unique identifier at the moment,
        # so use the model's `primary` field for now
        obj = model_class.objects.get(primary=model.primary)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model instance for us.

    return (obj, created)

from django.forms.models import model_to_dict

def update_model(destination, source, commit=True):
    pk = destination.pk

    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)

    setattr(destination, 'pk', pk)

    if commit:
        destination.save()

    return destination

from: How to update DjangoItem in Scrapy

You should also define the field "primary" on the Django models, so the pipeline can check whether the newly scraped item already exists:

models.py

class Parent(models.Model):
    field1 = models.CharField(max_length=100)
    # a natural key built from the scraped data (not primary_key=True)
    primary = models.CharField(max_length=80)


class ParentX(models.Model):
    field2 = models.CharField(max_length=100)
    parent = models.OneToOneField(Parent, on_delete=models.CASCADE, related_name='extra_properties')
    primary = models.CharField(max_length=80)


class Child(models.Model):
    field3 = models.CharField(max_length=100)
    parent = models.ForeignKey(Parent, on_delete=models.CASCADE, related_name='childs')
    primary = models.CharField(max_length=80)
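One way to construct that "false" primary value is to derive it deterministically from the scraped fields, for example by hashing them. A framework-free sketch (the helper name and field choices are hypothetical):

```python
import hashlib

def make_primary(*field_values):
    """Build a deterministic natural key from scraped field values."""
    raw = "|".join(str(v) for v in field_values)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16]

# In parse_detail(), after populating the items:
#   p['primary'] = make_primary(p['field1'])
#   c1['primary'] = make_primary(p['field1'], c1['field3'])
```

Because the key depends only on the scraped data, re-scraping the same page produces the same `primary`, so `get_or_create` finds the existing row instead of creating a duplicate.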
Zartch

As eLRuLL pointed out, you can use isinstance to tell which item you are parsing each time.
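A minimal sketch of that isinstance dispatch (plain dict subclasses stand in for the DjangoItem classes here, and the save_* handlers are hypothetical stubs; a real pipeline would write to the Django ORM instead):

```python
class ParentItem(dict):
    pass

class ParentXItem(dict):
    pass

class ChildItem(dict):
    pass

class ScrapytestPipeline(object):
    def process_item(self, item, spider):
        # Branch on the concrete item class and route to the right handler.
        if isinstance(item, ParentItem):
            return self.save_parent(item)
        if isinstance(item, ParentXItem):
            return self.save_parent_x(item)
        if isinstance(item, ChildItem):
            return self.save_child(item)
        return item  # unknown item types pass through untouched

    # Hypothetical stubs -- each tags the item with the table it targets.
    def save_parent(self, item):
        item["table"] = "Parent"
        return item

    def save_parent_x(self, item):
        item["table"] = "ParentX"
        return item

    def save_child(self, item):
        item["table"] = "Child"
        return item
```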

However, if you want to avoid ever processing a child item in the pipeline before its parents, consider using a single Scrapy item for the combination of parent, parentX and child.

You might want to use nested items to do that cleanly.

Then, on your pipeline, take care of upserting the corresponding separate items into the database.
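A sketch of that single-item approach, with plain Python standing in for Scrapy and the ORM (the `PageItem` shape and `RelationalPipeline` name are assumptions): the spider yields one nested item per detail page, and the pipeline unpacks it in dependency order so the parent always exists before the rows that reference it.

```python
class PageItem(dict):
    """One item per scraped page: parent fields plus nested parentx/children."""

class RelationalPipeline(object):
    def __init__(self):
        self.db = []  # stands in for the Django ORM in this sketch

    def process_item(self, item, spider):
        # Save the parent first, then everything that points back at it.
        self.db.append(("Parent", item["parent"]))
        self.db.append(("ParentX", item["parentx"]))
        for child in item["children"]:
            self.db.append(("Child", child))
        return item

# The spider would yield a single nested item per detail page:
page = PageItem(
    parent={"field1": "a"},
    parentx={"field2": "b"},
    children=[{"field3": "c1"}, {"field3": "c2"}],
)
```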

Gallaecio