
I am using Scrapy to crawl several websites, which may share redundant information.

For each page I scrape, I store the page's url, its title, and its html code in mongoDB. To avoid duplication in the database, I implemented a pipeline that checks whether a similar item is already stored. If so, I raise a DropItem exception.

My problem is that whenever I drop an item by raising a DropItem exception, Scrapy displays the entire content of the item in the log (stdout or file). Since I'm extracting the entire HTML code of each scraped page, whenever an item is dropped, the whole HTML code ends up in the log.

How could I silently drop an item without its content being shown?

Thank you for your time!

from scrapy import log
from scrapy.exceptions import DropItem


class DatabaseStorage(object):
    """ Pipeline in charge of database storage.

    The 'whole' item (with HTML and text) will be stored in mongoDB.
    """

    def __init__(self):
        # MongoConnector is a project-specific helper (not shown)
        self.mongo = MongoConnector().collection

    def process_item(self, item, spider):
        """ Method in charge of item validation and processing. """
        if item['html'] and item['title'] and item['url']:
            # insert the item in mongo only if not already present
            if self.mongo.find_one({'title': item['title']}):
                raise DropItem('Item already in db')
            else:
                self.mongo.insert(dict(item))
                log.msg("Item %s scraped" % item['title'],
                        level=log.INFO, spider=spider)
        else:
            raise DropItem('Missing information on item %s' % (
                ('scraped from ' + item['url']) if item.get('url')
                else item.get('title')))
        return item
Balthazar Rouberol

5 Answers


The proper way to do this appears to be to implement a custom LogFormatter for your project and change the logging level of dropped items.

Example:

from scrapy import log
from scrapy import logformatter

class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': log.DEBUG,
            'format': logformatter.DROPPEDFMT,
            'exception': exception,
            'item': item,
        }

Then in your settings file, something like:

LOG_FORMATTER = 'apps.crawler.spiders.PoliteLogFormatter'

I had bad luck just returning `None`, which caused exceptions in later pipelines.

jimmytheleaf
  • where does this go? Middlewares? Pipelines? – Xodarap777 Sep 01 '15 at 00:30
  • @Xodarap777, I think the `middlewares.py` file is more appropriate. Or you can create a new file such as `logformatter.py`. The code from this answer suggests putting the code in the same file as the spider. **Note**: this code is deprecated, but @mirosval's answer below has an updated, working version. – kupgov Dec 12 '16 at 09:45

In recent Scrapy versions, this has been changed a bit. I copied the code from @jimmytheleaf and fixed it to work with recent Scrapy versions:

import logging
from scrapy import logformatter


class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': logging.INFO,
            'msg': logformatter.DROPPEDMSG,
            'args': {
                'exception': exception,
                'item': item,
            }
        }
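As with the older answer, the formatter still has to be registered in your settings file; the module path below is just an example and depends on where you place the class:

LOG_FORMATTER = 'myproject.logformatter.PoliteLogFormatter'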
kupgov
mirosval
  • It worked great for me! I would suggest changing `'level': logging.INFO,` to `'level': logging.DEBUG,` and mentioning `LOG_FORMATTER = '...PoliteLogFormatter'` in the settings.py file – UriCS Dec 16 '16 at 01:07

Ok, I found the answer before even posting the question. I still think that the answer might be valuable to anyone having the same problem.

Instead of dropping the object with a DropItem exception, you just have to return a None value:

def process_item(self, item, spider):
    """ Method in charge of item validation and processing. """
    if item['html'] and item['title'] and item['url']:
        # insert the item in mongo only if not already present
        if self.mongo.find_one({'url': item['url']}):
            # returning None (instead of raising DropItem) keeps the
            # duplicate item out of the logs
            return
        else:
            self.mongo.insert(dict(item))
            log.msg("Item %s scraped" % item['title'],
                    level=log.INFO, spider=spider)
    else:
        raise DropItem('Missing information on item %s' % (
            ('scraped from ' + item['url']) if item.get('url')
            else item.get('title')))
    return item
Balthazar Rouberol
  • Doing this outputs a debug level log entry containing the string 'None' instead of a warning level log entry containing the dropped item. It's a fair solution at `--loglevel=INFO` or above. Ideally, `scrapy.core.scraper.Scraper` should allow easy access to configuration of the output in `_itemproc_finished`. – jah Jun 26 '13 at 02:01
  • @jah is correct. @jimmytheleaf's solution is the correct one in this instance. – Darian Moody Mar 30 '15 at 20:17
  • The problem with returning `None` is that it may not work with some middleware extensions, even core ones like the `FeedExporter` class, which is responsible for exporting the results to a file when you use the `-o file.csv` option: you will see a lot of errors in your log saying that the `None` object cannot be serialized. – Mariano Ruiz Nov 30 '18 at 17:32
  • This is a bad idea, as returning `None` does not stop the pipeline process and further processing still happens: if you return `None` in pipelineA, pipelineB will still receive `None` in `process_item` and most likely break, as it's expecting an `Item` or a `dict`, not `None`. – Granitosaurus Dec 10 '18 at 02:36
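To make that pitfall concrete, here is a minimal sketch with two hypothetical pipelines, where returning `None` from the first one breaks the second:

class PipelineA(object):
    def process_item(self, item, spider):
        if not item.get('url'):
            return None  # the item is silently swallowed here...
        return item


class PipelineB(object):
    def process_item(self, item, spider):
        # ...but Scrapy still calls the next pipeline with None, so this
        # lookup raises AttributeError: 'NoneType' object has no attribute 'get'
        title = item.get('title')
        return item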

Another solution to this problem is to adjust the `__repr__` method of your `scrapy.Item` subclass:

import scrapy


class SomeItem(scrapy.Item):
    scrape_date = scrapy.Field()
    spider_name = scrapy.Field()
    ...

    def __repr__(self):
        # an empty representation keeps the item out of the logs entirely
        return ""

This way the item will not show up at all in the logs.
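Note that, unlike the LogFormatter approach, this needs no entry in settings.py: the representation is defined on the item class itself, so it applies wherever the item is printed.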

Levon

As Levon indicates in his previous answer, it is also possible to overload the `__repr__` method of the Item you are processing.

This way, the message will still be displayed in the Scrapy log, but you can control how much of the code is shown in the log, for example the first 100 characters of the web page. Assuming you have an Item that represents an HTML page like this, the overload of `__repr__` could look like the following:

import scrapy


class MyHTMLItem(scrapy.Item):
    url = scrapy.Field()
    htmlcode = scrapy.Field()
    # ... other fields

    def __repr__(self):
        # log only the url and the first 100 characters of the HTML code
        s = ""
        s += "URL: %s\n" % self.get('url')
        s += "Code (chunk): %s\n" % (self.get('htmlcode') or '')[0:100]
        return s
Felipower