I am using Scrapy to crawl several websites, which may share redundant information.
For each page I scrape, I store the page's URL, title, and HTML code in MongoDB.
To avoid duplication in the database, I implemented a pipeline that checks whether a similar item is already stored. If it is, I raise a DropItem exception.
My problem is that whenever I drop an item by raising a DropItem exception, Scrapy displays the entire content of the item in the log (stdout or file). Since I extract the entire HTML code of each scraped page, whenever an item is dropped, the whole HTML code ends up in the log.
How could I silently drop an item without its content being shown?
Thank you for your time!
from scrapy import log
from scrapy.exceptions import DropItem


class DatabaseStorage(object):
    """ Pipeline in charge of database storage.
    The 'whole' item (with HTML and text) will be stored in mongoDB.
    """

    def __init__(self):
        self.mongo = MongoConnector().collection

    def process_item(self, item, spider):
        """ Method in charge of item validation and processing. """
        if item['html'] and item['title'] and item['url']:
            # Insert the item into mongo only if it is not already present.
            if self.mongo.find_one({'title': item['title']}):
                raise DropItem('Item already in db')
            else:
                self.mongo.insert(dict(item))
                log.msg("Item %s scraped" % item['title'],
                        level=log.INFO, spider=spider)
        else:
            raise DropItem('Missing information on item scraped from %s'
                           % (item.get('url') or item.get('title')))
        return item
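To make clearer what I am after, here is a small sketch of the log entry I would like to get when an item is dropped. The QuietDropFormatter class is hypothetical, just standing in for whatever Scrapy hook is appropriate (I suspect the fix lies in how Scrapy formats the "dropped item" log line rather than in my pipeline, but I do not know how to wire this in):

```python
import logging


class QuietDropFormatter:
    """ Hypothetical formatter illustrating the desired behaviour:
    log only the drop reason, never the item body, so the scraped
    HTML stays out of the log. """

    def dropped(self, item, exception, response, spider):
        return {
            'level': logging.INFO,
            'msg': "Dropped: %(exception)s",
            'args': {'exception': exception},
        }


fmt = QuietDropFormatter()
entry = fmt.dropped({'html': '<huge page>'},
                    Exception('Item already in db'), None, None)
print(entry['msg'] % entry['args'])  # prints "Dropped: Item already in db"
```

The point is that only the DropItem message appears in the log line, while the item's 'html' field is never rendered.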