
I am quite new to Scrapy and want to try the following: extract some values from a webpage, store them in a variable, and use them in my main script. Therefore I followed their tutorial and changed the code for my purposes:

import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        global title # This would work, but there should be a better way
        title = response.css('title::text').extract_first()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start() # the script will block here until the crawling is finished

print(title) # Verify if it works and do some other actions later on...

This works so far, but I am pretty sure it is not good style, and may even have bad side effects, if I define the title variable as global. If I skip that line, I get the "undefined variable" error, of course :/ Therefore I am searching for a way to return the variable and use it in my main script.

I have read about item pipelines, but I was not able to make them work.
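For context, a pipeline as I understand it from the docs would look roughly like this (simplified; results and TitlePipeline are names I made up), but I am not sure this is the intended way to hand values back to the main script:

results = []  # module-level list the pipeline fills while the crawl runs

class TitlePipeline:
    def process_item(self, item, spider):
        results.append(item)  # called once per item the spider yields
        return item

enabled via custom_settings with 'ITEM_PIPELINES': {'__main__.TitlePipeline': 100} (the '__main__' path because everything lives in one script).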

Any help/ideas are greatly appreciated :) Thanks in advance!

MaGi

2 Answers


Making a variable global should work for what you need, but as you mentioned, it isn't good style.

I would actually recommend using a different service for communication between processes, something like Redis, so you won't have conflicts between your spider and any other process.

It is very simple to set up and use; the documentation has a very simple example.

Instantiate the Redis connection inside the spider and again in the main process (think of them as separate processes). The spider sets the variables and the main process reads (or gets) the information.
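A minimal sketch of that pattern, assuming a Redis server on localhost:6379 and the redis-py package (pip install redis); the key name quotes:title is arbitrary:

import scrapy
from scrapy.crawler import CrawlerProcess
import redis


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']
    custom_settings = {'LOG_ENABLED': False}

    def parse(self, response):
        # spider side: push the scraped value into Redis
        r = redis.Redis(host='localhost', port=6379, decode_responses=True)
        r.set('quotes:title', response.css('title::text').extract_first())


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(QuotesSpider)
    process.start()

    # main side: read the value back once the crawl is done
    r = redis.Redis(host='localhost', port=6379, decode_responses=True)
    print(r.get('quotes:title'))

This also keeps working if the spider later runs in a genuinely separate process, which is the point of going through an external store.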

eLRuLL
  • Thanks, for the short term I'll go with furas's and AndyWang's answers, but if I get the time I'll read into Redis :) – MaGi Jan 01 '18 at 18:21

Using global, as you know, is not good style, especially when you later need to extend your code.

My suggestion is to store the title in a file or list and use it in your main process. Or, if you want to handle the title in another script, just open the file and read the title there.

spider.py

import scrapy
from scrapy.crawler import CrawlerProcess

namefile = 'namefile.txt'
current_title_session = []  # titles stored in the current session

# read titles saved by previous runs; start empty if the file does not exist yet
try:
    with open(namefile, 'r', encoding='utf-8') as f:
        title_in_file = f.readlines()
except FileNotFoundError:
    title_in_file = []

file_append = open(namefile, 'a', encoding='utf-8')

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        title = response.css('title::text').extract_first()
        # only persist titles not already in the file or in this session
        if title + '\n' not in title_in_file and title not in current_title_session:
            file_append.write(title + '\n')
            current_title_session.append(title)


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(QuotesSpider)
    process.start()  # the script will block here until the crawling is finished
    file_append.close()  # flush the collected titles to disk
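To use the saved titles from the main process or another script, as suggested above, just read the file back; a minimal example (the script name is illustrative):

read_titles.py

# read back the titles that spider.py appended to namefile.txt
with open('namefile.txt', 'r', encoding='utf-8') as f:
    titles = [line.strip() for line in f]

print(titles)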
K. Andy wang
  • Thanks, this fixes the issue with the global statement, although I am not sure it is elegant to create another file to process it. Anyway, this is working fine for me :-) – MaGi Jan 01 '18 at 18:28