I am looking for a way to save the spider output in a Python variable instead of saving it to a JSON file and reading it back in the program.

import os

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.wikipedia.org']

    def parse(self, response):
        yield {
            'text': response.css(".jsl10n.localized-slogan::text").extract_first()
        }

if __name__ == "__main__":
    if os.path.exists('result.json'):
        os.remove('result.json')
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': 'result.json'
    })

    process.crawl(TestSpider)
    process.start()

I want to avoid the step below and read the value directly instead of saving it to disk first:

import io
import json

with io.open('result.json', encoding='utf-8') as json_data:
    d = json.load(json_data)
    text = d[0]['text']
sarbjit
  • I think this is helpful. https://stackoverflow.com/questions/47993380/scrapy-store-returned-items-in-variables-to-use-in-main-script/48017202#48017202 – K. Andy wang Feb 02 '18 at 05:03

2 Answers


I ended up using a global variable to store the output, which solves my purpose.

import os

import scrapy
from scrapy.crawler import CrawlerProcess

outputResponse = {}

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.wikipedia.org']

    def parse(self, response):
        global outputResponse
        outputResponse['text'] = response.css(".jsl10n.localized-slogan::text").extract_first()

if __name__ == "__main__":
    if os.path.exists('result.json'):
        os.remove('result.json')
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })

    process.crawl(TestSpider)
    process.start()
sarbjit

You can also pass an object into the spider and mutate it, like this:

import os

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.wikipedia.org']

    def parse(self, response):
        self.outputResponse['text'] = response.css(".jsl10n.localized-slogan::text").extract_first()

if __name__ == "__main__":
    if os.path.exists('result.json'):
        os.remove('result.json')

    outputResponse = {}

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })
    process.crawl(TestSpider, outputResponse=outputResponse)
    process.start()

This works because every named argument passed to the spider constructor is assigned to the instance as an attribute; that's why you can use self.outputResponse inside the parse method and access the external object.
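The kwargs-to-attributes behavior can be sketched in plain Python (MiniSpider is a toy stand-in for illustration, not Scrapy's actual implementation). Note that this works only because parse *mutates* the shared dict; rebinding the attribute (self.outputResponse = {...}) would not affect the external object:

```python
class MiniSpider:
    """Toy stand-in illustrating how keyword arguments become attributes."""
    def __init__(self, **kwargs):
        # Each keyword argument is assigned to the instance as an
        # attribute, mirroring how crawl() kwargs reach the spider.
        for key, value in kwargs.items():
            setattr(self, key, value)

outputResponse = {}
spider = MiniSpider(outputResponse=outputResponse)

# Mutating the attribute mutates the same dict the caller holds.
spider.outputResponse['text'] = 'Wikipedia'
print(outputResponse)            # → {'text': 'Wikipedia'}
print(spider.outputResponse is outputResponse)  # → True
```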

stasdeep