
I am quite new to Scrapy and want to try the following: extract some values from a webpage, store them in a variable, and use them in my main script. Therefore I followed their tutorial and changed the code for my purposes:

import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        global title # This would work, but there should be a better way
        title = response.css('title::text').extract_first()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start() # the script will block here until the crawling is finished

print(title) # Verify if it works and do some other actions later on...

This works so far, but I am pretty sure it is not good style, and may even have bad side effects, if I define the title variable as global. If I skip that line, I get the "undefined variable" error, of course :/ Therefore I am searching for a way to return the variable and use it in my main script.

I have read about item pipelines, but I was not able to make them work.
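For context, a pipeline as I understand it from the docs would look roughly like this (simplified; results and TitlePipeline are names I made up), but I am not sure this is the intended way to hand values back to the main script:

results = []  # module-level list the pipeline fills while the crawl runs

class TitlePipeline:
    def process_item(self, item, spider):
        results.append(item)  # called once per item the spider yields
        return item

enabled via custom_settings with 'ITEM_PIPELINES': {'__main__.TitlePipeline': 100} (the '__main__' path because everything lives in one script).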

Any help/ideas are greatly appreciated :) Thanks in advance!

MaGi

2 Answers


Making a variable global should work for what you need, but as you mentioned, it isn't good style.

I would actually recommend using a different service for communication between processes, something like Redis, so you won't have conflicts between your spider and any other process.

It is very simple to set up and use; the documentation has a very simple example.

Instantiate the Redis connection inside the spider and again in the main process (think of them as separate processes). The spider sets the variables and the main process reads (or gets) the information.
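A minimal sketch of that pattern, assuming a Redis server on localhost:6379 and the redis-py package (pip install redis); the key name quotes:title is arbitrary:

import scrapy
from scrapy.crawler import CrawlerProcess
import redis


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']
    custom_settings = {'LOG_ENABLED': False}

    def parse(self, response):
        # spider side: push the scraped value into Redis
        r = redis.Redis(host='localhost', port=6379, decode_responses=True)
        r.set('quotes:title', response.css('title::text').extract_first())


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(QuotesSpider)
    process.start()

    # main side: read the value back once the crawl is done
    r = redis.Redis(host='localhost', port=6379, decode_responses=True)
    print(r.get('quotes:title'))

This also keeps working if the spider later runs in a genuinely separate process, which is the point of going through an external store.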

eLRuLL
  • Thanks, for the short term I'll go with furas's and AndyWang's answers, but if I get the time I'll read into Redis :) – MaGi Jan 01 '18 at 18:21

Using global, as you know, is not good style, especially when you later need to extend your code.

My suggestion is to store the title in a file or list and use it in your main process. Or, if you want to handle the title in another script, just open the file and read the title there.

spider.py

import scrapy
from scrapy.crawler import CrawlerProcess

namefile = 'namefile.txt'
current_title_session = []  # titles stored in the current session

# read titles saved by previous runs; start empty if the file does not exist yet
try:
    with open(namefile, 'r', encoding='utf-8') as f:
        title_in_file = f.readlines()
except FileNotFoundError:
    title_in_file = []

file_append = open(namefile, 'a', encoding='utf-8')

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        title = response.css('title::text').extract_first()
        # only persist titles not already in the file or in this session
        if title + '\n' not in title_in_file and title not in current_title_session:
            file_append.write(title + '\n')
            current_title_session.append(title)


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(QuotesSpider)
    process.start()  # the script will block here until the crawling is finished
    file_append.close()  # flush the collected titles to disk
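To use the saved titles from the main process or another script, as suggested above, just read the file back; a minimal example (the script name is illustrative):

read_titles.py

# read back the titles that spider.py appended to namefile.txt
with open('namefile.txt', 'r', encoding='utf-8') as f:
    titles = [line.strip() for line in f]

print(titles)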
K. Andy wang
  • Thanks, this fixes the issue with the global statement, although I am not sure it is elegant to create another file to process it. Anyway, this is working fine for me :-) – MaGi Jan 01 '18 at 18:28