
I wanted to find out how to extract the quotes and authors from the first page of http://quotes.toscrape.com/ ONLY if the author's name is not Albert Einstein.

<div class="quote">
    <span class="text">
        "some quote"
    </span>
    <span>
        "by "
        <small class="author">Albert Einstein</small>
    </span>
</div>
<div class="quote">
    <span class="text">
        "some quote"
    </span>
    <span>
        "by "
        <small class="author">J.K. Rowling</small>
    </span>
</div>

I've done some searching, and the closest things I can find are the posts below. However, they only cover excluding elements whose attribute is not equal to something, not elements whose text value is not equal to something.

1 XPath for elements with attribute not equal or does not exist
2 Xpath test for ancestor attribute not equal string
3 How to use "not" in xpath?
4 Using not() in XPath

I currently have...

import scrapy
from scrapy.loader import ItemLoader


class AllSpider(scrapy.Spider):
    name = 'working'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        # Each quote on the page lives in its own <div class="quote">.
        divs = response.xpath("//div[@class='quote']")
        for div in divs:
            l = ItemLoader(item=AllItems(), selector=div)
            l.add_xpath('title', ".//span[@class='text']/text()")
            l.add_xpath('name', ".//small[@class='author']/text()")
            yield l.load_item()

class AllItems(scrapy.Item):
    link = scrapy.Field()
    title = scrapy.Field()
    name = scrapy.Field()
    domain = scrapy.Field()

I have tried the following, but it doesn't seem to change anything: I get the same results as without the added code. The only other approach I could think of is filtering after the crawl, using pandas on the exported .csv file, but if there's a way to do it in Scrapy, I would love to learn it. Any help would be appreciated!

def parse(self, response):
    divs = response.xpath("//div[@class='quote']")
    for div in divs:
        l = ItemLoader(item=AllItems(), selector=div)

        if l.add_xpath('name', ".//small[@class='author']/text()") != 'Albert Einstein':

            l.add_xpath('title', ".//span[@class='text']/text()")
            l.add_xpath('name', ".//small[@class='author']/text()")
            yield l.load_item()
carwave
  • Not an answer to your question but I would just do some post filtering on the output data. Not much point adding complex filtering logic in your scraper. – NomadMonad Apr 21 '20 at 01:04
  • Not sure what the answer might look like, but if there's an easy solution, I would like to add it so I don't have to take the extra step of post-filtering. – carwave Apr 21 '20 at 01:35
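For reference, the post-filtering route mentioned in the comments could look roughly like the sketch below. It uses the standard csv module rather than pandas, and it assumes the export has a plain `name` column; the function name `filter_rows` and the file names in the comment are made up for illustration.

```python
import csv
import io


def filter_rows(src, dst, blocked, column="name"):
    """Copy CSV rows from src to dst, skipping rows whose column is in blocked.

    src and dst are open text file objects. Returns the number of rows kept.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[column] not in blocked:
            writer.writerow(row)
            kept += 1
    return kept


# Hypothetical usage with real files (names are assumptions):
# with open("quotes.csv", newline="") as src, \
#         open("filtered.csv", "w", newline="") as dst:
#     filter_rows(src, dst, {"Albert Einstein"})
```

This keeps the spider untouched, at the cost of an extra step after every crawl.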

2 Answers


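Try adding a not(contains(...)) predicate to the author XPath, so the name is only extracted when it does not contain 'Albert Einstein':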

l.add_xpath('name', ".//small[@class='author'][not(contains(., 'Albert Einstein'))]/text()")

dram95
  • Thanks for that! While it does work, when I output to csv there's a blank cell wherever Albert Einstein was the author. – carwave Apr 25 '20 at 03:04

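After playing around with it, I found that either of these solutions works. My original attempt failed because add_xpath() does not return the extracted text, so the comparison against 'Albert Einstein' was always true. The fix is to extract the author name first and only then decide whether to load the item. The first version filters out a single value; the second filters out a list of values. Thanks to everyone who helped me out!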

def parse(self, response):
    divs = response.xpath("//div[@class='quote']")
    for div in divs:
        l = ItemLoader(item=AllItems(), selector=div)
        name = div.xpath(".//small[@class='author']/text()").get()
        if name != 'Albert Einstein':
            l.add_xpath('title', ".//span[@class='text']/text()")
            l.add_value('name', name)
            yield l.load_item()

or

def parse(self, response):
    authors_to_filter = ['Albert Einstein', 'Other Name']
    divs = response.xpath("//div[@class='quote']")
    for div in divs:
        l = ItemLoader(item=AllItems(), selector=div)
        name = div.xpath(".//small[@class='author']/text()").get()
        if name not in authors_to_filter:
            l.add_xpath('title', ".//span[@class='text']/text()")
            l.add_value('name', name)
            yield l.load_item()
carwave
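If the scraped names might carry stray whitespace or inconsistent casing, the membership test above could be pulled into a small helper. This is only a sketch: `normalize` and `is_blocked` are hypothetical names, and treating a missing author as blocked is a design choice rather than something the original code does.

```python
def normalize(name):
    """Collapse runs of whitespace and ignore case for comparison."""
    return " ".join(name.split()).casefold()


def is_blocked(name, blocked):
    """Return True when name is missing or matches an entry in blocked."""
    if name is None:
        # .get() returns None when the XPath matched nothing;
        # this sketch chooses to skip such quotes.
        return True
    return normalize(name) in {normalize(b) for b in blocked}
```

Inside parse(), the check would then read `if not is_blocked(name, authors_to_filter):`.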