I wanted to find out how to extract the quotes and authors from the first page of http://quotes.toscrape.com/ ONLY if the author's name is not Albert Einstein.
<div class="quote">
    <span class="text">
        "some quote"
    </span>
    <span>
        "by "
        <small class="author">Albert Einstein</small>
    </span>
</div>
<div class="quote">
    <span class="text">
        "some quote"
    </span>
    <span>
        "by "
        <small class="author">J.K. Rowling</small>
    </span>
</div>
I've done some searching, and the closest things I can find are the posts below, but they only cover skipping an element when an attribute is not equal to something, not when the element's text value is not equal to something.
1 XPath for elements with attribute not equal or does not exist
2 Xpath test for ancestor attribute not equal string
3 How to use "not" in xpath?
4 Using not() in XPath
I currently have...
import scrapy
from scrapy.loader import ItemLoader


class AllSpider(scrapy.Spider):
    name = 'working'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">
        divs = response.xpath("//div[@class='quote']")
        for div in divs:
            l = ItemLoader(item=AllItems(), selector=div)
            l.add_xpath('title', ".//span[@class='text']/text()")
            l.add_xpath('name', ".//small[@class='author']/text()")
            yield l.load_item()


class AllItems(scrapy.Item):
    link = scrapy.Field()
    title = scrapy.Field()
    name = scrapy.Field()
    domain = scrapy.Field()
I have also tried the following, but it doesn't seem to do anything: I get the same results as without the added code. Any help would be appreciated! The only other approach I can think of is to filter the exported .csv file with pandas after the crawl, but if there's a way to do it in Scrapy, I would love to learn it!
    def parse(self, response):
        divs = response.xpath("//div[@class='quote']")
        for div in divs:
            l = ItemLoader(item=AllItems(), selector=div)
            if l.add_xpath('name', ".//small[@class='author']/text()") != 'Albert Einstein':
                l.add_xpath('title', ".//span[@class='text']/text()")
                l.add_xpath('name', ".//small[@class='author']/text()")
                yield l.load_item()