
I have a spider that has to find the «next» link (the one containing "»") in this HTML:

<div id="content-center">
    <div class="paginador">
      <span class="current">01</span>
      <a href="ml=0">02</a>
      <a href="ml=0">03</a>
      <a href="ml=0">04</a>
      <a href="ml=0">»</a>
      <a href="ml=0">Last</a>
    </div>
</div>

I am trying with this spider:

# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector, HtmlXPathSelector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "***"
    start_urls = [
    'http://www.***.com/10000000000177/',
    ]
    allowed_domains = ["http://www.***.com/"]
    def parse(self, response):
        s = Selector(response)
        page_list_urls = s.css('#content-center > div.listado_libros.gwe_libros > div > form > dl.dublincore > dd.title > a::attr(href)').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)
        hxs = HtmlXPathSelector(response)
        next_page = hxs.select(u"//*[@id='content-center']/div[@class='paginador']/a[text()='\u00bb']/@href").extract()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)
    def parse_following_urls(self, response):
        for each_book in response.css('div#container'):
            yield {
                'title': each_book.css('div#content > div#primary > div > h1.title-book::text').extract(),
            }

It does not recognize the link. Any idea how to solve that?

Thanks!

  • How about declaring the string inside `select` Unicode, i.e. `u'//a[text()='»']/@href'`? – Tomáš Linhart Jul 21 '17 at 08:39
  • You are aware that you are defining two strings with `»` as an operator between them, right? Or is that just an error in the markup of this question? If that's the case, this should work for you: `"//a[text()='»']/@href"` – Severin Jul 21 '17 at 08:39
  • Try to use workaround `//a[.="Last"]/preceding-sibling::a[position()=1]` – Andersson Jul 21 '17 at 08:45
  • @Severin, I don't understand: which two strings? @Tomáš, `u'//a[text()='»']/@href'` returns the same message… –  Jul 21 '17 at 10:28
  • @Andersson, I can't use the position because then it will never find the end… Quotes corrected –  Jul 21 '17 at 11:47
  • @Nikita, *end*? You mean that `"Last"` button is absent on the last page or what? – Andersson Jul 21 '17 at 11:50
  • Yes, in the last page there is no link with `»` –  Jul 21 '17 at 11:52
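
For reference, a quick way to check both XPath expressions suggested in these comments against the question's sample HTML, using a standalone Selector:

from scrapy.selector import Selector

html = u'''
<div id="content-center">
  <div class="paginador">
    <span class="current">01</span>
    <a href="ml=0">02</a>
    <a href="ml=0">\u00bb</a>
    <a href="ml=0">Last</a>
  </div>
</div>
'''

sel = Selector(text=html)
# Matching the » character directly (note the u"..." literal):
print(sel.xpath(u"//div[@class='paginador']/a[text()='\u00bb']/@href").extract_first())  # ml=0
# Andersson's workaround: the link immediately before "Last":
print(sel.xpath(u'//a[.="Last"]/preceding-sibling::a[position()=1]/@href').extract_first())  # ml=0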

3 Answers


I think BeautifulSoup will do the job:

from bs4 import BeautifulSoup

data = '''
<div class="pages">
  <span class="current">01</span>
  <a href="ml=0">02</a>
  <a href="ml=0">03</a>
  <a href="ml=0">04</a>
  <a href="ml=0">05</a>
  <a href="ml=0">06</a>
  <a href="ml=0">07</a>
  <a href="ml=0">08</a>
  <a href="ml=0">09</a>
  <a href="ml=0">10</a>
  <a href="ml=0">»</a>
  <a href="ml=0">Last</a>
</div>
'''

bsobj = BeautifulSoup(data, 'html.parser')
for a in bsobj.find_all('a'):
    if a.text == '»':
        print(a['href'])
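
For completeness, a rough sketch of how this matching could sit inside the spider's callback (assuming Scrapy's `response.text` holds the page HTML; the callback name is the question's own `parse`):

from bs4 import BeautifulSoup
from scrapy.http import Request

def parse(self, response):
    soup = BeautifulSoup(response.text, 'html.parser')
    for a in soup.find_all('a'):
        # u'\u00bb' is the » character
        if a.text == u'\u00bb':
            yield Request(response.urljoin(a['href']), callback=self.parse)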
ksai
  • Thanks @ksai. I imagine `from bs4 import BeautifulSoup` goes at the top of the file, and the rest inside the class, just after `allowed_domains = ["http://www.*****.com/"]`. Added full code to avoid confusion. –  Jul 21 '17 at 10:33

Try using the \u-escaped version of »:

>>> print(u'\u00bb')
»

like this in your .xpath() call (note the u"..." prefix for the string parameter):

hxs.select(u"//a[text()='\u00bb']/@href").extract()

Your spider .py file is probably using UTF-8:

>>> u'\u00bb'.encode('utf-8')
'\xc2\xbb'

so you can also use hxs.select(u"//a[text()='»']/@href").extract() (the u"..." prefix is still there), but you also need to tell Python what your .py encoding is.

One usually does that with # -*- coding: utf-8 -*- (or equivalent) at the top of the .py file (first line for example).

You can read more about Python source code encoding declarations in PEP 263 and in the Python documentation.
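
For instance, a minimal Python 2 file showing the declaration together with the two equivalent literals:

# -*- coding: utf-8 -*-

# With the encoding declared above, both spellings denote the same
# one-character unicode string:
assert u'»' == u'\u00bb'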

paul trmbrth
  • Hi @paul. If I add `# -*- coding: utf-8 -*-` to the beginning of the spider, and then the `u` prefix, it returns an error with the `»`… –  Jul 21 '17 at 10:38
  • Can you share the error (through pastebin perhaps)? Did you try the `\u00bb` way? Also note the comment from @Severin above: `'//a[text()='»']/@href'` should really be `u'//a[text()="»"]/@href'` or `u"//a[text()='»']/@href"` – paul trmbrth Jul 21 '17 at 10:39
  • Ok, I added the `# -*- coding: utf-8 -*-` and the full xpath `"u//div[@id='content-center']/div[@class='paginador']/a[text()='\u00bb']/@href"` and now it doesn't give any error. But it doesn't find the link. Is that correct? Updated the question! –  Jul 21 '17 at 10:58
  • `u` is to be prepended to what's in quotes: `u"//div[@id='content-center']/div[@class='paginador']/a[text()='\u00bb']/@href"`, not `"u//div[..."` – paul trmbrth Jul 21 '17 at 11:48
  • Corrected! But it still can't see the link… question updated! –  Jul 21 '17 at 11:54
  • It's hard to help you because `//div[@id='content-center']/div[@class='paginador']` does not match your sample HTML – paul trmbrth Jul 21 '17 at 13:25
  • Ok, I just realised that with your solution the spider finds the link, the encoding and `\u00bb` did the trick. The problem now is not there: although the spider finds the link, it doesn't follow it, don't know why. But this is another question… Thanks! –  Jul 21 '17 at 14:28

There are several things you could change in your code:

  1. You don't need to create or import Selector; the response object has both .css() and .xpath() methods, which are shortcuts to a selector. Docs
  2. HtmlXPathSelector is deprecated; you should use the selector's (or rather the response's) .xpath() method
  3. .extract() returns a list of URLs, so you can't build a single Request from it; use .extract_first() here (see the short demo after this list)
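
To make point 3 concrete, a quick standalone demonstration with a throwaway selector:

from scrapy.selector import Selector

sel = Selector(text=u'<a href="ml=0">\u00bb</a>')
print(sel.xpath('//a/@href').extract())        # [u'ml=0'], a list
print(sel.xpath('//a/@href').extract_first())  # u'ml=0', a single value (or None)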

Applying these points:

# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request


class YourCrawler(CrawlSpider):
    name = "***"
    start_urls = [
        'http://www.***.com/10000000000177/',
    ]
    allowed_domains = ["http://www.***.com/"]

    def parse(self, response):
        page_list_urls = response.css('#content-center > div.listado_libros.gwe_libros > div > form > dl.dublincore > dd.title > a::attr(href)').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)
        # u"\u00bb" is the » character; the u"..." prefix keeps it a unicode literal
        next_page = response.xpath(u"//*[@id='content-center']/div[@class='paginador']/a[text()='\u00bb']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    def parse_following_urls(self, response):
        for each_book in response.css('div#container'):
            yield {
                'title': each_book.css('div#content > div#primary > div > h1.title-book::text').extract(),
            }
Henrique Coura