Before this is marked as a duplicate: I've searched for and tried other solutions found on SO, namely:

  1. scrapy css selector: get text of all inner tags
  2. How to get the text from child nodes if it is parents to other node in Scrapy using XPath
  3. scrapy get the entire text including children

The HTML I want to extract from is:

<span class="location">
    Mandarin Oriental Hotel
    <a class="" href="/search-results/Jalan+Pinang%252C+Kuala+Lumpur+City+Centre%252C+50088+Kuala+Lumpur%252C+Wilayah+Persekutuan./?state=Kuala+Lumpur" itemprop="addressRegion" title="Jalan Pinang, Kuala Lumpur City Centre, 50088 Kuala Lumpur, Wilayah Persekutuan.">
    Jalan Pinang, Kuala Lumpur City Centre, 50088 Kuala Lumpur, Wilayah Persekutuan.
    </a>
    ,
    <a class="" href="/search-results/?neighbourhood=Kuala+Lumpur&state=Kuala+Lumpur" title="Kuala Lumpur">
    Kuala Lumpur
    </a>
    ,
    <a class="" href="/search-results/?state=Kuala+Lumpur" title="Kuala Lumpur">
    Kuala Lumpur
    </a>
    <span class="" itemprop="postalCode">
        50088
    </span>
</span>

I want to get all the text inside //span[@class='location'].

I have tried:

  1. response.xpath("//span[@class='location']//text()").extract_first()
  2. response.css("span.location *::text").extract_first()
  3. response.css("span.location ::text").extract_first()

All of them only return Mandarin Oriental Hotel, not the full address.
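To make the cause visible outside Scrapy, here is a minimal sketch using only the standard library. It is a stand-in, not Scrapy itself: `xml.etree.ElementTree` replaces the Scrapy selector, and `SNIPPET` is a simplified, hypothetical version of the markup above (attributes dropped so it parses as XML). Taking the first text node plays the role of `extract_first()`, and joining every node plays the role of `extract()` plus concatenation.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the markup in the question
# (attributes dropped; this is not the real page).
SNIPPET = """
<span class="location">
    Mandarin Oriental Hotel
    <a>Jalan Pinang, Kuala Lumpur City Centre</a>
    ,
    <a>Kuala Lumpur</a>
    <span>50088</span>
</span>
""".strip()

span = ET.fromstring(SNIPPET)

# itertext() walks every descendant text node in document order,
# much like //text() selects all of them.
text_nodes = list(span.itertext())

# Keeping only the first node mirrors what extract_first() gives back.
first_only = text_nodes[0].strip()

# Joining all nodes and collapsing whitespace recovers the whole address.
full_text = " ".join("".join(text_nodes).split())

print(first_only)
print(full_text)
```

The first print shows only the hotel name, which matches the behaviour described above; the second shows the text of every child as well.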

EDIT: The expected output is:

Mandarin Oriental Hotel Jalan Pinang, Kuala Lumpur City Centre, 50088 Kuala Lumpur, Wilayah Persekutuan., Kuala Lumpur, Kuala Lumpur 50088

Andersson
Amir Asyraf
  • I'm not a Scrapy user, but I guess this is because you're using `extract_first`. Try `" ".join(response.xpath("//span[@class='location']//text()").extract())` – Andersson Nov 13 '18 at 09:44
  • @Andersson That would yield addresses for all individual items in the page unfortunately. The page: https://www.hungrygowhere.my/search-results/?search_location=Kuala+Lumpur – Amir Asyraf Nov 13 '18 at 10:09
  • You mean that it returns all the addresses as a single string and you want a separate address for each result? – Andersson Nov 13 '18 at 10:18

2 Answers

Try the code below to get the string representation of each address span, one entry at a time:

for entry in response.xpath("//div[@class='entry']"):
    print(entry.xpath("normalize-space(./span[@class='location'])").extract_first())
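For readers unfamiliar with `normalize-space()`, the following is a rough pure-Python sketch of what it does to the span's string value: it trims the ends and collapses each run of spaces, tabs, and newlines into a single space. The helper names `normalize_space` and `tidy_commas` are hypothetical, introduced only for illustration. Note that because the commas in the markup sit on their own lines, normalization leaves a space in front of each one; `tidy_commas` is an optional extra step (an assumption, not part of the answer) for getting closer to the output shown in the question.

```python
import re

def normalize_space(text):
    """Approximate XPath normalize-space(): trim, then collapse whitespace runs."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" \t\r\n")

def tidy_commas(text):
    """Optional cleanup: drop the space that normalization leaves before a comma."""
    return text.replace(" ,", ",")

# A fragment of the span's string value, with the markup's line breaks intact.
raw = "\n    Wilayah Persekutuan.\n    \n    ,\n    \n    Kuala Lumpur\n    "

normalized = normalize_space(raw)
cleaned = tidy_commas(normalized)

print(normalized)
print(cleaned)
```

In a spider, `tidy_commas` could be applied to the string returned by `extract_first()` in the loop above.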
Andersson
  • Thank you, this works. I had to remove .extract() from the first line. `for entry in response.xpath("//div[@class='entry']"): print(entry.xpath("normalize-space(./span[@class='location'])").extract_first())` – Amir Asyraf Nov 13 '18 at 11:12
  • @AmirAsyraf, oh, right. Thanks for correcting me... Answer updated – Andersson Nov 13 '18 at 11:15

response.css("span.location ::text").extract_first() returns only the first text node. Call response.css("span.location ::text").extract() instead to get all of them, then concatenate the results.
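Calling `extract()` on the whole response collects the text of every address on the page, so the selection probably needs to be scoped to one entry at a time. Below is a hedged sketch of that idea. The names `join_text_nodes` and `parse_locations` are hypothetical, and `div.entry` is an assumption borrowed from the other answer's `//div[@class='entry']`; adjust it to the real page. `parse_locations` is only defined here, not run, because it needs a live Scrapy response.

```python
def join_text_nodes(nodes):
    """Concatenate text nodes and collapse every whitespace run to one space."""
    return " ".join("".join(nodes).split())

def parse_locations(response):
    """Yield one address string per entry (sketch; expects a Scrapy response)."""
    # Assumption: each search result is wrapped in <div class="entry">.
    for entry in response.css("div.entry"):
        # Selecting relative to the entry keeps other entries' text out.
        nodes = entry.css("span.location ::text").getall()
        if nodes:
            yield join_text_nodes(nodes)

# Stand-in for what getall() might return for a single entry.
sample = [
    "\n    Mandarin Oriental Hotel\n    ",
    "\n    Kuala Lumpur\n    ",
    "\n    ,\n    ",
    "\n        50088\n    ",
]
joined = join_text_nodes(sample)
print(joined)
```

`getall()` is the newer spelling of `extract()`; either should work on reasonably recent Scrapy versions.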

You can also get the whole parent element and strip the tags from it:

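from w3lib.html import remove_tags

# .get() returns the HTML of the first matching element, or None if nothing matches
data = response.css('span.location').get()
# Strip the tags only when an element was actually found
result = remove_tags(data) if data else None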
vezunchik
  • .extract() would get the full address, but it will also get all other addresses for every entry/item in the page. I want the full address separately for each entry/item. – Amir Asyraf Nov 13 '18 at 11:01