Unable to get text from parent and child nodes/tags with Scrapy

Question

Before this is marked as a duplicate: I have searched SO and tried the solutions I found there.

The HTML I want to extract from is:

<span class="location">
    Mandarin Oriental Hotel
    <a class="" href="/search-results/Jalan+Pinang%252C+Kuala+Lumpur+City+Centre%252C+50088+Kuala+Lumpur%252C+Wilayah+Persekutuan./?state=Kuala+Lumpur" itemprop="addressRegion" title="Jalan Pinang, Kuala Lumpur City Centre, 50088 Kuala Lumpur, Wilayah Persekutuan.">
    Jalan Pinang, Kuala Lumpur City Centre, 50088 Kuala Lumpur, Wilayah Persekutuan.
    </a>
    ,
    <a class="" href="/search-results/?neighbourhood=Kuala+Lumpur&state=Kuala+Lumpur" title="Kuala Lumpur">
    Kuala Lumpur
    </a>
    ,
    <a class="" href="/search-results/?state=Kuala+Lumpur" title="Kuala Lumpur">
    Kuala Lumpur
    </a>
    <span class="" itemprop="postalCode">
        50088
    </span>
</span>

I want to get all the text inside //span[@class='location'], including the text of its child elements.

I have tried:

response.xpath("//span[@class='location']//text()").extract_first()
response.css("span.location *::text").extract_first()
response.css("span.location ::text").extract_first()

All of them only return Mandarin Oriental Hotel, not the full address.

EDIT: The text should yield

Mandarin Oriental Hotel Jalan Pinang, Kuala Lumpur City Centre, 50088 Kuala Lumpur, Wilayah Persekutuan., Kuala Lumpur, Kuala Lumpur 50088

I'm not a Scrapy user, but I guess this is because you're using `extract_first`. Try `" ".join(response.xpath("//span[@class='location']//text()").extract())` – Andersson, Nov 13 '18 at 09:44
@Andersson Unfortunately, that would yield the addresses of all the individual items on the page. The page: https://www.hungrygowhere.my/search-results/?search_location=Kuala+Lumpur – Amir Asyraf, Nov 13 '18 at 10:09
You mean that it returns all the addresses as a single string and you want a separate address for each result? – Andersson, Nov 13 '18 at 10:18

Andersson · Accepted Answer · 2018-11-13T11:13:55.040

3

Try the code below to get a string representation of each span that contains an address:

for entry in response.xpath("//div[@class='entry']"):
    print(entry.xpath("normalize-space(./span[@class='location'])").extract_first())

edited Nov 13 '18 at 11:13

answered Nov 13 '18 at 10:28

Andersson


Thank you, this works. I had to remove .extract() from the first line. `for entry in response.xpath("//div[@class='entry']"): print(entry.xpath("normalize-space(./span[@class='location'])").extract_first())` – Amir Asyraf Nov 13 '18 at 11:12
@AmirAsyraf, oh, right. Thanks for correcting me... Answer updated – Andersson Nov 13 '18 at 11:15

score 0 · Answer 2 · answered Nov 13 '18 at 10:15


response.css("span.location ::text").extract_first() returns only the first text node. Call response.css("span.location ::text").extract() instead to get all of the text nodes, then concatenate them.
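The concatenation step can be sketched in plain Python. The `join_text_nodes` helper and the sample list are assumptions for illustration; the list stands in for what `.extract()` might return for a single entry.

```python
def join_text_nodes(text_nodes):
    """Join text nodes and collapse every run of whitespace to one space."""
    return " ".join(" ".join(text_nodes).split())


# Sample text nodes for one entry, including the surrounding indentation
# and newlines that the raw HTML contains.
sample_nodes = [
    "\n    Mandarin Oriental Hotel\n    ",
    "\n    Kuala Lumpur\n    ",
    "\n        50088\n    ",
    "\n",
]

address = join_text_nodes(sample_nodes)
print(address)

# In a spider, one possible way to keep entries separate (assuming each
# result sits in a div.entry, as in the accepted answer) would be:
#   for entry in response.css("div.entry"):
#       join_text_nodes(entry.css("span.location ::text").extract())
```

Running the selector on each entry rather than on the whole response is what keeps the addresses from merging into one list.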

Alternatively, you can get the whole parent element and strip the tags from it:

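from w3lib.html import remove_tags

# Inside a spider callback such as parse(self, response):
data = response.css('span.location').get()  # HTML of the first matching span
if not data:
    return
result = remove_tags(data)  # drops the tags, keeps the text (whitespace included)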


vezunchik


.extract() would get the full address, but it would also get the addresses of every other entry/item on the page. I want the full address separately for each entry/item. – Amir Asyraf Nov 13 '18 at 11:01
