Why can't I scrape numbers over 999,999? (XPaths in Python)

Question

I am trying to scrape the number of travelers to foreign countries from a site (link in code). For some reason when I actually get the data displayed it ignores any number over 999,999. Maybe someone can spot what I'm missing here.

    import requests
    from lxml import html
    import csv
    import pandas as pd
    import re

    r = requests.get('http://data.worldbank.org/indicator/ST.INT.ARVL/countries/1W?page=4&order=wbapi_data_value_2014%20wbapi_data_value%20wbapi_data_value-last&sort=asc&display=default')
    data = html.fromstring(r.text)

    Data1995 = []
    Data_1995 = data.xpath("//tbody/tr[td]/td[2]/text()")

    for i in Data_1995:
        i = i.encode('ascii','ignore').strip()
        i = re.sub('[()]', '', i)  # removing ()
        Data1995.append(i)

    Data1995

Because those larger numbers are in another element. `1,750,000` — OneCricketeer, Apr 29 '16 at 16:39
Does that mean that it won't be possible for me to get all of the numbers with one Xpath? — Sam B, Apr 29 '16 at 16:46
http://stackoverflow.com/questions/5350666/xpath-or-operator-for-different-nodes — Padraic Cunningham, Apr 29 '16 at 17:07
@PadraicCunningham Thank you for your help! Got it figured out! — Sam B, Apr 29 '16 at 17:25

nwellnhof · Answer 1 · 2016-04-29T17:52:16.347

Another approach:

    Data1995 = []

    for elem in data.xpath("//tbody/tr[td]/td[2]"):
        i = elem.xpath("string(.)")
        i = i.encode('ascii','ignore').strip()
        i = re.sub('[()]', '', i)  # removing ()
        Data1995.append(i)

Omitting the text() step from the XPath expression will return the td elements. Then elem.xpath("string(.)") extracts the string-value of each td element. For element nodes, the string value "is the concatenation of the string-values of all text node descendants of the element node in document order."

I recommend this technique in general as it is much more robust. Take the following td element, for example:

    <td>A <i>simple</i> example</td>

Selecting td/text() will return two text nodes containing A and example. Typically, this is not what you want. The approach I described returns A simple example.
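
The difference can be checked on that td directly. This is only a minimal sketch: it uses the standard library's xml.etree.ElementTree instead of lxml (an assumption, made so it runs without extra installs), where joining itertext() gives the same result as string(.). lxml elements offer itertext() as well. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

td = ET.fromstring("<td>A <i>simple</i> example</td>")

# What td/text() selects: only the text nodes that are direct children of td.
# In ElementTree these are td.text plus the .tail of each child element.
direct_text_nodes = [t for t in [td.text] + [child.tail for child in td] if t]

# What string(.) returns: all descendant text nodes, joined in document order.
string_value = "".join(td.itertext())

print(direct_text_nodes)
print(string_value)
```

The word inside the i element only shows up in the second result, which is the same reason a number wrapped in a child element is missed by td[2]/text().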

score 0 · Answer 2 · answered Apr 29 '16 at 17:18

Putting together the comments from cricket_007 and Padraic Cunningham, you may try the following XPath:

    //tbody/tr[td]/td[2][not(span)]/text() |
    //tbody/tr[td]/td[2]/span/text()
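
For comparison, here is the same two-branch selection written as plain Python. It is only a sketch under assumptions: the table fragment is made up (the page's real markup is not reproduced here), and it uses the standard library's xml.etree.ElementTree rather than lxml. ElementTree's limited XPath subset has no `|`, `not()` or `text()`, so the two branches become an if/else.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: assumed markup, not copied from the real page.
SAMPLE = """<table>
  <tbody>
    <tr><th>Country</th><th>1995</th></tr>
    <tr><td>Country A</td><td>(52,000)</td></tr>
    <tr><td>Country B</td><td><span>1,750,000</span></td></tr>
  </tbody>
</table>"""

root = ET.fromstring(SAMPLE)

values = []
# tr[td] skips rows without td cells (the header row); td[2] is the second cell.
for td in root.findall(".//tbody/tr[td]/td[2]"):
    span = td.find("span")
    # span present -> td[2]/span/text(); otherwise -> td[2][not(span)]/text()
    raw = span.text if span is not None else td.text
    values.append(re.sub("[()]", "", (raw or "").strip()))  # removing ()

print(values)
```

Both cells end up in the list, whether or not the number sits inside a span.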

hr_117

I figured it out, but this is very close to what I got. Thank you for the suggestion! – Sam B Apr 29 '16 at 17:26
