I am trying to scrape the number of travelers to foreign countries from a site (link in the code). For some reason, when I display the scraped data, any number over 999,999 is missing. Can anyone spot what I'm missing here?

    import requests
    from lxml import html
    import csv
    import pandas as pd
    import re

    r = requests.get('http://data.worldbank.org/indicator/ST.INT.ARVL/countries/1W?page=4&order=wbapi_data_value_2014%20wbapi_data_value%20wbapi_data_value-last&sort=asc&display=default')
    data = html.fromstring(r.text)

    Data1995 = []
    Data_1995 = data.xpath("//tbody/tr[td]/td[2]/text()")

    for i in Data_1995:
        i = i.encode('ascii','ignore').strip()
        i = re.sub('[()]', '', i)  # removing ()
        Data1995.append(i)

    Data1995
Sam B

2 Answers

Another approach:

    Data1995 = []

    for elem in data.xpath("//tbody/tr[td]/td[2]"):
        i = elem.xpath("string(.)")
        i = i.encode('ascii','ignore').strip()
        i = re.sub('[()]', '', i)  # removing ()
        Data1995.append(i)

Omitting the text() step from the XPath expression will return the td elements. Then elem.xpath("string(.)") extracts the string-value of each td element. For element nodes, the string value "is the concatenation of the string-values of all text node descendants of the element node in document order."

I recommend this technique in general as it is much more robust. Take the following td element, for example:

    <td>A <i>simple</i> example</td>

Selecting td/text() returns two separate text nodes, "A " and " example", and skips the text inside the i element. Typically, this is not what you want. The approach I described returns the full string "A simple example".
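To make the difference concrete, here is a minimal sketch that runs both expressions against a small made-up table. The markup is an assumption for illustration only: the first row reuses the td example above, and the second row wraps a number in a span, which is one way a value could go missing with text(). It is not the actual markup of the World Bank page, and it is written for Python 3, so the encode step is left out.

```python
from lxml import html

# Hypothetical table fragment, not the real page markup.
table = html.fromstring(
    "<table><tbody>"
    "<tr><td>Row 1</td><td>A <i>simple</i> example</td></tr>"
    "<tr><td>Row 2</td><td><span>1,234,000</span></td></tr>"
    "</tbody></table>"
)

# text() selects only the text nodes that are direct children of each td.
direct_text = table.xpath("//tbody/tr[td]/td[2]/text()")

# string(.) concatenates all descendant text nodes of each td.
string_values = [
    td.xpath("string(.)") for td in table.xpath("//tbody/tr[td]/td[2]")
]

print(direct_text)    # the span-wrapped number is absent here
print(string_values)  # one complete string per cell
```

With text(), the second cell contributes nothing because its only text node sits inside the span, while string(.) yields one value per cell regardless of nesting.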

nwellnhof

Putting together the comments from cricket_007 and Padraic Cunningham, you may try the following XPath:

    //tbody/tr[td]/td[2][not(span)]/text() |
    //tbody/tr[td]/td[2]/span/text()
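As a rough sketch of how this union behaves, the snippet below applies it to a small made-up table in which one value sits directly in the td and the other is wrapped in a span. The country names and numbers are placeholders, and the markup is only an assumption about how the page might nest its values.

```python
from lxml import html

# Hypothetical table fragment; names and figures are placeholders.
table = html.fromstring(
    "<table><tbody>"
    "<tr><td>Country A</td><td>999,000</td></tr>"
    "<tr><td>Country B</td><td><span>1,234,000</span></td></tr>"
    "</tbody></table>"
)

# Take the td's own text when there is no span, else the span's text.
union_xpath = (
    "//tbody/tr[td]/td[2][not(span)]/text() | "
    "//tbody/tr[td]/td[2]/span/text()"
)

# The union is returned in document order, so rows stay aligned.
values = [v.strip() for v in table.xpath(union_xpath)]
print(values)
```

Each cell contributes one text node here, so the resulting list lines up with the table rows.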
hr_117