pandas data mining from Eurostat

Question

I'm starting a work to analyse data from Stats Institutions like Eurostat using python, and so pandas. I found out there are two methods to get data from Eurostat.

pandas_datareader: it seems very easy to use but I found some problems to get some specific data
pandasdmx: I've found it a bit complicated but it seems a promising solution, but documentation is poor

I use a free Azure notebook, online service, but I don't think it will complicate more my situation.

Let me explain the problems for pandas_datareader. According to the pandas documentation, in the section API, there is this short documented package and it works. Apart from the shown example, that nicely works, a problem arises about other tables. For example, I can get data about European house price, which ID table is prc_hpi_a with this simple code:

import pandas_datareader.data as web
import datetime
df = web.DataReader('prc_hpi_a', 'eurostat')

But the table has three types of data about dwellings: TOTAL, EXISTING and NEW. I got only Existing dwellings and I don't know how to get the other ones. Do you have a solution for these types of filtering.

Secondly there is the path using pandasdmx. Here it is more complicated. My idea is to upload all data to a pandas DataFrame, and then I can analyse as I want. Easy to say, but I've not find many tutorials that explain this passage: upload data to pandas structures. For example, I found this tutorial, but I'm stuck to the first step, that is instantiate a client:

import pandasdmx
from pandasdmx import client
#estat=client('Eurostat', 'milk.db')

and it returns:

--------------------------------------------------------------------------- ImportError Traceback (most recent call last) in () 1 import pandasdmx ----> 2 from pandasdmx import client 3 estat=client('Eurostat', 'milk.db')

ImportError: cannot import name 'client'

What's the problem here? I've looked around but no answer to this problem

I also followed this tutorial:

from pandasdmx import Request
estat = Request('ESTAT')
metadata = estat.datastructure('DSD_une_rt_a').write()
metadata.codelist.iloc[8:18]
resp = estat.data('une_rt_a', key={'GEO': 'EL+ES+IE'}, params={'startPeriod': '2007'})
data = resp.write(s for s in resp.data.series if s.key.AGE == 'TOTAL')
data.columns.names
data.columns.levels
data.loc[:, ('PC_ACT', 'TOTAL', 'T')]

I got the data, but my purpose is to upload them to a pandas structure (Series, DataFrame, etc..), so I can handle easily according to my work. How to do that? Actually I did with this working line (below the previous ones):

s=pd.DataFrame(data)

But it doesn't work if I try to get other data tables. Let me explain with another example about the Harmonized Index Current Price table:

estat = Request('ESTAT')
metadata = estat.datastructure('DSD_prc_hicp_midx').write()
resp = estat.data('prc_hicp_midx')
data = resp.write(s for s in resp.data.series if s.key.COICOP == 'CP00')

It returns an error here, that is:

--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in () 2 metadata = estat.datastructure('DSD_prc_hicp_midx').write() 3 resp = estat.data('prc_hicp_midx') ----> 4 data = resp.write(s for s in resp.data.series if s.key.COICOP == 'CP00') 5 #metadata.codelist 6 #data.loc[:, ('TOTAL', 'INX_Q','EA', 'Q')]

~/anaconda3_501/lib/python3.6/site-packages/pandasdmx/api.py in getattr(self, name) 622 Make Message attributes directly readable from Response instance 623 ''' --> 624 return getattr(self.msg, name) 625 626 def _init_writer(self, writer):

AttributeError: 'DataMessage' object has no attribute 'data'

Why does it do not get data now? What's wrong now?

I lost almost a day looking around for some clear examples and explanations. Do you have some to propose? Is there a full and clear documentation? I found also this page with other examples, explaining the use of categorical schemes, but it is not for Eurostat (as explained at some point)

Both methods could work, apart from some explained issues, but I need also a suggestion to have a definitely method to use, to query Eurostat but also many other institutions like OECD, World Bank, etc... Could you guide me to a definitive and working solution, even if it is different for each institution?

score 2 · Accepted Answer · answered May 25 '18 at 19:34

That's my definitive answer to my question that works for each type of data collected from Eurostat. I post here because it can be useful for many.

Let me propose some examples. They produce three pandas series (EU_unempl,EU_GDP,EU_intRates) with data and correct time indexes

#----Unemployment Rate---------
dataEU_unempl=pd.read_json('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/ei_lmhr_m?geo=EA&indic=LM-UN-T-TOT&s_adj=NSA&unit=PC_ACT',typ='series',orient='table',numpy=True) #,typ='DataFrame',orient='table'
x=[]
for i in range(int(sorted(dataEU_unempl['value'].keys())[0]),1+int(sorted(dataEU_unempl['value'].keys(),reverse=True)[0])):
    x=numpy.append(x,dataEU_unempl['value'][str(i)])
EU_unempl=pd.Series(x,index=pd.date_range((pd.to_datetime((sorted(dataEU_unempl['dimension']['time']['category']['index'].keys())[(sorted(int(v) for v in dataEU_unempl['value'].keys())[0])]),format='%YM%M')), periods=len(x), freq='M')) #'1/1993'


#----GDP---------
dataEU_GDP=pd.read_json('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/namq_10_gdp?geo=EA&na_item=B1GQ&s_adj=NSA&unit=CP_MEUR',typ='series',orient='table',numpy=True) #,typ='DataFrame',orient='table'
x=[]
for i in range((sorted(int(v) for v in dataEU_GDP['value'].keys())[0]),1+(sorted((int(v) for v in dataEU_GDP['value'].keys()),reverse=True))[0]):
    x=numpy.append(x,dataEU_GDP['value'][str(i)])
EU_GDP=pd.Series(x,index=pd.date_range((pd.Timestamp(sorted(dataEU_GDP['dimension']['time']['category']['index'].keys())[(sorted(int(v) for v in dataEU_GDP['value'].keys())[0])])), periods=len(x), freq='Q'))


#----Money market interest rates---------
dataEU_intRates=pd.read_json('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/irt_st_m?geo=EA&intrt=MAT_ON',typ='series',orient='table',numpy=True) #,typ='DataFrame',orient='table'
x=[]
for i in range((sorted(int(v) for v in dataEU_intRates['value'].keys())[0]),1+(sorted((int(v) for v in dataEU_intRates['value'].keys()),reverse=True))[0]):
    x=numpy.append(x,dataEU_intRates['value'][str(i)])
EU_intRates=pd.Series(x,index=pd.date_range((pd.to_datetime((sorted(dataEU_intRates['dimension']['time']['category']['index'].keys())[(sorted(int(v) for v in dataEU_intRates['value'].keys())[0])]),format='%YM%M')), periods=len(x), freq='M'))

score 1 · Answer 2 · answered May 18 '18 at 11:32

1

The general solution is to not rely on overly-specific APIs like datareader and instead go to the source. You can use datareader's source code as inspiration and as a guide for how to do it. But ultimately when you need to get data from a source, you may want to directly access that source and load the data.

One very popular tool for HTTP APIs is requests. You can easily use it to load JSON data from any website or HTTP(S) service. Once you have the JSON, you can load it into Pandas. Because this solution is based on general-purpose building blocks, it is applicable to virtually any data source on the Web (as opposed to e.g. pandaSDMX, which is only applicable to SDMX data sources).

answered May 18 '18 at 11:32

John Zwinck

207,363
31
261
371

Could you provide some simple examples to solve some of issues I encountered using that package? I mean to be specific on tables I proposed. Thanks – SPS May 18 '18 at 11:57
@SPS: Better for you to try it and see how far you can get. – John Zwinck May 18 '18 at 12:02
I'm struggling with this `import requests import io import pandas as pd r=requests.get('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_midx?geo=EA&coicop=AP&unit=I96&time=2018M04&time=2018M03').content rawData = pd.read_csv(io.StringIO(r.decode('utf-8'))) rawData.describe()` According to [this](http://ec.europa.eu/eurostat/web/json-and-unicode-web-services/getting-started/rest-request), I'd get data about AP prices of Euro countries in 2018M04 and M03. It reports: _{"error":{"status":"400" label:"Dataset contains no data. filtering elements are probably invalid."}}_ – SPS May 18 '18 at 16:19
That error comes from the data source (the URL you tried to get). It has nothing to do with Python, it's simply that you are not using the HTTP API correctly. You can test it using your web browser by clicking: http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_midx?geo=EA&coicop=AP&unit=I96&time=2018M04&time=2018M03 then try to fix that, then use that URL in Python. – John Zwinck May 21 '18 at 03:31
It is clear, but how to get it solved definitely? I mean, how to get Harmonized Index Current Price data correctly from Eurostat? – SPS May 21 '18 at 10:31
Read their documentation for their API and try to figure it out. – John Zwinck May 21 '18 at 12:20
This should be [the correct address](http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_midx?geo=EA&coicop=CP00&unit=I96), but with the commands `r=requests.get('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_midx?geo=EA&coicop=CP00&unit=I96').content rawData = pd.read_csv(io.StringIO(r.decode('utf-8'))) rawData.describe() ` it returns just a list of zeros. Now I think it's a question on how to get data from requests package tool – SPS May 23 '18 at 15:08
The API returns JSON. Why do you try to read it using `read_csv()`? Try `read_json()`! – John Zwinck May 24 '18 at 08:19
Already thought and tried following [this](https://pandas.pydata.org/pandas-docs/stable/io.html#) but not solved yet – SPS May 24 '18 at 15:25
with the command: `data=pd.read_json(url,typ='series',orient='table') data.keys()` it lists all the keys available. All them are [JSON objects](http://www.json.org/). The problem is that if you give this command `data['value']` it lists a series, but the value are strange:{'327': 71.03, '328': 71.64, '329': 71.72, '330': 71.7,....} and not {"0":98.9,"1":99.4,"2":99.7...} Why? Is there a codification problem? How to solve? Rememeber the [correct address](http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_midx?geo=EA&coicop=CP00&unit=I96) – SPS May 25 '18 at 08:50
`"0":98.9` is in `data['value']`, it's just not the first, because dictionaries are unordered. – John Zwinck May 25 '18 at 11:42

pandas data mining from Eurostat

2 Answers2