
I'm trying to use df[df['col'].str.contains("string")] (described in these two SO questions: 1 & 2) to select rows based on a partial string match. Here's my code:

import requests
import json
import pandas as pd
import datetime

url = "http://api.turfgame.com/v4/zones/all" # get request returns .json 
r = requests.get(url)
df = pd.read_json(r.content) # create a df containing all zone info

print df[df['region'].str.contains("Uppsala")].head()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-23-55bbf5679808> in <module>()
----> 1 print df[df['region'].str.contains("Uppsala")].head()

C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1670         if isinstance(key, (Series, np.ndarray, list)):
   1671             # either boolean or fancy integer index
-> 1672             return self._getitem_array(key)
   1673         elif isinstance(key, DataFrame):
   1674             return self._getitem_frame(key)

C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in _getitem_array(self, key)
   1714             return self.take(indexer, axis=0, convert=False)
   1715         else:
-> 1716             indexer = self.ix._convert_to_indexer(key, axis=1)
   1717             return self.take(indexer, axis=1, convert=True)
   1718 

C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
   1083                     if isinstance(obj, tuple) and is_setter:
   1084                         return {'key': obj}
-> 1085                     raise KeyError('%s not in index' % objarr[mask])
   1086 
   1087                 return indexer

KeyError: '[ nan  nan  nan ...,  nan  nan  nan] not in index'

I don't understand why I get a KeyError, because df.columns returns:

Index([u'dateCreated', u'id', u'latitude', u'longitude', u'name', u'pointsPerHour', u'region', u'takeoverPoints', u'totalTakeovers'], dtype='object')

So the key is in the list of columns, and when I open the URL in a web browser I can find 739 instances of 'Uppsala'.

The column I'm searching holds nested JSON objects that look like this: {"id":200,"name":"Scotland","country":"gb"}. Do I have to do something special to search between the '{}' characters? Could somebody explain where I've made my mistake(s)?
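
In case it helps, here's a toy frame (made-up rows, not the real API data) that seems to reproduce the same NaN-filled mask I see in the error:

import pandas as pd

# made-up rows mimicking the parsed API response; 'region' holds dicts
toy = pd.DataFrame({"name": ["GrandSquare", "MoraStenRast"],
                    "region": [{"id": 200, "name": "Scotland", "country": "gb"},
                               {"id": 142, "name": "Uppsala", "country": "se"}]})

print toy["region"].str.contains("Uppsala")         # NaN for both rows, not True/False
print toy[toy["region"].str.contains("Uppsala")]    # raises a KeyError like the one above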

Jason
  • `df['region'].str.contains("Uppsala").fillna(False)` – behzad.nouri Sep 23 '14 at 22:03
  • @behzad.nouri Could you explain what `.fillna(False)` is supposed to do? I've used it and it appears to have changed all the values to 0. `df['region'].str.contains("Uppsala").fillna(False).describe()` shows a count of 29759 and 0 for all other statistics. – Jason Sep 23 '14 at 22:12
  • can you make sure the data frame is correctly parsed? what does `df.head()` look like? – behzad.nouri Sep 23 '14 at 22:14
  • @behzad.nouri Perhaps that's the problem. I can't copy and paste the output from my IDE (Canopy), but it looks like the 'cells' under the `region` column are filled with the nested JSON which I've pasted in the last paragraph of my question. – Jason Sep 23 '14 at 22:18

1 Answer


Looks to me like your region column contains dictionaries, which aren't really supported as elements, and so .str isn't working. One way to solve the problem is to promote the region dictionaries to columns in their own right, maybe with something like:

>>> region = pd.DataFrame(df.pop("region").tolist())
>>> df = df.join(region, rsuffix="_region")

after which you have

>>> df.head()
                dateCreated     id   latitude  longitude         name  pointsPerHour  takeoverPoints  totalTakeovers country  id_region             name_region
0  2013-06-15T08:00:00+0000  14639  55.947079  -3.206477  GrandSquare              1             185              32      gb        200                Scotland
1  2014-06-15T20:02:37+0000  31571  55.649181  12.609056   Stenringen              1             185               6      dk        172             Hovedstaden
2  2013-06-15T08:00:00+0000  18958  54.593570  -5.955772  Hospitality              0             250               1      gb        206        Northern Ireland
3  2013-06-15T08:00:00+0000  18661  53.754283  -1.526638  LanshawZone              0             250               0      gb        202  Yorkshire & The Humber
4  2013-06-15T08:00:00+0000  17424  55.949285  -3.144777   NoDogsZone              0             250               5      gb        200                Scotland

and

>>> df[df["name_region"].str.contains("Uppsala")].head()
                  dateCreated     id   latitude  longitude          name  pointsPerHour  takeoverPoints  totalTakeovers country  id_region name_region
28   2013-07-16T18:53:48+0000  20828  59.793476  17.775389  MoraStenRast              5             125             536      se        142     Uppsala
59   2013-02-08T21:42:53+0000  14797  59.570418  17.482116      BålWoods              3             155             555      se        142     Uppsala
102  2014-06-19T12:00:00+0000  31843  59.617637  17.077094       EnaAlle              5             125             168      se        142     Uppsala
328  2012-09-24T20:08:22+0000  11461  59.634438  17.066398      BluePark              6             110            1968      se        142     Uppsala
330  2014-08-28T20:00:00+0000  33695  59.867027  17.710792  EnbackensBro              4             140              59      se        142     Uppsala

(A hack workaround would be df["region"].apply(str).str.contains("Uppsala"), but I think it's best to clean the data right at the start.)
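
(If you'd rather flatten the nested region dicts as you load the data, pandas' json_normalize can do the same job. This is just a sketch: it assumes r is the requests response from the question, and the import path varies by pandas version, pandas.io.json.json_normalize in older releases and pd.json_normalize in recent ones.)

>>> from pandas.io.json import json_normalize   # pd.json_normalize in newer pandas
>>> flat = json_normalize(r.json())             # nested region dicts become region.id / region.name / region.country
>>> flat[flat["region.name"].str.contains("Uppsala")].head()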

DSM