I have a DataFrame on which I would like to use the 'str.contains()' method. I thought I had found how to do this when I read "pandas + dataframe - select by partial string". However, I keep getting a ValueError.

My DataFrame is as follows:

ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
F15D,18-May-12,"06405B2-Lebanon NH",,25-Jul-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
A036,1-Apr-12,"06CB8CF-Hanover NH","06CB8CF-Hanover NH",9-Aug-12
8944,19-Feb-12,"06D26AD-Hanover NH",,4-Feb-12
1004E,8-Jun-12,"06388B2-Lebanon NH",,24-Dec-11
11795,3-Jul-12,"0649597-White River VT","0649597-White River VT",30-Mar-12
30D7,11-Nov-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",30-Nov-11
3AE2,21-Feb-12,"06405B2-Lebanon NH",,26-Oct-12
B0FE,17-Feb-12,"06D1B9D-Hartland VT",,16-Feb-12
127A1,11-Dec-11,"064456E-Hanover NH","064456E-Hanover NH",11-Nov-12
161FF,20-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",3-Jul-12
A036,30-Nov-11,"063B208-Randolph VT","063B208-Randolph VT",
475B,25-Sep-12,"06D26AD-Hanover NH",,5-Nov-12
151A3,7-Mar-12,"06388B2-Lebanon NH",,16-Nov-12
CA62,3-Jan-12,,,
D31B,18-Dec-11,"06405B2-Lebanon NH",,9-Jan-12
20F5,8-Jul-12,"0669C50-Randolph VT",,3-Feb-12
8096,19-Dec-11,"0649597-White River VT","0649597-White River VT",9-Apr-12
14E48,1-Aug-12,"06D3206-Hanover NH",,
177F8,20-Aug-12,"063B208-Randolph VT","063B208-Randolph VT",5-May-12
553E,11-Oct-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",8-Mar-12
12D5F,18-Jul-12,"0649597-White River VT","0649597-White River VT",2-Nov-12
C6DC,13-Apr-12,"06388B2-Lebanon NH",,
11795,27-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",19-Jun-12
17B43,11-Aug-12,,,22-Oct-12
A036,11-Aug-12,"06D3206-Hanover NH",,19-Jun-12

Then I run the following code:

import pandas

test = pandas.read_csv('testcsv.csv')
test[test.TRAINER_MANAGING.str.contains('Han', na=False)]

and I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-e0c4624c9346> in <module>()
----> 1 test[test.TRAINER_MANAGING.str.contains('Han', na=False)]

.virtualenvs/ipython/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1958 
   1959             # also raises Exception if object array with NA values
-> 1960             if com._is_bool_indexer(key):
   1961                 key = np.asarray(key, dtype=bool)
   1962             return self._getitem_array(key)

.virtualenvs/ipython/lib/python2.7/site-packages/pandas/core/common.pyc in _is_bool_indexer(key)
    685         if not lib.is_bool_array(key):
    686             if isnull(key).any():
--> 687                 raise ValueError('cannot index with vector containing '
    688                                  'NA / NaN values')
    689             return False

ValueError: cannot index with vector containing NA / NaN values

I feel like I am missing something simple. Any help would be appreciated.

BigHandsome

1 Answer

Your string search still returns NaN values, whereas boolean indexing works with booleans only. It appears 'na=False' is not working (in this case?); I can replicate it on my machine with the latest released pandas version.

You can work around it by calling .fillna() on the result first, like this:

test[test.TRAINER_MANAGING.str.contains('Han').fillna(False)]

Which returns:

       ID ENROLLMENT_DATE    TRAINER_MANAGING    TRAINER_OPERATOR FIRST_VISIT_DATE
2    8096        8-Aug-12  0643D38-Hanover NH  0643D38-Hanover NH        25-Jun-12
3    A036        1-Apr-12  06CB8CF-Hanover NH  06CB8CF-Hanover NH         9-Aug-12
4    8944       19-Feb-12  06D26AD-Hanover NH                 NaN         4-Feb-12
7    30D7       11-Nov-12  06D95A3-Hanover NH  06D95A3-Hanover NH        30-Nov-11
10  127A1       11-Dec-11  064456E-Hanover NH  064456E-Hanover NH        11-Nov-12
11  161FF       20-Feb-12  0643D38-Hanover NH  0643D38-Hanover NH         3-Jul-12
13   475B       25-Sep-12  06D26AD-Hanover NH                 NaN         5-Nov-12
19  14E48        1-Aug-12  06D3206-Hanover NH                 NaN              NaN
21   553E       11-Oct-12  06D95A3-Hanover NH  06D95A3-Hanover NH         8-Mar-12
24  11795       27-Feb-12  0643D38-Hanover NH  0643D38-Hanover NH        19-Jun-12
26   A036       11-Aug-12  06D3206-Hanover NH                 NaN        19-Jun-12
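The workaround can be tried on a handful of the rows from the question without a file on disk. This is only a minimal sketch: the inline `csv_text` sample, the `io.StringIO` wrapper, and the extra `.astype(bool)` (added so the mask has a plain boolean dtype regardless of pandas version) are assumptions for illustration, not part of the original code.

```python
import io

import pandas as pd

# A few rows copied from the question; the CA62 row has no TRAINER_MANAGING.
csv_text = """ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
CA62,3-Jan-12,,,
8944,19-Feb-12,"06D26AD-Hanover NH",,4-Feb-12
"""

test = pd.read_csv(io.StringIO(csv_text))

# On the pandas version in the question, this result holds NaN for the
# rows where TRAINER_MANAGING is missing, which boolean indexing rejects.
raw = test.TRAINER_MANAGING.str.contains('Han')

# Replace any missing entries with False, then force a plain bool dtype.
mask = raw.fillna(False).astype(bool)

hanover = test[mask]
print(hanover)
```

On pandas versions where the `na` argument behaves as documented, `test.TRAINER_MANAGING.str.contains('Han', na=False)` should produce the same mask in one step.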

I have never used the str.contains function before, so I'm not sure whether this is a bug. If it should work as in your example, we should open an issue on GitHub.

Rutger Kassies
  • Thank you for the help. I may post over on github, but I am a complete newb when it comes to pandas, so I figured I would try my question out over here before I embarrassed myself. If you think it is worth a go, I will post over there as well. – BigHandsome Feb 06 '13 at 15:03
  • @BigHandsome I also think docs suggest `na` should be a fill value (though strangely worded) – Andy Hayden Feb 06 '13 at 15:07
  • @AndyHayden After rereading the API doc[1] and this ticket[2] I think my code should work. But I am pretty new to all this, so I will defer to you guys as to whether I should file a ticket. [1]:http://pandas.pydata.org/pandas-docs/stable/basics.html?highlight=startswith#vectorized-string-methods [2]:https://github.com/pydata/pandas/issues/1689 – BigHandsome Feb 06 '13 at 16:03
  • @BigHandsome must be a bug, works with `test.TRAINER_MANAGING.str.startswith('Han', na=True)` and `endswith`, which are the only ones tested against in the [commit](https://github.com/pydata/pandas/commit/7a1ea0a3d189b1be7cdd23a7b98f79effd381430) you link to. – Andy Hayden Feb 06 '13 at 16:16
  • @BigHandsome I added this as an [issue on github](https://github.com/pydata/pandas/issues/2806). – Andy Hayden Feb 06 '13 at 16:22
  • Thank you! I feel better knowing that it is not always me. :) – BigHandsome Feb 06 '13 at 17:17