
In the Vectorized String Methods documentation (http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods)...

In [204]: s3 = Series(['A', 'B', 'C', 'Aaba', 'Baca',
   .....:             '', np.nan, 'CABA', 'dog', 'cat'])
   .....: 

In [205]: s3
Out[205]: 
0       A
1       B
2       C
3    Aaba
4    Baca
5        
6     NaN
7    CABA
8     dog
9     cat
dtype: object

In [206]: s3.str.replace('^.a|dog', 'XX-XX ', case=False)
Out[206]: 
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6         NaN
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: object

Why, in the .replace() example above, are 'ba' and 'BA' not matched by the regular expression passed as the first argument to replace() and replaced with 'XX-XX '? As I read it, the pattern says: ^, then any character (.), then an a, or else dog; replace each match, starting at that "any character", with 'XX-XX ', regardless of case.

d8aninja

1 Answer


This is because 'ba' and 'BA' are not at the start of the string. The first alternative, ^.a, carries the ^ anchor, which asserts the position at the start of the string, so it can match at most once per string.

Specified by:
Reference - What does this regex mean?
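
To make the anchoring concrete, here is a minimal sketch using Python's standard `re` module rather than pandas (the helper name `replace_anchored` is made up for illustration). It assumes the pandas call in the question behaves like `re.sub` with the `IGNORECASE` flag, which is consistent with the output shown there. The extra input 'hotdog' is not from the question; it is included to show that ^ applies only to the `.a` alternative, not to `dog`.

```python
import re

# Same pattern as in the question: the ^ anchor belongs to ".a" only,
# because alternation (|) has the lowest precedence.
pattern = re.compile(r'^.a|dog', flags=re.IGNORECASE)


def replace_anchored(text):
    """Replace every match of the question's pattern with 'XX-XX '."""
    return pattern.sub('XX-XX ', text)


for word in ['Aaba', 'CABA', 'cat', 'hotdog']:
    # 'Aaba' and 'CABA' change only at position 0; the later 'ba'/'BA'
    # cannot satisfy ^ and are left alone.
    print(repr(word), '->', repr(replace_anchored(word)))
```

The 'dog' alternative still matches in the middle of 'hotdog', while the second 'ba' in 'Aaba' is never considered because it does not start at position 0.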

Unihedron
  • So the regex is only finding the first instance of that condition, replacing it and moving on? (I made a small edit at the end of my question to clarify what I think it's 'saying', but had not considered whether the search is repeated more than once per line.) So how would I find ALL instances of that condition and replace ALL of them with XX-XX? What changes in the regex? @unihedron – d8aninja Jul 23 '14 at 23:02
  • @Canuckish The regex will match `^.a` (any character except a newline, followed by an "a", at the start of the string) `|` (or) `dog` (the literal character sequence "dog") – Unihedron Jul 23 '14 at 23:04
  • @Canuckish You're welcome. You can remove the anchor, i.e. use `'.a|dog'` instead. – Unihedron Jul 23 '14 at 23:08
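
As a rough check of the suggestion to drop the anchor, the sketch below applies `'.a|dog'` with the standard `re` module (the helper name `replace_everywhere` is hypothetical, and `re.sub` stands in for the pandas call). With pandas itself the equivalent would presumably be `s3.str.replace('.a|dog', 'XX-XX ', case=False)`; depending on the pandas version, `regex=True` may also need to be passed for the pattern to be treated as a regular expression.

```python
import re

# Unanchored version of the pattern: ".a" may now match anywhere.
unanchored = re.compile(r'.a|dog', flags=re.IGNORECASE)


def replace_everywhere(text):
    """Replace every non-overlapping match with 'XX-XX '."""
    return unanchored.sub('XX-XX ', text)


for word in ['Aaba', 'Baca', 'CABA', 'dog', 'cat', 'A']:
    print(repr(word), '->', repr(replace_everywhere(word)))
```

Matches are found left to right and do not overlap, so 'Aaba' is consumed as 'Aa' then 'ba', and both are replaced. A single 'A' has no preceding character for `.` to match, so it is left unchanged.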