
I want to get all the rows in df whose path column contains the substring new+ folder. The question "Select by partial string from a pandas DataFrame" and the answer by cs95 were very helpful for substrings like new+ or fol, but the results are not correct when I search for new+ folder.

>>> import pandas
>>> dft = pandas.DataFrame([['/new+folder/'], ['/new+ folder/']], columns=['a'])
>>> dft
               a
0   /new+folder/
1  /new+ folder/

Testing with query:

>>> print(dft.query('a.str.contains("new+")', engine='python').head())
               a
0   /new+folder/
1  /new+ folder/
>>> print(dft.query('a.str.contains("new+ ")', engine='python').head())
Empty DataFrame
Columns: [a]
Index: []
>>> print(dft.query('a.str.contains("new+ f")', engine='python').head())
Empty DataFrame
Columns: [a]
Index: []

Testing with contains:

>>> dft[dft['a'].str.contains('new+')]
               a
0   /new+folder/
1  /new+ folder/
>>> dft[dft['a'].str.contains('new+ ')]
Empty DataFrame
Columns: [a]
Index: []
>>> dft[dft['a'].str.contains('new+ f')]
Empty DataFrame
Columns: [a]
Index: []

How can I fix the incorrect results I get when a space follows a +, or, I suspect, other special characters?

Pandas 0.24.2 Python 3.7.3 64-bit

2 Answers

Yes, + is a special regex character, so you need to escape it if you want a working solution with query:

print(dft.query(r'a.str.contains("new\+ ")', engine='python').head())
               a
1  /new+ folder/

The solution with regex=False does not work inside query here:

print(dft.query('a.str.contains("new+ ", regex=False)', engine='python').head())

AttributeError: 'dict' object has no attribute 'append'

If you filter with boolean indexing instead, both solutions (escaping the + and regex=False) work.
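When the search string comes from a variable, escaping by hand is easy to get wrong. As a minimal sketch of the boolean-indexing route, the standard library's re.escape can build the escaped pattern; the names needle, pattern and result are illustrative and not part of the question:

```python
import re

import pandas as pd

dft = pd.DataFrame([['/new+folder/'], ['/new+ folder/']], columns=['a'])

# Literal substring to look for (illustrative variable name).
needle = 'new+ f'

# re.escape backslash-escapes regex metacharacters such as "+",
# so the pattern matches the text literally.
pattern = re.escape(needle)

# Boolean indexing with the escaped pattern (regex mode is the default).
result = dft[dft['a'].str.contains(pattern)]
print(result)
```

This should select only the row containing /new+ folder/, the same row the hand-escaped pattern finds.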

jezrael

Use the str.contains solution below:

>>> dft[dft['a'].str.contains('new+ f', regex=False)]
               a
1  /new+ folder/

+ is a regex quantifier, and str.contains treats the pattern as a regex by default, so "new+ " means "ne", one or more "w", then a space. Pass regex=False to make pandas match the string literally instead.
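To see why the searches in the question come back empty, here is a small sketch that uses Python's re module as a stand-in for the regex mode and the in operator as a stand-in for literal matching. The list paths and the result names are illustrative:

```python
import re

paths = ['/new+folder/', '/new+ folder/']

# As a regex, "new+" means "ne" followed by one or more "w",
# so it matches the "new" in both paths.
regex_plain = [p for p in paths if re.search('new+', p)]

# "new+ " needs a space right after the run of "w". In both paths the
# character after "new" is a literal "+", so nothing matches.
regex_space = [p for p in paths if re.search('new+ ', p)]

# A literal substring test treats "+" as an ordinary character.
literal_space = [p for p in paths if 'new+ ' in p]

print(regex_plain)
print(regex_space)
print(literal_space)
```

Only the literal test finds the path with the space, which is the behaviour regex=False asks pandas for.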

Timings:

>>> from timeit import timeit
>>> timeit(lambda: dft[dft['a'].str.contains(r'new\+ f')], number=10000)
7.6474129006344995
>>> timeit(lambda: dft[dft['a'].str.contains('new+ f', regex=False)], number=10000)
7.188472783778991

In this run, regex=False is slightly faster (about 7.19 s versus 7.65 s for 10,000 calls).

U11-Forward