
This is a follow-up to this Stack Overflow question:

Select by partial string from a pandas DataFrame

That question shows how to return rows based on a partial string:

df[df['A'].str.contains("hello")]

My question is how to return rows that contain multiple instances of a partial string.

For example, what if I want to return all rows where a particular column contains 3 instances of the partial string 'ology'. How would I do that?

Example:

import pandas as pd

testdf = pd.DataFrame([['test1', 'this is biology mixed with zoology'], ['test2', 'the cat and bat teamed up to find some food'], ['test2', 'anthropology with pharmacology and biology']])

testdf.head()


>0  1
>0  test1   this is biology mixed with zoology
>1  test2   the cat and bat teamed up to find some food
>2  test2   anthropology with pharmacology and biology

testdf = testdf[testdf[1].str.contains("ology")]
testdf.head()

>0  1
>0  test1   this is biology mixed with zoology
>2  test2   anthropology with pharmacology and biology

What I am looking for is rows with 3 instances of 'ology', so it would return only the last row:

>2  test2   anthropology with pharmacology and biology
Peter Force

2 Answers


In this case, use str.count rather than str.contains to count the number of occurrences of 'ology' in each row:

testdf[testdf['Col2'].str.count('ology').eq(3)]

Output:

    Col1                                        Col2
2  test2  anthropology with pharmacology and biology

Note that I named your columns Col1 and Col2.
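For completeness, here is a self-contained sketch of the same idea using the question's sample data. The names `counts`, `exactly_three`, and `at_least_two` are only illustrative, and the `.ge(2)` variant is an assumption about what a "this many or more" filter might look like, not something the question asked for.

```python
import pandas as pd

# Sample data from the question; column names follow this answer's naming.
testdf = pd.DataFrame(
    [
        ["test1", "this is biology mixed with zoology"],
        ["test2", "the cat and bat teamed up to find some food"],
        ["test2", "anthropology with pharmacology and biology"],
    ],
    columns=["Col1", "Col2"],
)

# Number of non-overlapping matches of the pattern in each row.
counts = testdf["Col2"].str.count("ology")

exactly_three = testdf[counts.eq(3)]  # rows with exactly 3 matches
at_least_two = testdf[counts.ge(2)]   # rows with 2 or more matches

print(counts.tolist())
print(exactly_three.index.tolist())
print(at_least_two.index.tolist())
```

Keeping the counts in their own Series makes it easy to inspect them before filtering, which can help when a row is unexpectedly dropped.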

Erfan
  • This seems to be removing rows that seem to qualify. I'll make a more specific example, but I am specifically looking for '\n', or line breaks if it makes a difference. Perhaps I should try '*\n*'? – Peter Force Jun 16 '19 at 22:43
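
For the line-break case raised in the comment above, a minimal sketch might look like the following. The data here is hypothetical, since the commenter's actual rows are not shown. A pattern such as '*\n*' is glob syntax rather than a regular expression, and Python's `re` module would most likely reject it because the leading `*` has nothing to repeat; a plain "\n" is enough for counting. If rows with more than three line breaks should also qualify, `.ge(3)` may be closer to the intent than `.eq(3)`.

```python
import pandas as pd

# Hypothetical data: the second entry contains exactly three line breaks.
s = pd.Series(["one line", "a\nb\nc\nd", "x\ny"])

# "\n" is the literal newline character, which the regex engine matches as-is.
newline_counts = s.str.count("\n")

# Keep only the entries with exactly three line breaks.
three_breaks = s[newline_counts.eq(3)]

print(newline_counts.tolist())
print(three_breaks.index.tolist())
```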

To use str.contains, you can pass a regular expression as the pat argument:

testdf[1].str.contains('(.*ology.*){3}')

Out[29]:
0    False
1    False
2     True
Name: 1, dtype: bool
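
One caveat worth illustrating: a repeated-group pattern like this matches rows with three or more occurrences, not exactly three. The sketch below adds a hypothetical fourth row (not in the question's data) to show the difference, and uses a non-capturing group `(?:...)`, which should avoid the warning pandas emits when a `str.contains` pattern has capturing groups. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(
    [
        "this is biology mixed with zoology",
        "the cat and bat teamed up to find some food",
        "anthropology with pharmacology and biology",
        "biology, zoology, geology and ecology",  # hypothetical row, 4 matches
    ]
)

# Regex search succeeds when 'ology' appears three or more times in a row's text.
at_least_three = s.str.contains(r"(?:ology.*){3}")

# Counting matches is one way to require exactly three.
exactly_three = s.str.count("ology").eq(3)

print(at_least_three.tolist())
print(exactly_three.tolist())
```

If exactly three occurrences are required, the count-based filter is probably the simpler choice.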
Andy L.