Matching strings in a pandas dataframe (regex?) for sales offers

Question

Introduction

I have a pandas dataframe that has various columns with information about offers. I am interested in filtering the dataframe using one of the columns that has a string containing information about the offers.

There are two main categories I want to identify:

"up to x% off everything" - key here being the "up to"
"x% off everything"

Sample Strings found in the column

[Random_String] up to x% off everything [Random_String]
[Random_String] x% off everything [Random_String]
[Random_String] up to y% off foo, x% off everything [Random_String]
[Random_String] y% off foo, up to x% off everything [Random_String]
[Random_String] x% off everything, up to y% off foo [Random_String]
[Random_String] up to x% off everything, y% off foo [Random_String]

I am only interested in the x% off everything discount and in whether it is preceded by "up to"; I am not interested in the y% off a particular product line. The random strings could be any alphanumeric data, but it is only the discount I want to filter / categorise on.

Expected output

For the example strings given above, I would like the classification to fall into two groups: using "up to" or not using "up to". For each of the strings above this would be (random_string markers removed for readability):

up to x% off everything - "up to" group
x% off everything - not in the "up to" group
up to y% off foo, x% off everything - not in the "up to" group
y% off foo, up to x% off everything - "up to" group
x% off everything, up to y% off foo - not in the "up to" group
up to x% off everything, y% off foo - "up to" group

Current Methods

I have been trying to use a regex to filter the dataframe as follows:

df[df['column'].str.match('regex_here') == True]

However I am exceedingly new to regex, and as pointed out in the comments I had gotten some of the fundamentals wrong. I now have for the "up to" group:

df[df['column'].str.match(r'up to \d{1,2}% off', case=False) == True]

The Question

I am struggling to work out how to do the "not up to" classification. I have read through "Regular expression to match a line that doesn't contain a word", but I cannot see how to adapt it to my purposes.

So I believe there are 3 sub questions in this:

How do I manipulate the regex to accurately classify the "x% off everything" strings in the column?
Is there an easy way to classify the four extra strings I have provided, or does each require a separate case?
Is a regex even the correct way to do this? Throughout my reading I am feeling that it may not be appropriate - however it is the only way I can think of doing it so any input on this would be greatly appreciated.

I am fairly new to using python and regex for this kind of stuff, so please reach out if anything is unclear.

I can't tell what your actual question is. You say "This has been achieved using the following expressions", so it sounds like it's working. What is your question, then? — John Gordon, Oct 16 '20 at 15:25
@PatrickArtner Thanks for this - the link was super helpful. I am really new to regex and that has been a really useful link. I had previously been using pythex to check, and when testing my work in a jupyter notebook it seemed to mostly work (other than the 4 strings I classify above), which is why I thought it was working. — Calvin Gibson, Oct 16 '20 at 19:49
@JohnGordon I have overhauled the question and I hope it is clearer now. Please let me know if it is still unclear and I will try to explain as best as I can. — Calvin Gibson, Oct 16 '20 at 19:51
is the `everything` always present for `up to x% everything`? — Patrick Artner, Oct 17 '20 at 07:43

Patrick Artner · Accepted Answer · 2020-10-17T08:17:01.833

You misunderstood how a character class such as r'[abcd]' works in a regex pattern. The brackets list alternative characters: the class matches exactly one character out of 'a', 'b', 'c' or 'd'.

The same goes for '[1-100]': it matches one character that is either in the range '1' to '1' (that is, '1') or is '0'. It still matches only a single character.
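A quick check with Python's re module illustrates both points. This is only a sketch; the sample strings and the variable names are made up for the illustration.

```python
import re

# A character class consumes exactly ONE character from its set,
# so 'abc' yields three separate one-character matches.
single_chars = re.findall(r'[abcd]', 'abc')
print(single_chars)

# '[1-100]' is the range '1'-'1' plus '0' (listed twice), i.e. the same as '[01]'.
# fullmatch only succeeds if the whole string is one such character.
results = {}
for candidate in ['0', '1', '7', '42', '100']:
    results[candidate] = re.fullmatch(r'[1-100]', candidate) is not None
print(results)
```

Only '0' and '1' pass; '42' and '100' fail because the class never consumes more than one character.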

What you would need is something along the lines of

r'up to \d{1,3}% off'       - to match "up to 74% off"
r'(?<!up to )\b\d{1,3}% off'  - to match "777% off" with no 'up to' in front (negative lookbehind; \b keeps the match from starting mid-number)

To match 0 to 100 as numbers you could use something like r'(\d|[1-9]\d|100)' which matches a single digit (0-9), two digits starting with 1-9 or 100.
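As a small sanity check of that 0 to 100 pattern, the sketch below tests whole strings with re.fullmatch. The helper name is_percent and the test values are hypothetical, chosen only for this example.

```python
import re

# One digit, OR two digits without a leading zero, OR exactly 100.
PERCENT = re.compile(r'(\d|[1-9]\d|100)')

def is_percent(text):
    """Return True if text is a whole number from 0 to 100 without leading zeros."""
    return PERCENT.fullmatch(text) is not None

checks = {value: is_percent(value) for value in ['0', '7', '42', '100', '101', '007']}
print(checks)
```

When the group is embedded in a longer, unanchored pattern, putting \b in front of it keeps it from matching just the tail of a longer number such as the "77" in "777% off".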

Online regex tester like http://regex101.com are a good place to start developing regex based on demo data if you are not fully sure what to do - they translate the pattern into normal language to make them easier to understand.

You can partition your data based on r'up to \d{1,3}% off everything':

import pandas as pd

data = ["[Random_String] up to 40% off everything [Random_String]",
        "[Random_String] 41% off everything [Random_String]",
        "[Random_String] up to 99% off foo, 42% off everything [Random_String]",
        "[Random_String] y% off foo, up to 43% off everything [Random_String]",
        "[Random_String] 44% off everything, up to 99% off foo [Random_String]",
        "[Random_String] up to 45% off everything, 99% off foo [Random_String]",
        "some data that has neither in it",]

df = pd.DataFrame({"column":data})
df['up to'] = df['column'].str.match(r'.*(up to \d{1,3}% off) everything.*', case=False)

print(df.to_csv(sep='\t'))

Output:

    column                                                                  up to
0   [Random_String] up to 40% off everything [Random_String]                True
1   [Random_String] 41% off everything [Random_String]                      False
2   [Random_String] up to 99% off foo, 42% off everything [Random_String]   False
3   [Random_String] y% off foo, up to 43% off everything [Random_String]    True
4   [Random_String] 44% off everything, up to 99% off foo [Random_String]   False
5   [Random_String] up to 45% off everything, 99% off foo [Random_String]   True
6   some data that has neither in it                                        False
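
The question also asks how to pick out the "x% off everything" strings that have no "up to" in front. One possible sketch, not part of the code above, uses Series.str.contains (which searches anywhere in the string, so no leading ".*" is needed) together with a negative lookbehind. The column name 'group' and its three labels are hypothetical names chosen for this example.

```python
import pandas as pd

data = ["[Random_String] up to 40% off everything [Random_String]",
        "[Random_String] 41% off everything [Random_String]",
        "[Random_String] up to 99% off foo, 42% off everything [Random_String]",
        "[Random_String] y% off foo, up to 43% off everything [Random_String]",
        "[Random_String] 44% off everything, up to 99% off foo [Random_String]",
        "[Random_String] up to 45% off everything, 99% off foo [Random_String]",
        "some data that has neither in it"]
df = pd.DataFrame({"column": data})

# "up to" directly in front of the everything-discount.
up_to = r'up to \d{1,3}% off everything'
# The same discount NOT preceded by "up to " (negative lookbehind).
# \b stops the match from starting in the middle of a number such as "40".
plain = r'(?<!up to )\b\d{1,3}% off everything'

is_up_to = df['column'].str.contains(up_to, case=False, regex=True)
is_plain = df['column'].str.contains(plain, case=False, regex=True)

# Assumed labels: start with 'neither', then overwrite where a pattern was found.
df['group'] = 'neither'
df.loc[is_plain, 'group'] = 'not up to'
df.loc[is_up_to, 'group'] = 'up to'
groups = df['group'].tolist()
print(groups)
```

Keeping a separate 'neither' label avoids lumping rows without any everything-discount in with the "not up to" group.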

Hi @Patrick, to answer your comment above: it may say "everything" or it may say "site", but this was easy enough to work around using an or operator. I have tested this on the dataframe and it worked, thanks. — Calvin Gibson, Oct 19 '20 at 07:11

Ynefota · Answer 2 · 2020-10-16T21:01:37.950

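I think you should try it with ()-brackets. And you don't need ".*" at the beginning and the end when you search with a function such as re.findall, because the pattern is not anchored with "^" at the start or "$" at the end. (pandas' str.match, like re.match, only matches at the start of the string, so there the leading ".*" is still needed.)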

Edit:

I hope I understand you correctly now.

My Code:

import re
text = "[Random_String] up to 99% off foo, 50% off everything [Random_String]"
result = re.findall(r"(up to )?(\d{1,2})% off", text)
print(result) # output: [('up to ', '99'), ('', '50')]
print(result[0]) # output: ('up to ', '99') 
print(result[0][1]) # output: 99
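
The same optional-group idea can be carried over to a pandas column. The following is only a sketch under assumptions: it ties the pattern to the word "everything", uses Series.str.extract, and the group names up_to and percent as well as the sample strings are hypothetical.

```python
import re
import pandas as pd

offers = pd.Series([
    "[Random_String] up to 99% off foo, 50% off everything [Random_String]",
    "[Random_String] 20% off foo, up to 30% off everything [Random_String]",
    "no discount here",
])

# Named groups become the column names of the frame returned by str.extract.
# The optional group is empty (NaN) when "up to " is not directly in front.
pattern = r'\b(?P<up_to>up to )?(?P<percent>\d{1,3})% off everything'
parts = offers.str.extract(pattern, flags=re.IGNORECASE)

# Turn the optional group into a boolean flag.
parts['up_to'] = parts['up_to'].notna()
flags = parts['up_to'].tolist()
percents = parts['percent'].tolist()
print(parts)
```

Rows without any everything-discount end up with a missing percent value, so they can be told apart from the "not up to" rows.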

edited Oct 16 '20 at 21:01

answered Oct 16 '20 at 16:05


Hi @Ynefota, I believe I was slightly unclear in my original question, so I have reworked it now. – Calvin Gibson Oct 16 '20 at 20:21