Introduction
I have a pandas dataframe with various columns of information about offers. I want to filter the dataframe on one of these columns, which holds a free-text string describing each offer.
There are two main categories I want to identify:
- "up to x% off everything" - key here being the "up to"
- "x% off everything"
Sample Strings found in the column
- [Random_String] up to x% off everything [Random_String]
- [Random_String] x% off everything [Random_String]
- [Random_String] up to y% off foo, x% off everything [Random_String]
- [Random_String] y% off foo, up to x% off everything [Random_String]
- [Random_String] x% off everything, up to y% off foo [Random_String]
- [Random_String] up to x% off everything, y% off foo [Random_String]
I am only interested in the "x% off everything" discount and whether it is preceded by "up to"; I am not interested in the y% off a particular product line. The random strings could be any alphanumeric data, but it is only the discount I want to filter / categorise on.
Expected output
For the example strings given above, I would like the classification to fall into two groups: using "up to" or not using "up to". For each of the strings above this would be (random_string markers removed for readability):
- up to x% off everything - "up to" group
- x% off everything - not in the "up to" group
- up to y% off foo, x% off everything - not in the "up to" group
- y% off foo, up to x% off everything - "up to" group
- x% off everything, up to y% off foo - not in the "up to" group
- up to x% off everything, y% off foo - "up to" group
Current Methods
I have been trying to use a regex to filter the dataframe, using the below:
df[df['column'].str.match('regex_here')==True]
However I am exceedingly new to regex, and as pointed out in the comments I had gotten some of the fundamentals wrong. I now have for the "up to" group:
df[df['column'].str.match(r'up to \d{1,2}% off', case=False)==True]
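One thing worth checking with the filter above: `str.match` only matches at the start of each string, so a leading [Random_String] would stop it from matching, whereas `str.contains` searches anywhere in the string. Below is a minimal sketch of the difference. The sample rows, the values 50 and 20 standing in for x and y, and the extra "everything" in the pattern are my assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column name 'column' follows the snippets above.
df = pd.DataFrame({'column': [
    'SALE up to 50% off everything today',
    'SALE 50% off everything today',
    'SALE up to 20% off foo, 50% off everything today',
    'SALE 20% off foo, up to 50% off everything today',
    None,
]})

# Including "everything" keeps "up to 20% off foo" from counting as a hit.
pattern = r'up to \d{1,2}% off everything'

# str.match anchors at the start of the string, so the 'SALE ' prefix blocks every row.
matched = df[df['column'].str.match(pattern, case=False, na=False)]

# str.contains searches anywhere; na=False treats missing values as non-matches.
up_to = df[df['column'].str.contains(pattern, case=False, na=False)]
```

Passing `na=False` does the same job as the `==True` comparison, since it turns missing values into `False` before the mask is used for indexing.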
The Question
I am struggling to work out how to do the "not up to" classification. I have read through "Regular expression to match a line that doesn't contain a word", but I am struggling to adapt it to my purposes.
So I believe there are 3 sub questions in this:
- How do I manipulate the regex to accurately classify the "x% off everything" strings in the column?
- Is there an easy way to classify the four extra strings I have provided, or does each require a separate case?
- Is a regex even the correct way to do this? From my reading I get the feeling that it may not be appropriate, but it is the only way I can think of doing it, so any input on this would be greatly appreciated.
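To make the sub-questions concrete, here is a rough sketch of one possible approach rather than a confirmed solution: a single pattern with an optional "up to " group placed directly before the "x% off everything" part, so no negative lookahead is needed. The function name `classify`, the labels it returns, and the values 50 and 20 standing in for x and y are hypothetical.

```python
import re

# Optional "up to " immediately before the percentage that precedes "off everything".
OFFER_RE = re.compile(r'(up to )?\d{1,2}% off everything', re.IGNORECASE)

def classify(text):
    """Return 'up to', 'not up to', or None when no 'x% off everything' offer is found."""
    m = OFFER_RE.search(text)
    if m is None:
        return None
    # group(1) is only set when "up to " sits directly before the matched percentage
    return 'up to' if m.group(1) else 'not up to'

# The six sample shapes from the question, with 50 for x and 20 for y.
samples = [
    'abc up to 50% off everything xyz',
    'abc 50% off everything xyz',
    'abc up to 20% off foo, 50% off everything xyz',
    'abc 20% off foo, up to 50% off everything xyz',
    'abc 50% off everything, up to 20% off foo xyz',
    'abc up to 50% off everything, 20% off foo xyz',
]
labels = [classify(s) for s in samples]
```

If this behaves as intended, it could presumably be applied to the dataframe with something like `df['column'].dropna().map(classify)`, which would cover all six sample shapes without a separate case for each.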
I am fairly new to using Python and regex for this kind of thing, so please ask if anything is unclear.