Extract Regex Substring from Pandas column that meets criteria

Question

Regex Extract all substrings that meet the following criteria:

first 4 chars are numbers and substring ends with number or letter
15 or 18 chars long

If there are 2 substrings that meet the criteria, just return the first one

df1 = pd.DataFrame(data ={"Messy_IDS":["Looking for ID : 7010M000002N8c5T7A","5634M000002N8c5T7A,7010M000002N8c5T7A","https://website.com/12340000000f5F5"], "Desired_Output":["7010M000002N8c5T7A","5634M000002N8c5T7A","12340000000f5F5"]})

df1

        Messy_IDS                                Desired_Output
   0    Looking for ID : 7010M000002N8c5T7A      7010M000002N8c5T7A
   1    5634M000002N8c5T7A,7010M000002N8c5T7A    5634M000002N8c5T7A
   2    https://website.com/12340000000f5F5      12340000000f5F5

can you share your attempt? and the expected result – RichieV Oct 15 '20 at 05:48 — RichieV, Oct 15 '20 at 05:48

jezrael · Accepted Answer · 2020-10-15T05:56:20.880

Use Series.str.extract with match by regex for first 4 digits and then for 11 or 14 digits or letters:

df['new'] = df['Messy_IDS'].str.extract('([0-9]{4}[0-9A-Za-z]{11,14})')

Or:

df['new'] = df['Messy_IDS'].str.extract('(\d{4}\w{11,14})')

print (df)
                               Messy_IDS      Desired_Output  \
0    Looking for ID : 7010M000002N8c5T7A  7010M000002N8c5T7A   
1  5634M000002N8c5T7A,7010M000002N8c5T7A  5634M000002N8c5T7A   
2    https://website.com/12340000000f5F5     12340000000f5F5   

                  new  
0  7010M000002N8c5T7A  
1  5634M000002N8c5T7A  
2     12340000000f5F5

score 0 · Answer 2 · answered Oct 15 '20 at 05:53

0

You can do it with this code

df1["Messy_IDS"].str.extract("(\d{4}\w+)")

answered Oct 15 '20 at 05:53

moys

6,776
2
5
27

Extract Regex Substring from Pandas column that meets criteria

2 Answers2