-2

Regex Extract all substrings that meet the following criteria:

  • first 4 chars are numbers and substring ends with number or letter

  • 15 or 18 chars long

  • If there are 2 substrings that meet the criteria, just return the first one

    df1 = pd.DataFrame(data ={"Messy_IDS":["Looking for ID : 7010M000002N8c5T7A","5634M000002N8c5T7A,7010M000002N8c5T7A","https://website.com/12340000000f5F5"], "Desired_Output":["7010M000002N8c5T7A","5634M000002N8c5T7A","12340000000f5F5"]})
    

df1

        Messy_IDS                                Desired_Output
   0    Looking for ID : 7010M000002N8c5T7A      7010M000002N8c5T7A
   1    5634M000002N8c5T7A,7010M000002N8c5T7A    5634M000002N8c5T7A
   2    https://website.com/12340000000f5F5      12340000000f5F5
TH14
  • 590
  • 3
  • 20

2 Answers2

0

Use Series.str.extract with match by regex for first 4 digits and then for 11 or 14 digits or letters:

df['new'] = df['Messy_IDS'].str.extract('([0-9]{4}[0-9A-Za-z]{11,14})')

Or:

df['new'] = df['Messy_IDS'].str.extract('(\d{4}\w{11,14})')

print (df)
                               Messy_IDS      Desired_Output  \
0    Looking for ID : 7010M000002N8c5T7A  7010M000002N8c5T7A   
1  5634M000002N8c5T7A,7010M000002N8c5T7A  5634M000002N8c5T7A   
2    https://website.com/12340000000f5F5     12340000000f5F5   

                  new  
0  7010M000002N8c5T7A  
1  5634M000002N8c5T7A  
2     12340000000f5F5  
jezrael
  • 629,482
  • 62
  • 918
  • 895
0

You can do it with this code

df1["Messy_IDS"].str.extract("(\d{4}\w+)")
moys
  • 6,776
  • 2
  • 5
  • 27