I have a CSV file with the columns [Name, Email, Address, Credit Card]. I want to apply a regex to each column and check whether its values are valid. For example, every value in the Email column should be an email address.

import re

count = 0
for i in df['Email']:
    # collect every substring of the form something@something
    lst = re.findall(r'\S+@\S+', i)
    if lst != None:
        count = count + 1
        print("Match Number : ", count, "Match Found :   ", lst)
    else:
        print(i, " is not a valid email")

For a valid email such as 'xyz@gmail.com', the output should look like this: Match Number: 100['xyz@gmail.com']. The problem is that an invalid email such as 'notvalidemail' produces this instead: Match Number: 101[]. The else branch is never executed! Can someone please help me with this?

1 Answer

I wouldn't use a for loop for this; pandas already has handy string methods for regex matching. Using the same regular expression, you can do something like this:

valid_emails = df.loc[df.Email.str.match(r'\S+@\S+'), 'Email']
print(valid_emails)

You can also select the invalid emails by negating the output of the match method:

not_valid_emails = df.loc[~(df.Email.str.match(r'\S+@\S+')), 'Email']
print(not_valid_emails)
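
If the Email column can contain missing values, the boolean mask may pick up NaN entries and the negation can misbehave. Here is a minimal, self-contained sketch of the same idea that passes na=False to str.match so missing values simply count as invalid. The sample DataFrame is made up for illustration; it is not the asker's data.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV's Email column
df = pd.DataFrame({"Email": ["xyz@gmail.com", "notvalidemail", None, "a@b.org"]})

# na=False treats missing values as "no match", so the mask is purely boolean
mask = df["Email"].str.match(r"\S+@\S+", na=False)

valid_emails = df.loc[mask, "Email"]
not_valid_emails = df.loc[~mask, "Email"]

print(valid_emails.tolist())
print(not_valid_emails.tolist())
```

Note that str.match only anchors the pattern at the start of each value. If the whole value has to match, str.fullmatch (available since pandas 1.1) is probably the closer fit.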

Or even quickly count the number of valid and invalid emails:

df.Email.str.match(r'\S+@\S+').value_counts()

Or, if you prefer proportions rather than raw counts:

df.Email.str.match(r'\S+@\S+').value_counts(normalize=True)
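
As for why the else branch in the question never runs: re.findall always returns a list, and an empty list is still not None, so a comparison against None is always true. If you do want to keep the loop, a minimal sketch with a truthiness check could look like this. The two-item list is a hypothetical stand-in for df['Email'], and the invalid list is only added here for illustration.

```python
import re

# Hypothetical sample values standing in for df['Email']
emails = ["xyz@gmail.com", "notvalidemail"]

count = 0
invalid = []  # illustrative helper, not part of the original loop
for value in emails:
    # findall returns [] when nothing matches; an empty list is falsy
    matches = re.findall(r"\S+@\S+", value)
    if matches:
        count += 1
        print("Match Number:", count, "Match Found:", matches)
    else:
        invalid.append(value)
        print(value, "is not a valid email")
```

With this check, values that produce no match fall through to the else branch instead of being counted.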
jcaliz