
I am trying to extract text from documents using OCR. Sometimes the OCR inserts a space between characters, which becomes a problem when the text is an email address.

For example: Email:adil.idris 2019@gmail.com ------> adil.idris2019@gmail.com

I am trying to use regex to solve this issue.

import re
txt = "wldeub 777-29378-88 @@ Email:adil.idris 2019@gmail.com dfhdu fdlfkos"
txt1 = "Email:michael wobbly@gmail.com 777-123-0000"
txt2 = "Email: john_jebrasky@ gmail.com TX, USA"
txt3 = "john_jebrasky @gmail.com TX, USA"
txt4 = "I am proficient in python. geekcoder 12@gmail.com TX, USA"

# Grab everything from "Email" up to the last "com", then strip the spaces and the prefix.
out = re.search(r"Email:?.+com", txt)
re.sub(r"Email:", "", re.sub(" ", "", out.group(0)))

Unfortunately, this is just a hardcoded fix. Note that in some cases the prefix Email: might not be present before the address. What if there is no Email: prefix, or the text does not follow any standard pattern?

Ideal output: "adil.idris2019@gmail.com"

  • With regular expressions you cannot distinguish between "character rubbish" and "meaningful content", so `dfhdu` and `com` are treated the same. Provide more text snippets so that a pattern can eventually be seen. – Jan Feb 28 '21 at 17:17
  • I have updated the question with more examples. – Deepak Sharma Feb 28 '21 at 17:48

1 Answer


It's a bit of a difficult problem to solve with only regular expressions. You can try separating it into two steps.

First, get a (very) rough estimate of what might be an email:

((?:[^\"@:+={}()|\s]+ ?){1,3}\@ ?\w+(?: ?\. ?\w+)+)

where [^\"@:+={}()|\s] is a complete shot in the dark, but these are characters that I doubt will suddenly pop up as false positives in OCR. The pattern tries to match 1 to 3 ({1,3}) blocks of text, each optionally followed by a single space ((?:[...]+ ?)), that don't include any of the characters listed in the negated class and that come before an @. It then tries to match a sequence of domain names and their extensions (.co.uk, .com), possibly separated by spaces ( ?).
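As a quick illustration of this first step, here is a minimal sketch that runs the pattern above over some of the sample strings from the question. The names ROUGH_EMAIL, samples, and candidates are made up for this example; the raw matches still contain the stray OCR spaces at this point.

```python
import re

# The rough candidate pattern from this answer, unchanged.
ROUGH_EMAIL = re.compile(r'((?:[^\"@:+={}()|\s]+ ?){1,3}\@ ?\w+(?: ?\. ?\w+)+)')

# Sample strings taken from the question.
samples = [
    "wldeub 777-29378-88 @@ Email:adil.idris 2019@gmail.com dfhdu fdlfkos",
    "Email:michael wobbly@gmail.com 777-123-0000",
    "Email: john_jebrasky@ gmail.com TX, USA",
    "john_jebrasky @gmail.com TX, USA",
]

# With a single capturing group, findall returns that group's text per match.
candidates = [ROUGH_EMAIL.findall(s) for s in samples]

for found in candidates:
    print(found)
```

The Email: prefix is never swallowed here because : is in the excluded character class, so a block cannot run across it.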

Then you can remove all the whitespace from the matched sequences, and check if they're a valid email address with a proper library/regex: How to check for valid email address?
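Both steps together might look roughly like the sketch below. The function name extract_emails and the SIMPLE_VALID check are assumptions made for illustration; SIMPLE_VALID is a deliberately loose sanity check, not a full email validator, and a dedicated validation library could be swapped in instead.

```python
import re

# Step 1: the rough candidate pattern from this answer.
ROUGH_EMAIL = re.compile(r'((?:[^\"@:+={}()|\s]+ ?){1,3}\@ ?\w+(?: ?\. ?\w+)+)')

# Step 2: a loose sanity check (an assumption for this sketch, not RFC-complete):
# something@label.label with no whitespace and exactly one "@".
SIMPLE_VALID = re.compile(r"[^@\s]+@[^@\s.]+(?:\.[^@\s.]+)+")


def extract_emails(text):
    """Return cleaned-up email candidates found in OCR text."""
    emails = []
    for candidate in ROUGH_EMAIL.findall(text):
        # Drop the spaces that OCR may have inserted inside the address.
        cleaned = re.sub(r"\s+", "", candidate)
        if SIMPLE_VALID.fullmatch(cleaned):
            emails.append(cleaned)
    return emails


print(extract_emails("wldeub 777-29378-88 @@ Email:adil.idris 2019@gmail.com dfhdu fdlfkos"))
print(extract_emails("Email: john_jebrasky@ gmail.com TX, USA"))
```

Because the rough pattern allows up to three blocks before the @, ordinary words directly in front of the address can be pulled into the match, so the result may still need a sanity check against the surrounding text.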

Not the most clean solution, but I hope it helps.

Edit

I see that you're using a capturing group now; that might explain why it didn't work for you if you tried it. It should be fixed now.

RegexR example

zoharcochavi
  • In the example I gave, txt4 = "I am proficient in python. geekcoder 12@gmail.com TX, USA", the ideal email ID is: geekcoder 12@gmail.com, but I assume you took it to be python.geekcoder12@gmail.com. There will be only one space between the word and "@###.co##". – Deepak Sharma Mar 08 '21 at 15:20