I am trying to extract text from documents using OCR. Sometimes the OCR inserts a space between characters, which becomes a problem when the affected text is an email address.
For example, Email:adil.idris 2019@gmail.com ------> adil.idris2019@gmail.com
I am trying to use regex to solve this issue.
import re
txt = "wldeub 777-29378-88 @@ Email:adil.idris 2019@gmail.com dfhdu fdlfkos"
txt1 = "Email:michael wobbly@gmail.com 777-123-0000"
txt2 = "Email: john_jebrasky@ gmail.com TX, USA"
txt3 = "john_jebrasky @gmail.com TX, USA"
txt4 = "I am proficient in python. geekcoder 12@gmail.com TX, USA"
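# Grab everything from the "Email" label up to the last "com", then strip the spaces and the label.
out = re.search(r"Email:?.+com", txt)
print(re.sub(r"Email:", "", re.sub(" ", "", out.group(0))))  # adil.idris2019@gmail.com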
Unfortunately, this is just a hardcoded fix. Note that in some cases the word Email: might not be present as a prefix before the address. What if there is no Email: prefix, or the text does not follow any standard pattern?
Ideal output: "adil.idris2019@gmail.com"
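To make the no-prefix case concrete, here is a minimal sketch of one possible heuristic, not a definitive solution. It anchors on the "@" instead of the Email: label, tolerates a single stray space on either side of the "@", and glues the preceding token onto the local part only when the Email: label is present or when the fragment right before the "@" is digits only (as in "geekcoder 12@gmail.com"). The names EMAIL_RE and extract_email are hypothetical, and the joining rule is an assumption that can misfire on ordinary sentences such as "call 12@example.com".

```python
import re

# Hypothetical pattern: optional label, optional earlier fragment of the
# local part, the fragment right before "@", then the domain.
EMAIL_RE = re.compile(
    r"""
    (?:(?P<label>e-?mail)\s*:\s*)?        # optional "Email:" label
    (?P<head>[\w.+-]+\s)?                 # token before a stray space, if any
    (?P<tail>[\w.+-]+)                    # fragment directly before the "@"
    \s?@\s?                               # "@" with at most one space each side
    (?P<domain>[\w-]+(?:\.[\w-]+)+)       # e.g. gmail.com
    """,
    re.VERBOSE | re.IGNORECASE,
)


def extract_email(text):
    """Return the first repaired email address in text, or None."""
    m = EMAIL_RE.search(text)
    if m is None:
        return None
    head = (m.group("head") or "").strip()
    tail = m.group("tail")
    # Assumption: only join the previous token when the label is present
    # or the fragment before "@" is purely numeric.
    if head and (m.group("label") or tail.isdigit()):
        local = head + tail
    else:
        local = tail
    return f"{local}@{m.group('domain')}"


samples = [
    "wldeub 777-29378-88 @@ Email:adil.idris 2019@gmail.com dfhdu fdlfkos",
    "Email:michael wobbly@gmail.com 777-123-0000",
    "Email: john_jebrasky@ gmail.com TX, USA",
    "john_jebrasky @gmail.com TX, USA",
    "I am proficient in python. geekcoder 12@gmail.com TX, USA",
]
for s in samples:
    print(extract_email(s))
```

On the five sample strings above this prints adil.idris2019@gmail.com, michaelwobbly@gmail.com, john_jebrasky@gmail.com, john_jebrasky@gmail.com and geekcoder12@gmail.com. It only handles a single inserted space in the local part and does not cover spaces inside the domain, so the general question still stands.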