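import re

# `stopwords` and `emoji_pattern` are assumed to be defined elsewhere.
def _clean(text):
    text = text.lower()
    # Strip URLs first, while ':', '/' and '.' are still present to match on
    text = re.sub(r'https?://\S+', '', text)
    # Remove the retweet marker only as a whole word (a bare 'rt' also eats 'birthday')
    text = re.sub(r'\brt\b', '', text)
    text = re.sub(r'&amp;', '&', text)
    text = re.sub(r'[?!.;:,#@-]', '', text)
    # The text is already lowercased, so match the <U+XXXX> escapes case-insensitively
    text = re.sub(r"[$&+,:;=?#]|[0-9]|<ed>|<u\+[a-z0-9]+>", "", text, flags=re.IGNORECASE)
    text = re.sub(r"<+[a-z0-9]+>", "", text, flags=re.IGNORECASE)
    # Drop any slashes left over after URL removal
    text = re.sub(r'/+', '', text)
    text = re.sub(r'\ ã°â*', '', text)
    words = text.split()
    words = [w for w in words if w not in stopwords]
    text = " ".join(words)
    text = emoji_pattern.sub(r'', text)
    return text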

I have used the code above so far, but I don't know how to clean text like this:

happy bihday last friday night (tgif) ððððð last friday night tgif ff â¦

Rajesh

1 Answer


You can remove all non-ASCII characters using the following regex:

text = re.sub(r'[^\x00-\x7F]+', '', text)

See also this question: Replace non-ASCII characters with a single space
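As a minimal sketch of the regex above applied to the sample from the question: the helper name `strip_non_ascii` and its `replacement` parameter are hypothetical, and the whitespace collapsing is an extra step that is not part of the original one-liner.

```python
import re

# Sample text from the question; the garbled characters are all non-ASCII.
sample = "happy bihday last friday night (tgif) ððððð last friday night tgif ff â¦"

def strip_non_ascii(text, replacement=''):
    """Replace each run of non-ASCII characters, then tidy up whitespace."""
    text = re.sub(r'[^\x00-\x7F]+', replacement, text)
    # Extra step (an assumption, not in the answer): collapse leftover spaces.
    return re.sub(r'\s+', ' ', text).strip()

print(strip_non_ascii(sample))
# happy bihday last friday night (tgif) last friday night tgif ff
```

Passing `replacement=' '` instead substitutes a single space for each run, which keeps words apart when the non-ASCII characters sit between them with no surrounding spaces.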

jubnzv