
I am pulling tweets in Python using tweepy. It gives all the data as type unicode, e.g. `print type(data)` gives me `<type 'unicode'>`.

The data contains unicode characters, e.g.: hello\u2026 im am fine\u2019s

I want to remove all of these unicode characters. Is there a regular expression I can use? `str.replace` isn't a viable option, since the unicode characters can be any value, from smileys to unicode apostrophes.
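A minimal sketch of such a regex (assuming Python 3, and that "remove" means stripping every non-ASCII codepoint outright):

```python
import re

text = "hello\u2026 im am fine\u2019s"
# Strip every codepoint outside the ASCII range (U+0000 to U+007F)
clean = re.sub(r"[^\x00-\x7f]", "", text)
print(clean)  # -> "hello im am fines"
```

This drops the ellipsis and the curly apostrophe entirely; the normalization approach in the answer below behaves slightly differently for accented letters.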

ashish1512

1 Answer

In [10]: from unicodedata import normalize

In [11]: out_text = normalize('NFKD', input_text).encode('ascii','ignore')

Try this.

Edit

Actually, `normalize(form, unistr)` returns the normal form `form` for the Unicode string `unistr`. Valid values for `form` are 'NFC', 'NFKC', 'NFD', and 'NFKD'. If you want to know more about NFKD, see the reference link below.

In [12]: u = unichr(40960) + u'abcd' + unichr(1972)
In [13]: u.encode('utf-8')
Out[13]: '\xea\x80\x80abcd\xde\xb4'
In [14]: u
Out[14]: u'\ua000abcd\u07b4'
In [16]: u.encode('ascii', 'ignore')
Out[16]: 'abcd'

The code above shows what `encode('ascii', 'ignore')` does: any codepoint that cannot be represented in ASCII is silently dropped.
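To see why the `normalize('NFKD', ...)` step matters before encoding, here is a sketch with a hypothetical string containing an accented letter (not from the original question):

```python
from unicodedata import normalize

s = "caf\u00e9\u2019s"  # "café’s"

# Without normalizing, the accented letter is lost entirely
print(s.encode('ascii', 'ignore'))                     # b"cafs"

# NFKD decomposes é into e + combining accent, so the base letter survives;
# the combining mark and the curly apostrophe are still dropped
print(normalize('NFKD', s).encode('ascii', 'ignore'))  # b"cafes"
```

Whether that behavior is desirable depends on the goal: it keeps more readable text, but it is not a pure "delete every non-ASCII codepoint" operation.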

Ref : https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize

Rahul K P
  • data={ "text":"RT @peddoc63: Looks like Trump's anti-establishment and self-funding days are over. Well that took less than 12 hours\ud83d\ude44 https:\/\/t.co\/W7zaUK8\u2026" } Above is a sample. When I do `out_text = normalize('NFKD', data).encode('ascii','ignore')` it gives me the error 'ascii' codec can't encode characters in position 119-120 – ashish1512 May 05 '16 at 07:54
  • @ashish1512: why are you passing a whole dictionary into `normalize()`? – Martijn Pieters May 05 '16 at 07:59
  • It's not exactly a dictionary. When I do `print type(data)`, it gives me – ashish1512 May 05 '16 at 08:01
  • @RahulKP: please do *explain* what the code does. You are not sticking to the letter of the question here; it is nice that your method retains letters without accents, etc. but the OP simply asked to remove any non-ASCII codepoint. – Martijn Pieters May 05 '16 at 08:01
  • @ashish1512: that isn't clear from your comment. You appear to have **undecoded JSON** in that case. In undecoded JSON, `\uhhhh` escape sequences are **not Unicode codepoints** (*yet*). Decode your JSON to Python first. – Martijn Pieters May 05 '16 at 08:07
  • @ashish1512 try `data['text']` instead of `data`. – Rahul K P May 05 '16 at 08:22
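Following the point in the comments about undecoded JSON, a minimal sketch (Python 3; the JSON string here is abbreviated from the sample in the comments) of decoding the JSON before normalizing:

```python
import json
from unicodedata import normalize

# Undecoded JSON: the \uXXXX sequences are still escape text, not codepoints
raw = '{"text": "took less than 12 hours\\ud83d\\ude44 https:\\/\\/t.co\\/W7zaUK8\\u2026"}'

data = json.loads(raw)  # escapes become real codepoints (emoji, ellipsis)
clean = normalize('NFKD', data['text']).encode('ascii', 'ignore').decode('ascii')
print(clean)  # the emoji is dropped; NFKD turns the \u2026 ellipsis into "..."
```

Note that `json.loads` also joins the `\ud83d\ude44` surrogate pair into a single emoji codepoint, which is why passing the raw string to `normalize` directly would fail.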