0

I am running a Python program to process tab-delimited text data.

But it causes trouble because the data often contains Unicode characters such as U+001A, or the line-break characters listed at http://en.wikipedia.org/wiki/Newline#Unicode

(Worse, these characters are invisible unless the file is opened in Sublime Text; even Notepad++ doesn't show them.)

If the Python program is run on Linux it ignores such characters automatically, but on Windows it can't.

For example, if there is a U+001A in the file, the Python program will treat it as the end of the file.

For another example, if there is a U+0085 in the file, the Python program will treat it as the start of a new line.

So I just want a separate program that erases EVERY Unicode character that is not shown by ordinary file viewers like Notepad++ (and that program should work on Windows).

I do want to keep things like あ and ä. I only want to delete things like U+001A and U+0085, which are not visible in Notepad++.

How can this be achieved?

user1849133
    Have you tried opening in binary mode on Windows? – dawg Nov 06 '13 at 01:08
  • @drewk What does it mean to open in binary mode? Can a python program open a txt in "binary mode"? – user1849133 Nov 06 '13 at 01:10
    If you erase EVERY unicode character, do you also refer to code points below 128? That would erase quite a lot. – Hyperboreus Nov 06 '13 at 01:17
  • @Hyperboreus Thank you. So I just edited my question from "erasing all unicode" to "erasing characters that are not seen in ordinary openers like notepad++" – user1849133 Nov 06 '13 at 01:32
  • @user2604484 "Ordinary openers" is a wide term. Are you now referring to non-printing (control) characters? – Hyperboreus Nov 06 '13 at 01:33
  • @Hyperboreus By "ordinary opener" I meant things like Notepad++, rather than Sublime Text. I am sorry for the confusion. I would want to keep \t \r \n since these are essential for rendering the table of data correctly. But things like U+001A and U+0085 do not seem to affect what I see in Notepad++. They only hinder Python processing. So I want to erase all such things. – user1849133 Nov 06 '13 at 01:37

2 Answers

2

There is no such thing as a "Unicode character". A character is a character, and how it is encoded is a different matter. The capital letter "A" can be encoded in a lot of ways, among them UTF-8, EBCDIC, ASCII, etc.

If you want to delete every character that cannot be represented in ASCII, then you can use the following (Python 3):

a = 'aあäbc'
a.encode('ascii', 'ignore')

This will yield b'abc' (a bytes object; call .decode('ascii') on it to get a string back).

And if there really are U+001A, i.e. SUBSTITUTE, characters in your document, most probably something went haywire in a prior encoding step.
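Since the asker wants to keep あ and ä but drop invisible control characters, a sketch along the lines of the linked answer, using `unicodedata` to filter by Unicode category (the choice to keep tab/CR/LF and drop the "Cc" and "Cf" categories is an assumption based on the comments above):

```python
import unicodedata

def strip_controls(s):
    # Keep tab, CR and LF (needed for the tab-delimited table), but drop
    # every other code point in the "Cc" (control) or "Cf" (format)
    # Unicode categories, e.g. U+001A and U+0085.
    keep = {"\t", "\r", "\n"}
    return "".join(
        ch for ch in s
        if ch in keep or unicodedata.category(ch) not in ("Cc", "Cf")
    )

print(strip_controls("a\u001a\u0085b"))  # prints "ab"
```

Printable non-ASCII characters such as あ (category "Lo") and ä (category "Ll") pass through untouched.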

Hyperboreus
  • Thank you. So I just edited my question from "erasing all unicode" to "erasing characters that are not seen in ordinary openers like notepad++". Then is there a solution for my specific problem? Actually, I do want to keep things like あ and ä. I only want to delete things like U+001A and U+0085, which are not seen by Notepad++. – user1849133 Nov 06 '13 at 01:33
    Your answer is here: http://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python – Hyperboreus Nov 06 '13 at 01:35
0

Using unicodedata looks to be the best way to do it, as suggested by @Hyperboreus (Stripping non-printable characters from a string in Python), but as a quick hack you could do (in Python 2.x):

  1. Open the source file in binary mode. This prevents Windows from truncating the read when it finds the SUB (Ctrl-Z, U+001A) control character:

    my_file = open("filename.txt", "rb")
    
  2. Decode the file (this assumes the encoding was UTF-8):

    my_str = my_file.read().decode("UTF-8")
    
  3. Replace known "bad" code points:

    my_str = my_str.replace(u"\u001A", u"")
    

You could skip step 2 and replace the UTF-8-encoded value of each "bad" code point in step 3, for example \x1A, but the method above also handles UTF-16/32 sources if required.
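The three steps above can be combined into one sketch (written in Python 3 for comparison, where a file opened in "rb" mode yields bytes; the filename, the UTF-8 encoding, and the exact list of "bad" code points are assumptions):

```python
# Code points that disturb text-mode processing on Windows: SUB (EOF
# marker), NEL, and the Unicode LINE/PARAGRAPH SEPARATORs.
BAD_CODE_POINTS = ["\u001a", "\u0085", "\u2028", "\u2029"]

def clean(raw_bytes, encoding="utf-8"):
    # Step 2: decode the raw bytes into a str.
    text = raw_bytes.decode(encoding)
    # Step 3: drop each known-bad code point. str.replace returns a new
    # string, so the result must be reassigned.
    for cp in BAD_CODE_POINTS:
        text = text.replace(cp, "")
    return text

# Step 1: binary mode ("rb") keeps Windows from treating U+001A as EOF.
# with open("filename.txt", "rb") as f:
#     cleaned = clean(f.read())
```

Tabs, newlines, and printable non-ASCII characters like あ and ä are left alone, so the tab-delimited structure survives.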

Alastair McCormack