0

I am running a Python program to process tab-delimited text data.

But it causes trouble because the data often contains Unicode characters such as U+001A, or the line-break characters listed at http://en.wikipedia.org/wiki/Newline#Unicode

(Worse, these characters are invisible unless the file is opened in Sublime Text; even Notepad++ doesn't show them.)

If the Python program is run on Linux it ignores such characters automatically, but on Windows it can't.

For example, if there is a U+001A in the file, the Python program will treat it as the end of the file.

For another example, if there is a U+0085 in the file, the Python program will treat it as the start of a new line.

So I just want a separate program that erases EVERY Unicode character that is not shown by ordinary file viewers like Notepad++ (and that program should work on Windows).

I do want to keep things like あ and ä. I only want to delete things like U+001A and U+0085, which are not visible in Notepad++.

How can this be achieved?

user1849133
    Have you tried opening in binary mode on Windows? – dawg Nov 06 '13 at 01:08
  • @drewk What does it mean to open in binary mode? Can a python program open a txt in "binary mode"? – user1849133 Nov 06 '13 at 01:10
    If you erase EVERY unicode character, do you also refer to code points below 128? That would erase quite a lot. – Hyperboreus Nov 06 '13 at 01:17
  • @Hyperboreus Thank you. So I just edited my question from "erasing all unicode" to "erasing characters that are not seen in ordinary openers like notepad++" – user1849133 Nov 06 '13 at 01:32
  • @user2604484 "Ordinary openers" is a wide term. Are you now referring to non-printing (control) characters? – Hyperboreus Nov 06 '13 at 01:33
  • @Hyperboreus By "ordinary opener" I meant things like Notepad++, rather than Sublime Text. I am sorry for the confusion. I would want to keep \t \r \n since these are essential for rendering the table of data correctly. But things like U+001A and U+0085 do not seem to affect what I see in Notepad++. They only hinder Python processing. So I want to erase all such things. – user1849133 Nov 06 '13 at 01:37

2 Answers

2

There is no such thing as a "Unicode character". A character is a character, and how it is encoded is a different matter. The capital letter "A" can be encoded in a lot of ways, among them UTF-8, EBCDIC, ASCII, etc.

If you want to delete every character that cannot be represented in ASCII, then you can use the following (Python 3):

a = 'aあäbc'
a.encode('ascii', 'ignore')

This will yield b'abc' (a bytes object; call .decode('ascii') on it to get a string back).

And if there really are U+001A, i.e. SUBSTITUTE, characters in your document, most probably something went haywire in a prior encoding step.
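Since the asker wants to keep あ and ä but drop invisible control characters, a sketch along the lines of the linked answer, using `unicodedata` to filter by Unicode category (the choice to keep tab/CR/LF and drop the "Cc" and "Cf" categories is an assumption based on the comments above):

```python
import unicodedata

def strip_controls(s):
    # Keep tab, CR and LF (needed for the tab-delimited table), but drop
    # every other code point in the "Cc" (control) or "Cf" (format)
    # Unicode categories, e.g. U+001A and U+0085.
    keep = {"\t", "\r", "\n"}
    return "".join(
        ch for ch in s
        if ch in keep or unicodedata.category(ch) not in ("Cc", "Cf")
    )

print(strip_controls("a\u001a\u0085b"))  # prints "ab"
```

Printable non-ASCII characters such as あ (category "Lo") and ä (category "Ll") pass through untouched.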

Hyperboreus
  • Thank you. So I just edited my question from "erasing all unicode" to "erasing characters that are not seen in ordinary openers like notepad++". Then is there a solution for my specific problem? Actually, I do want to keep things like あ and ä. I only want to delete things like U+001A and U+0085, which are not seen by Notepad++. – user1849133 Nov 06 '13 at 01:33
    Your answer is here: http://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python – Hyperboreus Nov 06 '13 at 01:35
0

Using unicodedata looks to be the best way to do it, as suggested by @Hyperboreus (Stripping non-printable characters from a string in Python), but as a quick hack you could do (in Python 2.x):

  1. Open the source file in binary mode. This prevents Windows from truncating the read when it finds the SUB (Ctrl-Z, U+001A) control character:

    my_file = open("filename.txt", "rb")
    
  2. Decode the file (this assumes the encoding was UTF-8):

    my_str = my_file.read().decode("UTF-8")
    
  3. Replace known "bad" code points:

    my_str = my_str.replace(u"\u001A", u"")
    

You could skip step 2 and replace the UTF-8-encoded value of each "bad" code point in step 3, for example \x1A, but the method above also handles UTF-16/32 sources if required.
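The three steps above can be combined into one sketch (written in Python 3 for comparison, where a file opened in "rb" mode yields bytes; the filename, the UTF-8 encoding, and the exact list of "bad" code points are assumptions):

```python
# Code points that disturb text-mode processing on Windows: SUB (EOF
# marker), NEL, and the Unicode LINE/PARAGRAPH SEPARATORs.
BAD_CODE_POINTS = ["\u001a", "\u0085", "\u2028", "\u2029"]

def clean(raw_bytes, encoding="utf-8"):
    # Step 2: decode the raw bytes into a str.
    text = raw_bytes.decode(encoding)
    # Step 3: drop each known-bad code point. str.replace returns a new
    # string, so the result must be reassigned.
    for cp in BAD_CODE_POINTS:
        text = text.replace(cp, "")
    return text

# Step 1: binary mode ("rb") keeps Windows from treating U+001A as EOF.
# with open("filename.txt", "rb") as f:
#     cleaned = clean(f.read())
```

Tabs, newlines, and printable non-ASCII characters like あ and ä are left alone, so the tab-delimited structure survives.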

Alastair McCormack