Set of hidden unicode characters in a string

Question

A set of hidden Unicode characters appears in a string and needs to be removed.

I have a very large text extracted from a PDF file using the PyPDF2 package. The extracted text has many issues (for example, text from structured tables in the PDF appears in random order) and many special characters embedded in it (such as ~~~~~~~ and }}}}}}}}), although those characters are not visible when the file is viewed as a PDF. I tried removing them using the solutions described in this, this and this link, but the problem persists.

myText = "There is a set of hidden character here => <= but it will get printed in console"

print(myText)

Now I would like to have a clean text without those hidden characters.

What was the result of doing this: `print(repr(s.encode('ascii', 'ignore')))`? (from one of the links) — SimonF, Jan 30 '19 at 09:34
This is the result `b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'` — mdowes, Jan 30 '19 at 09:38
Right now, yes. If this gets solved maybe I can find my way for other special characters. — mdowes, Jan 30 '19 at 09:46
See https://en.wikipedia.org/wiki/C0_and_C1_control_codes for a list of the control characters (ASCII and Unicode). Note about C1: those are Unicode code points, not UTF-8 encoded bytes. — Giacomo Catenazzi, Jan 31 '19 at 15:09
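
To see which hidden characters a string actually contains, in the spirit of the `repr` check suggested in the comments above, a minimal sketch like the following lists every non-printable character with its position and code point. The helper name `find_hidden` and the sample string are hypothetical and not part of the original thread.

```python
import unicodedata

def find_hidden(text):
    """Return (index, code point, name) for each non-printable character in text."""
    hits = []
    for index, ch in enumerate(text):
        # str.isprintable() is False for control and format characters,
        # and for separators other than the ASCII space
        if not ch.isprintable():
            # Control characters such as DEL have no Unicode name, so pass a default
            name = unicodedata.name(ch, "<unnamed control character>")
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits

# Hypothetical sample: two DEL characters and one zero-width space
sample = "here =>\x7f\x7f<= and a zero-width space\u200b"
for hit in find_hidden(sample):
    print(hit)
```

Note that this also reports ordinary newlines and tabs, since they are not printable in this sense either.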

SimonF · Accepted Answer · 2019-01-30T10:09:47.030

3

The character \x7f is the ASCII character DEL. Because DEL is itself a valid ASCII character, encoding to ASCII with 'ignore' does not drop it, which explains why your attempts did not work. To remove all "special" ASCII characters, use this code:

See here for the bytes.decode documentation.

import string
a = b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'
print(repr(a))
# Decode to str, then keep only the characters listed in string.printable
print(repr(''.join(i for i in a.decode('ascii', 'ignore') if i in string.printable)))

or this if you don't want to import string:

a = b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'
print(repr(a))
# Keep printable ASCII (code points 32 to 126) plus carriage return and newline
print(repr(''.join(i for i in a.decode('ascii', 'ignore') if 31 < ord(i) < 127 or i in '\r\n')))
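
Both snippets above keep only ASCII, so any legitimate non-ASCII text is dropped as well. If that matters, one possible variation (not from the original answer) is to filter by Unicode category instead, removing only control characters (category `Cc`, which covers the C0 range, DEL, and the C1 range). The function name `strip_control_chars`, the `keep` parameter, and the UTF-8 assumption for byte input are all illustrative choices.

```python
import unicodedata

def strip_control_chars(text, keep="\r\n\t"):
    """Drop Unicode control characters (category Cc) except those listed in keep."""
    if isinstance(text, bytes):
        # Assumption: byte input is UTF-8; undecodable bytes are silently dropped
        text = text.decode("utf-8", "ignore")
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )

# Hypothetical sample: DEL characters are removed, the accented letter stays
my_text = "hidden here =>\x7f\x7f\x7f<= and caf\u00e9 stays"
print(repr(strip_control_chars(my_text)))
```

This leaves format characters such as zero-width spaces (category `Cf`) in place; whether those should also be removed depends on the text being cleaned.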

edited Jan 30 '19 at 10:09

answered Jan 30 '19 at 09:49

SimonF


It still doesn't work for me as in my case the string has a character 'b' prepended before it. I would suggest modifying your string to `a = b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'` – mdowes Jan 30 '19 at 10:03
@mdowes Then you have a bytestring, not a string. I updated the code to deal with that. – SimonF Jan 30 '19 at 10:10
