A hidden set of Unicode characters appears in a string and needs to be removed.

I have a very large text extracted from a PDF file using the PyPDF2 package. The extracted text has a lot of issues (for example, text from structured tables in the PDF appears in random order), and many special characters are embedded in it (such as ~~~~~~~, }}}}}}}}, etc.) even though they are not visible when the file is viewed as a PDF. I tried removing those characters using the solutions described in this, this and this link, but the problem persists.

myText = "There is a set of hidden character here => <= but it will get printed in console"

print(myText)

Now I would like to have a clean text without those hidden characters.

mdowes
  • In order to get the hidden character between => and = – mdowes Jan 30 '19 at 09:19
  • What was the result of doing this: `print(repr(s.encode('ascii', 'ignore')))`? (from one of the links) – SimonF Jan 30 '19 at 09:34
  • This is the result `b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'` – mdowes Jan 30 '19 at 09:38
  • Is that the only hidden character you have problems with? – SimonF Jan 30 '19 at 09:44
  • Right now, yes. If this gets solved maybe I can find my way for other special characters. – mdowes Jan 30 '19 at 09:46
  • See https://en.wikipedia.org/wiki/C0_and_C1_control_codes for a list of the control characters (ASCII and Unicode). Note about C1: those are Unicode code points, not UTF-8 encoded bytes. – Giacomo Catenazzi Jan 31 '19 at 15:09
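
To illustrate the last comment's point about C0 and C1 controls, here is a small sketch using only the standard library. The variable names are made up for this example. It shows that DEL and a C1 control both fall in the Unicode category "Cc", and that a C1 code point is one character in a str but two bytes once encoded as UTF-8.

```python
import unicodedata

# DEL (U+007F) sits just past the printable ASCII range; its category is "Cc" (control).
del_category = unicodedata.category("\x7f")
print(del_category)

# U+0085 (NEL) is a C1 control: a single code point in a str...
nel = "\x85"
nel_category = unicodedata.category(nel)
print(len(nel), nel_category)

# ...but two bytes in UTF-8, so checking for a raw 0x85 byte would not be equivalent.
nel_bytes = nel.encode("utf-8")
print(nel_bytes)
```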

1 Answer

The character \x7f is the ASCII control character DEL. It is a valid ASCII character, so encoding with 'ascii' and 'ignore' does not drop it, which explains why your attempts did not work. See here for the bytes.decode documentation.

To remove all "special" ASCII characters, use this code:

import string
a = b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'
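print(repr(a))  # original bytes, with the DEL characters visible as \x7f
# Decode to str (dropping non-ASCII bytes), then keep only characters in string.printable
print(repr(''.join(i for i in a.decode('ascii', 'ignore') if i in string.printable)))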

Or this, if you don't want to import string:

a = b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'
print(repr(a))
# Keep printable ASCII (code points 32 to 126) plus CR and LF; note that tabs are dropped here
print(repr(''.join(i for i in a.decode('ascii', 'ignore') if 31 < ord(i) < 127 or i in '\r\n')))
SimonF
  • It still doesn't work for me as in my case the string has a character 'b' prepended before it. I would suggest modifying your string to `a = b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'` – mdowes Jan 30 '19 at 10:03
  • @mdowes Then you have a bytestring, not a string. I updated the code to deal with that. – SimonF Jan 30 '19 at 10:10
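
Following up on the bytes versus str distinction in the comments: if the extracted text is already a str (an assumption here, not something the thread confirms for every PyPDF2 version), the filtering can be done without any encode/decode round trip. The sketch below is one possible variant, not the answer's method. The helper name strip_control_chars and its keep parameter are invented for illustration. It drops every character whose Unicode category is "Cc", which covers DEL and the C0/C1 controls, while leaving non-ASCII letters intact.

```python
import unicodedata

def strip_control_chars(text, keep="\n\r\t"):
    """Return text without Unicode control characters (category Cc).

    Characters listed in keep (newline, carriage return, tab by default)
    are preserved even though they are control characters.
    """
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )

# Hypothetical sample resembling the string from the question.
my_text = "hidden character here =>" + "\x7f" * 24 + " <= printed in console\n"
print(repr(strip_control_chars(my_text)))
```

Unlike the ASCII-only filters, this keeps accented and other non-ASCII text, which may or may not be what you want for PDF output; visible junk such as runs of ~ or } is printable and would need separate handling.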