10

Sometimes I have evil non-printable characters in the middle of a string. These strings are user input, so I must make my program handle them well instead of trying to fix the source of the problem.

For example, a string can have a zero-width no-break space in the middle of it. While parsing a .po file, one problematic part was the string "he is a man of god" in the middle of the file. While everything seems correct, inspecting it with irb shows an unexpected code point 65279 (U+FEFF) just before "man":

 "he is a man of god".codepoints
 => [104, 101, 32, 105, 115, 32, 97, 32, 65279, 109, 97, 110, 32, 111, 102, 32, 103, 111, 100] 

I believe that I know what a BOM is, and I even handle it nicely. However, sometimes such characters appear in the middle of the file, where they are not a BOM.

My current approach is to remove every character that I have found to be evil, in a really smelly fashion:

text = (text.codepoints - CODEPOINTS_BlACKLIST).pack("U*")

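The closest I got was following this post, which led me to the `[[:print:]]` bracket expression in regexps. However, it did not help, because the zero-width no-break space still matches (the string below is U+FEFF followed by "m"):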

"m".scan(/[[:print:]]/).join.codepoints
 => [65279, 109] 

So the question is: how can I remove all non-printable characters from a string in Ruby?

fotanus
  • It'd help a lot if you showed more source and sample strings with the characters you're trying to handle. The current sample doesn't help much when trying to determine the codeset or what other values you're encountering. – the Tin Man May 13 '13 at 20:04
  • @theTinMan Thanks, I edited the question with a bit more detail. The charset is UTF-8, I believe, but I don't always have that information; I get many files without a BOM. I suppose this one is at least partially Unicode, judging by the Chinese translation. – fotanus May 13 '13 at 20:16
  • Ruby has a method on String called `dump` which produces a new string with non-printing characters replaced by escape sequences and special characters escaped. Docs for [String#dump](https://ruby-doc.org/core-2.3.0/String.html#method-i-dump) are for Ruby 2.3.0, but I can confirm it is in the docs as early as 1.8.7. – Aaron Nov 18 '16 at 14:55

3 Answers

20

Try this, which replaces every character outside `[[:print:]]` with a dot:

>>"aaa\f\d\x00abcd".gsub(/[^[:print:]]/,'.')
=>"aaa.d.abcd"
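As the question shows, U+FEFF still matches `[[:print:]]`, so the pattern above leaves it in place. A minimal sketch of one way to extend it, assuming that every character in the Unicode "Format" general category (`\p{Cf}`) is unwanted; the constant and method names are hypothetical:

```ruby
# Assumption: anything outside [[:print:]] plus anything in the Unicode
# "Format" category (Cf, which includes U+FEFF) should be removed.
NON_PRINTABLE = /[^[:print:]]|\p{Cf}/

# Hypothetical helper: returns a copy of text without those characters.
def strip_non_printable(text)
  text.gsub(NON_PRINTABLE, "")
end

cleaned = strip_non_printable("he is a \uFEFFman\u0000 of god")
p cleaned.codepoints.include?(0xFEFF)  # => false
```

Two caveats: `\p{Cf}` also covers characters such as the zero-width joiner, which some scripts and emoji sequences rely on, and the `[^[:print:]]` part should also strip control characters such as tabs and newlines, so this may be too aggressive for multi-line text.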
snowytoxa
1

Ruby can help you convert from one multi-byte character set to another. Check into the search results, plus read up on Ruby String's encode method.

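Also, Ruby's Iconv is your friend, although it was removed from the standard library in Ruby 2.0 and now lives in the separate `iconv` gem.

Finally, James Gray wrote a series of articles which cover this in good detail.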

One of the things you can do using those tools is to tell them to transcode to a visually similar character, or ignore them completely.
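A rough sketch of the "ignore or replace" option using core `String` methods. The helper names are hypothetical, and `String#scrub` needs Ruby 2.1 or later:

```ruby
# Hypothetical helper names; the String methods themselves are core Ruby.

# Drop byte sequences that are invalid in the string's own encoding.
def drop_invalid_bytes(text)
  text.scrub("")
end

# Transcode to ASCII, substituting a replacement for anything that is
# invalid or has no ASCII equivalent.
def to_ascii(text, replacement = "?")
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: replacement)
end

p drop_invalid_bytes("abc\xFFdef")  # => "abcdef"
p to_ascii("caf\u00E9")             # => "caf?"
```

Note that these only deal with invalid bytes and untranscodable characters. A valid but invisible character such as U+FEFF survives `scrub`, and transcoding to ASCII with an empty replacement would drop it only at the cost of dropping every other non-ASCII character too, which is no good for files containing, say, Chinese translations.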

Dealing with alternate character sets is one of the most... irritating things I've ever had to do, because files can contain anything, but be marked as text. You might not expect it and then your code dies or starts throwing errors, because people are so ingenious when coming up with ways to insert alternate characters into content.

the Tin Man
  • Gave up... I think that there isn't a better way to handle malformed files. However, I'm accepting your answer because it is a good guideline for people who end up here with well-formed files. – fotanus May 17 '13 at 21:20
  • None of these links are functional now :( – Luna Lovegood Apr 25 '19 at 14:16
  • @Surya, Thanks, yes a couple were broken, but not *all*. The SO way is to help maintain the site. You're empowered to help by submitting edits to fix problems such as broken links. See "[How do suggested edits work](http://meta.stackexchange.com/questions/76251/how-do-suggested-edits-work)". – the Tin Man May 22 '19 at 23:15
  • Thank you for bringing this feature to my attention. – Luna Lovegood May 23 '19 at 04:37
0

Codepoint 65279 is a zero-width no-break space. It is commonly used as a byte-order mark (BOM).

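You can remove it from a string with the following (using `gsub` rather than `gsub!`, because `gsub!` returns `nil` when nothing was replaced):

my_new_string = my_old_string.gsub("\xEF\xBB\xBF".force_encoding("UTF-8"), '')

A quick way to check whether you have any invisible characters is to check the length of the string: if it is greater than the number of characters you can see in IRB, you do.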
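Counting by eye gets tedious for longer strings, so here is a small sketch that reports where the suspicious characters are. It assumes "invisible" means non-printable or in the Unicode "Format" category, the names are hypothetical, and `match?` needs Ruby 2.4 or later:

```ruby
# Assumption: "invisible" means outside [[:print:]] or in category Cf.
INVISIBLE = /[^[:print:]]|\p{Cf}/

# Hypothetical helper: returns [index, codepoint] pairs for each
# invisible character found in text.
def invisible_chars(text)
  text.each_char.with_index
      .select { |char, _index| char.match?(INVISIBLE) }
      .map { |char, index| [index, char.ord] }
end

p invisible_chars("he is a \uFEFFman of god")  # => [[8, 65279]]
```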

Jamie Buchanan