I'm attempting to write a small program reversing the order of characters in a text file. It works, but it treats apostrophes and other special characters strangely.

Here's my code:

ifstream ifs {name};
if(!ifs) throw runtime_error("Couldn't open input file.");

ofstream ofs{"output.txt"};
if(!ofs) throw runtime_error("Couldn't open output file.");

string s;
for(char ch; ifs.get(ch);)
    s.push_back(ch);

reverse(s.begin(), s.end());
for(char ch: s)
    ofs << ch;

Example Input:

And—which is more—you’ll be a Man, my son!

Example Output:

!nos ym ,naM a eb llôÄ‚uoyîÄ‚erom si hcihwîÄ‚dnA    
Remy Lebeau
  • It looks like those are Unicode characters. Does this answer your question? [Writing Unicode to a file in C++](https://stackoverflow.com/questions/15911141/writing-unicode-to-a-file-in-c) – scohe001 Jan 15 '20 at 22:51
  • Hi, perhaps your input is UTF-8 rather than ascii (the em dash character for instance). This might help https://stackoverflow.com/questions/4775437/read-unicode-utf-8-file-into-wstring – IronMan Jan 15 '20 at 22:51
  • I don't know if this would make a difference but try using a `wchar` (wide `char`) instead. – Luke Jan 15 '20 at 23:58

1 Answer

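Your input file is likely encoded in a multi-byte charset. At first glance it does not look like the usual UTF-8 mojibake: — is encoded in UTF-8 as bytes E2 80 94, which displays as â€” when misread as Windows-1252 (often loosely called Latin-1), and ’ is encoded in UTF-8 as bytes E2 80 99, which displays as â€™. That is not what you are seeing in your output. However, your output holds the bytes in reverse order and may be displayed in a different single-byte charset: 94 80 E2 and 99 80 E2 would show as îÄ‚ and ôÄ‚ in Mac OS Roman, for example, so UTF-8 input is still plausible. Either way, the cause is the same: you are reversing the encoded chars in the string as-is, which will not work for a multi-byte encoding.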
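To check which bytes are really in the file, it can help to print them in hex instead of trusting how an editor or terminal renders them. The following is a minimal sketch; `to_hex` is a hypothetical helper name, not something from the question's code.

```cpp
#include <string>

// Hypothetical helper: format every byte of `bytes` as two uppercase
// hex digits, separated by single spaces.
std::string to_hex(const std::string& bytes)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {      // unsigned so bytes >= 0x80 index correctly
        if (!out.empty()) out.push_back(' ');
        out.push_back(digits[b >> 4]);   // high nibble
        out.push_back(digits[b & 0x0F]); // low nibble
    }
    return out;
}
```

If the file is UTF-8, passing the bytes of an em dash to `to_hex` should yield `E2 80 94`.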

To properly reverse a multi-byte encoded string, you would have to know the encoding beforehand and walk through the string based on that encoding, extracting each whole sequence of encoded units and saving each whole sequence as-is to the output, rather than reading and saving the individual chars as-is. std::reverse() can't help you with that, unless you use iterators that know how to read and write those whole sequences.
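As an illustration, here is a minimal sketch of that idea, assuming the input is valid UTF-8 (the question does not confirm the encoding). `reverse_utf8` is a hypothetical name. It reverses all bytes first, then restores the byte order inside each multi-byte sequence by relying on UTF-8 continuation bytes having the bit pattern 10xxxxxx.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: reverse a string code point by code point.
// Assumption: `s` holds valid UTF-8.
std::string reverse_utf8(std::string s)
{
    // Step 1: reverse every byte. Each multi-byte sequence now has its
    // continuation bytes (10xxxxxx) first and its lead byte last.
    std::reverse(s.begin(), s.end());

    // Step 2: re-reverse each sequence so its lead byte comes first again.
    auto first = s.begin();
    while (first != s.end()) {
        auto last = first;
        // Skip continuation bytes until the lead (or ASCII) byte is reached.
        while (last != s.end() &&
               (static_cast<unsigned char>(*last) & 0xC0) == 0x80)
            ++last;
        if (last != s.end()) ++last; // include the lead byte itself
        std::reverse(first, last);
        first = last;
    }
    return s;
}
```

This keeps each code point intact, but it still reverses combining marks and other multi-codepoint grapheme clusters separately from their base characters.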

If you know the encoding beforehand, you may have better luck using std::wifstream/std::wofstream instead, imbue()'ed with a suitable std::locale for the encoding. Then use std::wstring instead of std::string. However, on Windows at least, where std::wstring uses UTF-16, you still have to deal with multi-unit sequences (surrogate pairs, though these occur less often, mostly for emoji and rarer East Asian characters). So you may have to convert the decoded UTF-16 input to UTF-32 before doing the reversing (and even then you have to deal with multi-codepoint grapheme clusters), then convert the UTF-32 back to UTF-16 before saving it encoded to the output file.

Also, if you are going to handle the individual chars as-is, then to ensure the raw chars are read and written correctly, you should open the files in binary mode and use unformatted input/output operations (i.e., no operator>> or operator<<):

ifstream ifs(name, std::ios::binary);
if (!ifs) throw runtime_error("Couldn't open input file.");

ofstream ofs("output.txt", std::ios::binary);
if (!ofs) throw runtime_error("Couldn't open output file.");

// Note: there are easier ways to read a file into a std::string!
// See: https://stackoverflow.com/questions/116038/
string s;
for(char ch; ifs.get(ch);)
    s.push_back(ch);

reverse(s.begin(), s.end());

for(char ch: s)
    ofs.put(ch);
// alternatively:
// ofs.write(s.c_str(), s.size());
Remy Lebeau