
I used the following code to read an ASCII file:

#include <fstream>
#include <streambuf>
#include <string>
#include <cerrno>

std::string get_file_contents(const char *filename)
{
  std::ifstream in(filename, std::ios::in | std::ios::binary);
  if (in)
  {
    return(std::string((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>()));
  }
  throw(errno);
}

I want to confirm whether it will also work for reading a UTF-8 file into a `std::string`, or whether any special settings are needed.

user963241
  • `std::string` is more a string of bytes than string of UTF-8 encoding units. Should work fine. – Eljay Apr 08 '19 at 23:10
  • A string will store any encoding you like. The tricky part is what you do with it once it is in there. – Galik Apr 08 '19 at 23:54
  • All true but some string functions use a locale that does include a character encoding. And, if you have some strings with one character encoding and some strings with another, good luck. Maybe you have a more specific question. – Tom Blodget Apr 09 '19 at 00:09
  • Related: [What is the best way to read an entire file into a std::string in C++?](https://stackoverflow.com/questions/116038/) There are many different ways to approach this. – Remy Lebeau Apr 09 '19 at 01:02

1 Answer


It's fine to read UTF-8 data like this; it's just a sequence of bytes, after all. Only when you further process, convert, or output the text do you need to make sure the encoding is taken into account.

One potential pitfall is the BOM (https://en.wikipedia.org/wiki/Byte_order_mark). If your text file starts with a BOM, you may want to strip it from the string or handle it appropriately. There is no need for a BOM with UTF-8, but some software writes one anyway, presumably to mark the file's encoding. Notepad on Windows does this, for example (save a file from Notepad with UTF-8 encoding and open it in a hex editor to see the BOM bytes).
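As a sketch of handling that case, a helper like the following could strip a leading UTF-8 BOM from the string returned by `get_file_contents` (the name `strip_utf8_bom` is hypothetical, not a standard API):

```cpp
#include <string>

// Hypothetical helper: removes a leading UTF-8 BOM (EF BB BF), if present.
std::string strip_utf8_bom(std::string s)
{
    if (s.size() >= 3 &&
        static_cast<unsigned char>(s[0]) == 0xEF &&
        static_cast<unsigned char>(s[1]) == 0xBB &&
        static_cast<unsigned char>(s[2]) == 0xBF)
    {
        s.erase(0, 3);  // drop the three BOM bytes
    }
    return s;
}
```

The `unsigned char` casts matter: on platforms where plain `char` is signed, comparing `s[0]` directly against `0xEF` would compare a negative value against a positive one and never match.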

J.R.