ifstream reading extra characters (CR 13 LF 10) at end of lines

Question

I am currently reading a text file that has newlines occupying 2 bytes, since it writes newline as CRLF instead of only LF.

std::fstream fileReader = std::fstream(_filename, std::ios::in);

// READ THE SIZE OF THE FILE:
fileReader.seekg(0, fileReader.end); // set reader position to the end
std::streamsize fileSize = fileReader.tellg(); // get reader position
fileReader.seekg(0, fileReader.beg); // set reader position to the start

// SET UP THE BUFFER:
std::vector<char> buffer; buffer.resize(fileSize, '\0');
buffer.back() = '\0';

// READ:
fileReader.read(buffer.data(), fileSize);

The problem is that, "fileSize" IS actually the size of the file, not the amount of characters-that-are-not-CF in the file -which is what it's expecting.

Is there a way to get that number automatically?

Otherwise, I suppose binary mode is the only option left, though it would be pretty disappointing, as I was expecting proper automatic formatting when not using binary mode. Also, the .read function fails (fileReader's failbit is true).

@tadman It's not writing, it's reading. I could modify the file to remove the CF without an issue, I'm just disappointed that it doesn't work automatically - ifstream is clearly aware that there are CF characters, since it removes them, and yet it expects me to give the amount of non-CF-characters instead of the total amount of characters. I feel like I shouldn't need to "fix up files" for ifstream to work. — ZeroZ30o, Dec 14 '20 at 02:39
CF meaning "CRLF" sequences? The file size is expected to include any line delimiters. If you want the "net" size minus that you'll have to read and filter to find out. The operating system does not provide this information, so C++ has no way of knowing short of doing the math. Not sure how that difference is relevant here since you need the entire file in the buffer, so the buffer must be big enough to accommodate that. — tadman, Dec 14 '20 at 02:40
One thing to consider is instead of using a raw character buffer, just [dump it into a `std::string`](https://stackoverflow.com/questions/2602013/read-whole-ascii-file-into-c-stdstring) instead. — tadman, Dec 14 '20 at 02:42
@tadman ifstream has a flag for reading binary, if that flag is not set it is meant to automatically parse text files. Again, when using .read into the buffer, it removes the CF characters, so it's clearly doing some work to remove them. Yet it throws a hissy fit if I don't additionally give it that number. I am just asking if, since it's clearly able to detect those CF characters, maybe I'm missing a solution? — ZeroZ30o, Dec 14 '20 at 02:43
It needs to know how big the target buffer is so it doesn't exceed the bounds. This is to prevent buffer overflow bugs. That you're over-allocating by a small amount is usually not a big deal. The overhead is marginal, and remember `std::vector` already has some slack built into it anyway. — tadman, Dec 14 '20 at 02:44
It's able to map them *when reading*, but not casually skipping around the file using `seekg` and such. Computing the actual length is expensive, so C++ does not do it unless you explicitly ask. — tadman, Dec 14 '20 at 02:45
@tadman That is incorrect and irrelevant, since I provide both the buffer size and the amount of characters to read. No buffer overflow is occurring. What is occurring is that the failbit of fstream is set when I specify the true file size, but not if I specify (file size minus the amount of lines). It's expecting the AMOUNT OF NON-CF CHARACTERS, despite doing internal work to filter CF characters out (it doesn't copy them into the buffer). — ZeroZ30o, Dec 14 '20 at 02:47
@tadman Yes, that is what's bothering me so much about this, is that tellg provides me with the correct file size at the very end, and yet it freaks out, for some reason. If I don't give the read call (file size - amount of lines), it errors. When it errors, if I check tellg, the entire file has been read. Yet it removed all the CF characters - expecting me to notice they're missing... I just now realized it's an implementation choice, so that the buffer size equals the amount of characters written to it... damn I'm disappointed in whoever decided on this, files are supposed to be simple. — ZeroZ30o, Dec 14 '20 at 02:51
*"which is what it's expecting."* -- what is "it" in this phrase? If I diagram that sentence, the antecedent appears to be "fileSize", but that does not make sense. — JaMiT, Dec 14 '20 at 02:56
@Eljay reading the file on Windows, which according to this https://stackoverflow.com/questions/44645395/why-does-the-newline-character-in-an-ifstream-file-when-read-by-this-code-oc matches the correct formatting — ZeroZ30o, Dec 14 '20 at 02:57
@JaMiT "which is what it's expecting" refers to the last amount I wrote, so, "the amount of characters-that-are-not-CF", but I can see how that's confusing. A better wording would be: it's expecting me to give it "the number of characters that will be extracted into the buffer". My issue is that I cannot predict how many CF characters it will ignore without parsing it myself, which defeats the whole purpose of having .read work in non-binary mode in the first place. — ZeroZ30o, Dec 14 '20 at 02:59
@ZeroZ30o Huh? You're saying I could re-write that sentence as "The problem is that, 'fileSize' IS actually the size of the file, not the amount of characters-that-are-not-CF in the file -which is what [the amount of characters-that-are-not-CF] is expecting."??? *I asked for the antecedent of "it", not the antecedent of "which". The same problem exists in your "better wording" -- **what** is expecting [X]?* — JaMiT, Dec 14 '20 at 03:01
@JaMiT I'm confused at what you just said. What I'm saying is that it errors because it's expecting me to provide the amount of characters that will be extracted. It is not expecting me to give the file size, even though that's clearly what it's using internally to know when to stop. It also clearly knows about how many CF characters it skipped over, so it could easily provide the amount of characters extracted and allow me to provide the file size. I'm just saying, it would be hard for anyone on Earth to provide an implementation more annoying than this one. — ZeroZ30o, Dec 14 '20 at 03:10
If you switch to Linux, where line endings are just `NL`s, you won't have this problem: getting the size of the file in this manner (or via `stat`) actually gives you the exact number of characters you will be able to read, from the beginning of the file, before reaching its end. The only other option is to open the file with the `ios::binary` flag, but that means that you will see `CR`s in the file when you read it, and you must deal with them yourself. — Sam Varshavchik, Dec 14 '20 at 03:10
@SamVarshavchik Yeah, I know about the Linux thing - they're doing it the logical way, props to them. And yes, I have concluded the only choice is to use ios::binary as well, it's just stupid that they even allow people to use .read without the binary flag, clearly the specification was not thought through. — ZeroZ30o, Dec 14 '20 at 03:16
*Is there a way to get that number automatically?* **No, alas.** Windows uses CRLF, which it got from DOS, which DOS got from CP/M. Reading line-by-line will convert the CRLF to end-of-line, but reading the data *en masse* won't work (if I recall correctly; it's been a while, I use *binary* to read/write on Windows). — Eljay, Dec 14 '20 at 03:20
@ZeroZ30o *"What I'm saying is that it errors"* -- this has the same lack of meaning as your earlier statements. "It", "it", "it" with no indication of what "it" is. **What is this so-called "it" that errors?** What is expecting something? — JaMiT, Dec 14 '20 at 03:23
@JaMiT The .read call errors, what else would error? It's the only thing that's called that CAN error, since the buffer is resized to an appropriate size. If you wanna get petty: "the ifstream reference returned by the .read call, when converted to a boolean, returns false - its failbit is set". And this last "its" refers to the ifstream reference. Which is the only thing that has a failbit. — ZeroZ30o, Dec 14 '20 at 03:25
@ZeroZ30o OK, so is the `read` call also the thing you claim is expecting "the amount of characters-that-are-not-CF in the file"? (Because the `read` call does not expect that. Going over by a small amount is fine.) — JaMiT, Dec 14 '20 at 03:30
@JaMiT Yes, that's what I claim. And also, false. Going over by only 1 causes an error. That is, if I tell it to extract 10 characters (because the file size it gave me was 10) but it only extracts 9 (because it found and ignored a CF character) and reaches the end of the file, the file stream fails. — ZeroZ30o, Dec 14 '20 at 03:33
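
The `std::string` suggestion from the comments above could look roughly like the following. This is only a sketch: the function name `slurp_text_file` is made up for illustration, and the file is opened in text mode, so on Windows each CRLF would be expected to arrive as a single '\n'. Because the string grows as characters are extracted, no size has to be computed up front.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not from the thread): reads the whole file through
// stream buffer iterators, so the caller never has to supply a character count.
std::string slurp_text_file(const std::string& filename)
{
    std::ifstream in(filename); // text mode: newline translation may apply

    // Copies characters until the stream buffer reports end of file.
    // If the file could not be opened, the result is an empty string.
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The trade-off is that the string may reallocate a few times while it grows, which is usually acceptable for files that fit comfortably in memory.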

score 1 · Answer 1 · answered Dec 14 '20 at 03:32

Is there a way to get that number automatically?

This cannot be done automatically since the file system does not store the number of line endings in a file. Any approach would need to go through the file inspecting each character. Fortunately, it is possible to leverage the std::fstream class to handle most of the grunt work. The resulting function is surprisingly similar to what you currently have. You just need to grab the number of characters read.

#include <fstream>
#include <vector>

// Gets the number of characters in `textfile` accounting for CR-LF being read as one character.
// The stream will be reset to the beginning when this function returns.
std::streamsize char_count(std::fstream & textfile)
{
    std::streamsize count = textfile.gcount();

    // Get an upper bound on the size.
    textfile.clear();
    textfile.seekg(0, textfile.end); // set reader position to the end
    std::streamsize fileSize = textfile.tellg(); // get reader position

    if ( textfile  &&  fileSize != -1 )
    {
        // Read the text.
        std::vector<char> buffer(fileSize);
        textfile.seekg(0, textfile.beg); // set reader position to the start
        textfile.read(buffer.data(), fileSize);

        // Get the count of characters read.
        count = textfile.gcount();
    }

    // Reset the stream.
    textfile.clear(); // Because over-reading would set some flags.
    textfile.seekg(0, textfile.beg);
    textfile.clear(); // In case the seek failed. We did promise to reset the stream.

    return count;
}

Seems a bit wasteful to do this, then repeat the read once you have the number of characters, but since you won't tell us your real problem this might start you in a better direction.
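
If the goal is just to get the text into a buffer, the second read might be avoided by treating the file size as an upper bound and shrinking the buffer to `gcount()` after a single read. The sketch below assumes that approach; `read_text_file` is a made-up name, and it relies on `seekg`/`tellg` reporting the size for a text-mode stream, as in the question.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): reads a file opened in
// text mode and returns only the characters that were actually extracted.
std::vector<char> read_text_file(const std::string& filename)
{
    std::ifstream in(filename); // text mode: CRLF may be translated to '\n'
    if (!in) return {};

    // The size in bytes is only an upper bound on the translated count.
    in.seekg(0, std::ios::end);
    const std::streamoff end_pos = in.tellg();
    in.seekg(0, std::ios::beg);
    if (!in || end_pos < 0) return {};

    std::vector<char> buffer(static_cast<std::size_t>(end_pos));
    in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));

    // A short read sets eofbit and failbit, but gcount() still reports how
    // many characters were stored, so the buffer is trimmed to that count.
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}
```

Here the failbit after a short read is treated as expected rather than as an error, which is the key difference from checking the stream state directly after `read`.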

gcount() returns "the number of characters extracted by the last unformatted input operation performed on the object.", according to cplusplus.com. I have not tested it, but I suspect that, because of this, it won't work. Because it would be a formatted operation (unless you use ios::binary, in which case there's no point in doing any of this). — ZeroZ30o, Dec 14 '20 at 03:37
@ZeroZ30o Suspect what you want. Keep your misunderstandings of streams if you want. This answer is a stop-gap measure. You would have been better off asking about the `fstream`'s "hissy fit" (your real problem) instead of asking about what you believe will resolve the situation (your [XY problem](https://en.wikipedia.org/wiki/XY_problem) -- read the linked article if you are not familiar with the term). — JaMiT, Dec 15 '20 at 00:52
sorry, forgot you were in my head and you knew what my problem was better than me. I mean, after all, I should really just give you my job so you can do everything for me, right? I wouldn't want to make any choices for myself now, would I? That would be terrible -it would be impossible for me to ask about loading a text file into a buffer if that's what I needed, there must be some other thing I'm trying to achieve instead, or I must be under the effect of a psychological bias because someone like me could NEVER ask the right thing :) — ZeroZ30o, Dec 15 '20 at 01:12

ZeroZ30o · Accepted Answer · 2020-12-21T21:42:24.190

This is normal behavior:

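ifstream::read takes the number of characters to extract, not the file size. If fewer characters are available, for example because each CRLF was translated to a single '\n' in text mode, the read reaches the end of the file early and sets failbit (along with eofbit).

Since this amount cannot be known before the file has been read (gcount() only reports it after the fact), sizing an ifstream::read call from the file size without the ios::binary flag is unreliable, unless the file is KNOWN not to contain any CRLF sequences that would make the read come up short.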

(I do not know if other characters also make ifstream freak out)

I suggest using the ios::binary flag with fstream, even for reading text files, to read them in one operation. Probably for writing files too (especially if you're on Windows).
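
A minimal sketch of the binary-mode approach, under the assumption that the caller wants '\n' line endings afterwards. The name `read_file_binary` and the CRLF-collapsing step are illustrative additions, not part of the original answer; drop the loop if the raw bytes are wanted.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: reads the file byte for byte in binary mode, then
// collapses each CR LF pair into a single LF.
std::string read_file_binary(const std::string& filename)
{
    std::ifstream in(filename, std::ios::in | std::ios::binary);
    if (!in) return {};

    // In binary mode the size in bytes matches the number of characters read.
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    in.seekg(0, std::ios::beg);
    if (!in || size < 0) return {};

    std::string data(static_cast<std::size_t>(size), '\0');
    in.read(&data[0], static_cast<std::streamsize>(size));
    data.resize(static_cast<std::size_t>(in.gcount()));

    // Optional normalization, done in place: skip a CR only when an LF follows.
    std::size_t out = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == '\r' && i + 1 < data.size() && data[i + 1] == '\n')
            continue; // the LF is copied on the next iteration
        data[out++] = data[i];
    }
    data.resize(out);
    return data;
}
```

A lone CR that is not followed by LF is kept as-is, since it is not part of a Windows line ending.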
