-1

I am currently reading a text file that has newlines occupying 2 bytes, since it writes newline as CRLF instead of only LF.

std::fstream fileReader = std::fstream(_filename, std::ios::in);

// READ THE SIZE OF THE FILE:
fileReader.seekg(0, fileReader.end); // set reader position to the end
std::streamsize fileSize = fileReader.tellg(); // get reader position
fileReader.seekg(0, fileReader.beg); // set reader position to the start

// SET UP THE BUFFER:
std::vector<char> buffer; buffer.resize(fileSize, '\0');
buffer.back() = '\0';

// READ:
fileReader.read(buffer.data(), fileSize);

The problem is that, "fileSize" IS actually the size of the file, not the amount of characters-that-are-not-CF in the file -which is what it's expecting.

Is there a way to get that number automatically?

Otherwise, I suppose binary mode is the only option left -though it would be pretty disappointing, as I was expecting proper automatic formatting when not using binary mode. Also, the .read function fails (fileReader's failbit is true)

ZeroZ30o
  • Do you want it to write LF-only? – tadman Dec 14 '20 at 02:37
  • @tadman It's not writing, it's reading. I could modify the file to remove the CF without an issue, I'm just disappointed that it doesn't work automatically - ifstream is clearly aware that there are CF characters, since it removes them, and yet it expects me to give the amount of non-CF-characters instead of the total amount of characters. I feel like I shouldn't need to "fix up files" for ifstream to work. – ZeroZ30o Dec 14 '20 at 02:39
  • CF meaning "CRLF" sequences? The file size is expected to include any line delimiters. If you want the "net" size minus that you'll have to read and filter to find out. The operating system does not provide this information, so C++ has no way of knowing short of doing the math. Not sure how that difference is relevant here since you need the entire file in the buffer, so the buffer must be big enough to accommodate that. – tadman Dec 14 '20 at 02:40
  • One thing to consider is instead of using a raw character buffer, just [dump it into a `std::string`](https://stackoverflow.com/questions/2602013/read-whole-ascii-file-into-c-stdstring) instead. – tadman Dec 14 '20 at 02:42
  • @tadman ifstream has a flag for reading binary, if that flag is not set it is meant to automatically parse text files. Again, when using .read into the buffer, it removes the CF characters, so it's clearly doing some work to remove them. Yet it throws a hissy fit if I don't additionally give it that number. I am just asking if, since it's clearly able to detect those CF characters, maybe I'm missing a solution? – ZeroZ30o Dec 14 '20 at 02:43
  • It needs to know how big the target buffer is so it doesn't exceed the bounds. This is to prevent buffer overflow bugs. That you're over-allocating by a small amount is usually not a big deal. The overhead is marginal, and remember `std::vector` already has some slack built into it anyway. – tadman Dec 14 '20 at 02:44
  • It's able to map them *when reading*, but not casually skipping around the file using `seekg` and such. Computing the actual length is expensive, so C++ does not do it unless you explicitly ask. – tadman Dec 14 '20 at 02:45
  • @tadman That is incorrect and irrelevant, since I provide both the buffer size and the amount of characters to read. No buffer overflow is occurring. What is occurring is that the failbit of fstream is set when I specify the true file size, but not if I specify (file size minus the amount of lines). It's expecting the AMOUNT OF NON-CF CHARACTERS, despite doing internal work to filter CF characters out (it doesn't copy them into the buffer). – ZeroZ30o Dec 14 '20 at 02:47
  • @tadman Yes, that is what's bothering me so much about this, is that tellg provides me with the correct file size at the very end, and yet it freaks out, for some reason. If I don't give the read call (file size - amount of lines), it errors. When it errors, if I check tellg, the entire file has been read. Yet it removed all the CF characters - expecting me to notice they're missing... I just now realized it's an implementation choice, so that the buffer size equals the amount of characters written to it... damn I'm disappointed in whoever decided on this, files are supposed to be simple. – ZeroZ30o Dec 14 '20 at 02:51
  • Are you reading the file on Windows? Or some other OS? – Eljay Dec 14 '20 at 02:52
  • *"which is what it's expecting."* -- what is "it" in this phrase? If I diagram that sentence, the antecedent appears to be "fileSize", but that does not make sense. – JaMiT Dec 14 '20 at 02:56
  • @Eljay reading the file on Windows, which according to this https://stackoverflow.com/questions/44645395/why-does-the-newline-character-in-an-ifstream-file-when-read-by-this-code-oc matches the correct formatting – ZeroZ30o Dec 14 '20 at 02:57
  • @JaMiT "which is what it's expecting" refers to the last amount I wrote, so, "the amount of characters-that-are-not-CF", but I can see how that's confusing. A better wording would be: it's expecting me to give it "the number of characters that will be extracted into the buffer". My issue is that I cannot predict how many CF characters it will ignore without parsing it myself, which defeats the whole purpose of having .read work in non-binary mode in the first place. – ZeroZ30o Dec 14 '20 at 02:59
  • @ZeroZ30o Huh? You're saying I could re-write that sentence as "The problem is that, 'fileSize' IS actually the size of the file, not the amount of characters-that-are-not-CF in the file -which is what [the amount of characters-that-are-not-CF] is expecting."??? *I asked for the antecedent of "it", not the antecedent of "which". The same problem exists in your "better wording" -- **what** is expecting [X]?* – JaMiT Dec 14 '20 at 03:01
  • @JaMiT I'm confused at what you just said. What I'm saying is that it errors because it's expecting me to provide the amount of characters that will be extracted. It is not expecting me to give the file size, even though that's clearly what it's using internally to know when to stop. It also clearly knows about how many CF characters it skipped over, so it could easily provide the amount of characters extracted and allow me to provide the file size. I'm just saying, it would be hard for anyone on Earth to provide an implementation more annoying than this one. – ZeroZ30o Dec 14 '20 at 03:10
  • If you switch to Linux you won't have this problem, since line endings there are just `NL`s, so getting the size of the file in this manner (or via `stat`) actually gives you the exact number of characters you will be able to read, from the beginning of the file, before reaching its end. The only other option is to open the file with the `ios::binary` flag, but that means that you will see `CR`s in the file when you read it, and you must deal with them yourself. – Sam Varshavchik Dec 14 '20 at 03:10
  • @SamVarshavchik Yeah, I know about the linux thing - they're doing it the logical way, props to them. And yes, I have concluded the only choice is to use ios:binary as well, it's just stupid that they even allow people to use .read without the binary flag, clearly the specification was not thought through. – ZeroZ30o Dec 14 '20 at 03:16
  • *Is there a way to get that number automatically?* **No, alas.** Windows uses CRLF, which it got from DOS, which DOS got from CP/M. Reading line-by-line will convert the CRLF to end-of-line, but reading the data *en masse* won't work (if I recall correctly; it's been a while, I use *binary* to read/write on Windows). – Eljay Dec 14 '20 at 03:20
  • @ZeroZ30o *"What I'm saying is that it errors"* -- this has the same lack of meaning as your earlier statements. "It", "it", "it" with no indication of what "it" is. **What is this so-called "it" that errors?** What is expecting something? – JaMiT Dec 14 '20 at 03:23
  • @JaMiT The .read call errors, what else would error? It's the only thing that's called that CAN error, since the buffer is resized to an appropriate size. If you wanna get petty: "the ifstream reference returned by the .read call, when converted to a boolean, returns false - its failbit is set". And this last "its" refers to the ifstream reference. Which is the only thing that has a failbit. – ZeroZ30o Dec 14 '20 at 03:25
  • @ZeroZ30o OK, so is the `read` call also the thing you claim is expecting "the amount of characters-that-are-not-CF in the file"? (Because the `read` call does not expect that. Going over by a small amount is fine.) – JaMiT Dec 14 '20 at 03:30
  • @JaMiT Yes, that's what I claim. And also, false. Going over by only 1 causes an error. That is, if I tell it to extract 10 characters (because the file size it gave me was 10) but it only extracts 9 (because it found and ignored a CF character) and reaches the end of the file, the file stream fails. – ZeroZ30o Dec 14 '20 at 03:33

2 Answers

1

Is there a way to get that number automatically?

This cannot be done automatically since the file system does not store the number of line endings in a file. Any approach would need to go through the file inspecting each character. Fortunately, it is possible to leverage the std::fstream class to handle most of the grunt work. The resulting function is surprisingly similar to what you currently have. You just need to grab the number of characters read.

#include <fstream>
#include <vector>

// Gets the number of characters in `textfile` accounting for CR-LF being read as one character.
// The stream will be reset to the beginning when this function returns.
// Returns 0 if the size of the file cannot be determined.
std::streamsize char_count(std::fstream & textfile)
{
    std::streamsize count = 0;

    // Get an upper bound on the size.
    textfile.clear();
    textfile.seekg(0, textfile.end); // set reader position to the end
    std::streamsize fileSize = textfile.tellg(); // get reader position

    if ( textfile  &&  fileSize != -1 )
    {
        // Read the text.
        std::vector<char> buffer(fileSize);
        textfile.seekg(0, textfile.beg); // set reader position to the start
        textfile.read(buffer.data(), fileSize);

        // Get the count of characters read.
        count = textfile.gcount();
    }

    // Reset the stream.
    textfile.clear(); // Because over-reading would set some flags.
    textfile.seekg(0, textfile.beg);
    textfile.clear(); // In case the seek failed. We did promise to reset the stream.

    return count;
}

It seems a bit wasteful to do this and then repeat the read once you have the number of characters, but since you won't tell us your real problem, this might start you in a better direction.
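If the second pass is a concern, the same idea can probably be folded into a single read. The following is only a minimal sketch, not something from the question: `read_text_file` is a hypothetical helper name, and it assumes that the byte size reported by `tellg()` is a usable upper bound for a text-mode stream on your platform. It over-allocates, reads once, and then shrinks the buffer to whatever `gcount()` reports. The short read still sets `eofbit` and `failbit`, which is expected here and does not mean the data is missing.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper (not from the original post): reads a whole file in
// text mode, treating the on-disk size only as an upper bound.
std::string read_text_file(const std::string& path)
{
    std::ifstream in(path); // text mode: CR-LF may be translated to '\n'
    if (!in)
        return {};

    // Assumption: the end position is a valid upper bound on the character count.
    in.seekg(0, std::ios::end);
    const std::streamoff upper_bound = in.tellg();
    in.seekg(0, std::ios::beg);
    if (!in || upper_bound <= 0)
        return {};

    std::string text(static_cast<std::size_t>(upper_bound), '\0');
    in.read(&text[0], static_cast<std::streamsize>(upper_bound));

    // If fewer characters were extracted than requested, the stream flags are
    // set, but gcount() still reports how many characters were stored.
    text.resize(static_cast<std::size_t>(in.gcount()));
    return text;
}
```

A caller would then use `text.size()` as the translated character count instead of the file size, so no separate counting pass is needed.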

JaMiT
  • gcount() returns "the number of characters extracted by the last unformatted input operation performed on the object.", according to cplusplus.com. I have not tested it, but I suspect that, because of this, it won't work. Because it would be a formatted operation (unless you use ios::binary, in which case there's no point in doing any of this). – ZeroZ30o Dec 14 '20 at 03:37
  • @ZeroZ30o Suspect what you want. Keep your misunderstandings of streams if you want. This answer is a stop-gap measure. You would have been better off asking about the `fstream`'s "hissy fit" (your real problem) instead of asking about what you believe will resolve the situation (your [XY problem](https://en.wikipedia.org/wiki/XY_problem) -- read the linked article if you are not familiar with the term). – JaMiT Dec 15 '20 at 00:52
  • sorry, forgot you were in my head and you knew what my problem was better than me. I mean, after all, I should really just give you my job so you can do everything for me, right? I wouldn't want to make any choices for myself now, would I? That would be terrible -it would be impossible for me to ask about loading a text file into a buffer if that's what I needed, there must be some other thing I'm trying to achieve instead, or I must be under the effect of a psychological bias because someone like me could NEVER ask the right thing :) – ZeroZ30o Dec 15 '20 at 01:12
-1

This is normal behavior:

ifstream::read expects "the amount of characters that will be extracted" as a parameter, not the file size.

Since this amount is impossible to predict, and the function does not provide it either, using ifstream::read without the ios::binary flag is completely useless, unless the file is KNOWN to not contain any CF characters that will make ifstream freak out.

(I do not know if other characters also make ifstream freak out)

I suggest opening the fstream with the ios::binary flag, even for text files, so that they can be read in one operation. The same probably applies to writing files (especially if you're on Windows).
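For reference, a binary-mode read along those lines might look like the sketch below. The name `read_file_binary` is made up for illustration, and the code assumes the file fits in memory. Since no newline translation happens in binary mode, the size from `tellg()` should match the number of bytes extracted, and any CR characters stay in the result for the caller to handle.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper (not from the original post): reads a whole file as
// raw bytes. CR-LF pairs are kept exactly as stored on disk.
std::string read_file_binary(const std::string& path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in)
        return {};

    // In binary mode the end position is the number of bytes in the file.
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    in.seekg(0, std::ios::beg);
    if (!in || size <= 0)
        return {};

    std::string bytes(static_cast<std::size_t>(size), '\0');
    in.read(&bytes[0], static_cast<std::streamsize>(size));

    // Keep only what was actually extracted, in case the read came up short.
    bytes.resize(static_cast<std::size_t>(in.gcount()));
    return bytes;
}
```

If the text is later processed line by line, the trailing `'\r'` on each line has to be stripped by hand, which is the trade-off of skipping text mode.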

ZeroZ30o