Am I right that this code introduces undefined behavior?

#include <stdio.h>
#include <stdlib.h>

FILE *f = fopen("textfile.txt", "rb");
fseek(f, 0, SEEK_END);
long fsize = ftell(f);
fseek(f, 0, SEEK_SET);  //same as rewind(f);

char *string = malloc(fsize + 1);
fread(string, fsize, 1, f);
fclose(f);

string[fsize] = 0;

The reason I'm asking is that this code is posted as an accepted and highly-upvoted answer to the following question: C Programming: How to read the whole file contents into a buffer

However, according to the following article: How to read an entire file into memory in C++ (which, despite its title, also deals with C, so stick with me):

Suppose you were writing C, and you had a FILE* (that you know points to a file stream, or at least a seekable stream), and you wanted to determine how many characters to allocate in a buffer to store the entire contents of the stream. Your first instinct would probably be to write code like this:

// Bad code; undefined behaviour
fseek(p_file, 0, SEEK_END);
long file_size = ftell(p_file);

Seems legit. But then you start getting weirdness. Sometimes the reported size is bigger than the actual file size on disk. Sometimes it’s the same as the actual file size, but the number of characters you read in is different. What the hell is going on?

There are two answers, because it depends on whether the file has been opened in text mode or binary mode.

Just in case you don't know the difference: in the default mode – text mode – on certain platforms, certain characters get translated in various ways during reading. The most well-known is that on Windows, newlines get translated to \r\n when written to a file, and translated the other way when read. In other words, if the file contains Hello\r\nWorld, it will be read as Hello\nWorld; the file size is 12 characters, the string size is 11. Less well-known is that 0x1A (or Ctrl-Z) is interpreted as the end of the file, so if the file contains Hello\x1AWorld, it will be read as Hello. Also, if the string in memory is Hello\x1AWorld and you write it to a file in text mode, the file will be Hello. In binary mode, no translations are done – whatever is in the file gets read in to your program, and vice versa.
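
As a rough illustration of the newline translation described above, the sketch below writes a string in text mode and then counts the bytes that actually ended up in the file by re-reading it in binary mode. The helper name `stored_size` and any file name passed to it are made up for this example; they are not from the article.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes `text` to `path` in text mode, then reopens
   the file in binary mode and counts the bytes actually stored.
   Returns -1 on any I/O error. */
long stored_size(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);

    FILE *f = fopen(path, "w");      /* text mode: translations may apply */
    if (f == NULL)
        return -1;
    if (fputs(text, f) == EOF) {
        fclose(f);
        return -1;
    }
    if (fclose(f) != 0)
        return -1;

    f = fopen(path, "rb");           /* binary mode: bytes exactly as stored */
    if (f == NULL)
        return -1;

    long count = 0;
    while (fgetc(f) != EOF)
        count++;

    int failed = ferror(f);          /* distinguish a read error from EOF */
    fclose(f);
    return failed ? -1 : count;
}
```

For the string "Hello\nWorld", one would expect this to report 12 on Windows and 11 on typical POSIX systems, where text and binary streams behave the same.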

Immediately you can guess that text mode is going to be a headache – on Windows, at least. More generally, according to the C standard:

The ftell function obtains the current value of the file position indicator for the stream pointed to by stream. For a binary stream, the value is the number of characters from the beginning of the file. For a text stream, its file position indicator contains unspecified information, usable by the fseek function for returning the file position indicator for the stream to its position at the time of the ftell call; the difference between two such return values is not necessarily a meaningful measure of the number of characters written or read.

In other words, when you’re dealing with a file opened in text mode, the value that ftell() returns is useless… except in calls to fseek(). In particular, it doesn’t necessarily tell you how many characters are in the stream up to the current point.

So you can’t use the return value from ftell() to tell you the size of the file, the number of characters in the file, or for anything (except in a later call to fseek()). So you can’t get the file size that way.
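
A minimal sketch of the one use the standard does guarantee: treating the value from ftell() as an opaque bookmark and handing it back to fseek() with SEEK_SET. The helper name `reread_matches` is hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one character, seeks back to an ftell()
   bookmark, and reads again.
   Returns 1 if both reads match, 0 if they differ, -1 on error or EOF. */
int reread_matches(FILE *f)
{
    assert(f != NULL);

    long mark = ftell(f);            /* opaque token on a text stream */
    if (mark == -1L)
        return -1;

    int first = fgetc(f);
    if (first == EOF)
        return -1;

    /* The only portable use of `mark`: return to the saved position. */
    if (fseek(f, mark, SEEK_SET) != 0)
        return -1;

    int second = fgetc(f);
    return first == second;
}
```

Nothing in this sketch does arithmetic on `mark`, which is the point: the value is only passed back to fseek().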

Okay, so to hell with text mode. What say we work in binary mode only? As the C standard says: "For a binary stream, the value is the number of characters from the beginning of the file." That sounds promising.

And, indeed, it is. If you are at the end of the file, and you call ftell(), you will find the number of bytes in the file. Huzzah! Success! All we need to do now is get to the end of the file. And to do that, all you need to do is fseek() with SEEK_END, right?

Wrong.

Once again, from the C standard:

Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state.

To understand why this is the case: Some platforms store files as fixed-size records. If the file is shorter than the record size, the rest of the block is padded. When you seek to the “end”, for efficiency’s sake it just jumps you right to the end of the last block… possibly long after the actual end of the data, after a bunch of padding.

So, here’s the situation in C:

  • You can’t get the number of characters with ftell() in text mode.
  • You can get the number of characters with ftell() in binary mode… but you can’t seek to the end of the file with fseek(p_file, 0, SEEK_END).

I don't have enough knowledge to judge who's right here, or whether the aforementioned accepted answer really clashes with this article, so I'm asking this question.

  • 1
    One thing: you did not check the return value of `malloc()`; if it fails, you'll have UB. – Sourav Ghosh Apr 25 '17 at 08:59
  • 1
    @SouravGhosh Sure thing, but that's not the core issue here. –  Apr 25 '17 at 09:00
  • 2
    Correct, that is why it's a comment, not an answer. :) – Sourav Ghosh Apr 25 '17 at 09:01
  • See [this answer](http://stackoverflow.com/a/39666403/971127). It's undefined behavior, so it's not portable. – BLUEPIXY Apr 25 '17 at 09:16
  • The most robust and portable way is still to read characters until EOF and count them. (and while you're at it you could store them into an array and resize the array when needed) – joop Apr 25 '17 at 10:50

1 Answer

What the author of the article is maliciously omitting is the context of the quote.

From the C11 draft standard n1570, NON-NORMATIVE FOOTNOTE 268:

Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state.

The normative part of the standard that refers to the footnote is 7.21.3 Files:

9 Although both text and binary wide-oriented streams are conceptually sequences of wide characters, the external file associated with a wide-oriented stream is a sequence of multibyte characters, generalized as follows:

— Multibyte encodings within files may contain embedded null bytes (unlike multibyte encodings valid for use internal to the program).

— A file need not begin nor end in the initial shift state. 268)

Note that this concerns wide-oriented streams.

Now, in 7.21.9.2 The fseek function:

3 For a binary stream, the new position, measured in characters from the beginning of the file, is obtained by adding offset to the position specified by whence. The specified position is the beginning of the file if whence is SEEK_SET, the current value of the file position indicator if SEEK_CUR, or end-of-file if SEEK_END. A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.

The final sentence there uses considerably less dire language:

"A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END."

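For code that has to run on implementations where SEEK_END is not meaningfully supported, one option (the one joop's comment on the question suggests) is to skip the size query and read until EOF, growing the buffer as needed. This is a minimal sketch, assuming the whole file fits in memory; `read_all` is a hypothetical helper, not a standard function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads everything remaining in `f` into a malloc'd,
   NUL-terminated buffer and stores the byte count in *out_len.
   Returns NULL on allocation failure or read error; the caller frees. */
char *read_all(FILE *f, size_t *out_len)
{
    assert(f != NULL && out_len != NULL);

    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        /* Always keep one byte spare for the terminating NUL. */
        len += fread(buf + len, 1, cap - 1 - len, f);
        if (len < cap - 1)
            break;                   /* short read: EOF or error */

        if (cap > SIZE_MAX / 2) {    /* doubling would overflow */
            free(buf);
            return NULL;
        }
        cap *= 2;
        char *tmp = realloc(buf, cap);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
    }

    if (ferror(f)) {
        free(buf);
        return NULL;
    }

    buf[len] = '\0';
    *out_len = len;
    return buf;
}
```

A caller would open the file with "rb", pass the stream to `read_all`, and free the returned buffer when done. No fseek() or ftell() call is involved, so neither the footnote nor the "need not meaningfully support" sentence comes into play.
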
EOF
  • C is designed to be implementable even on file systems that do rather strange and bizarre things. If a file system doesn't keep track of file sizes accurate to the byte, requiring that implementations do so would likely make them incapable of exchanging data with other programs. The authors of the Standard thus allow for implementations where binary files might not have a real concept of "EOF". That does not imply that any *quality* implementation running on a file system that naturally tracks file sizes should do anything other than behave in the obvious useful fashion. – supercat Apr 25 '17 at 22:38
  • The notion that a quality implementation should treat Undefined Behavior as "throwing the laws of time and causality out the window" rather than "behaving during translation or program execution in a documented manner characteristic of the environment", *even in cases where the environment would have a clear documented behavior*, may be fashionable, but should be recognized as stupid and destructive. – supercat Apr 25 '17 at 22:44
  • 1
    I'll have to disagree with your last point. Given the existence of explicitly *implementation defined* and *unspecified* behavior, there should be no need for implementations to also treat *undefined behavior* like *implementation defined*. If anything, the standard should perhaps be amended to specify a few more things as *implementation defined*. – EOF Apr 25 '17 at 22:47