C/C++ Why to use unsigned char for binary data?

Question

Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers? To make sense of my question, have a look at the code below -

char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';

printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);

Both printf calls print the correct output, where f0 a4 ad a2 is the UTF-8 encoding (in hex) of the Unicode code point U+24B62 (𤭢).

Even memcpy also correctly copied the bits held by a char.

What reasoning could possibly advocate the use of unsigned char instead of a plain char?

In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification. But as the above example showed, the output doesn't seem to be affected by any padding as such.

I have used VC++ Express 2010 and MinGW to compile the above. Although VC gave the warning

warning C4309: '=' : truncation of constant value

the output doesn't seem to reflect that.

P.S. This could be marked a possible duplicate of Should a buffer of bytes be signed or unsigned char buffer? but my intent is different. I am asking why something which seems to be working as fine with char should be typed unsigned char?

Update: To quote from N3337,

Section 3.9 Types

2 For any object (other than a base-class subobject) of trivially copyable type T, whether or not the object holds a valid value of type T, the underlying bytes (1.7) making up the object can be copied into an array of char or unsigned char. If the content of the array of char or unsigned char is copied back into the object, the object shall subsequently hold its original value.

In view of the above fact, and given that my original example was run on an Intel machine where char defaults to signed char, I am still not convinced that unsigned char should be preferred over char.

Anything else?

It is an established convention - why would you want to do it another way? Is there a specific scenario in which you are required to use `char`? — Björn Pollex, Nov 30 '12 at 09:40
If it is just convention I would be happy to follow it. But is there any technical, logical reason behind it? — nightlytrails, Nov 30 '12 at 09:41
If you are providing functions that are used for manipulating both binary and non-binary data, signed char can certainly be more convenient. It's painful having to convert to and from unsigned char when you are dealing with strings. — goji, Nov 30 '12 at 09:42
The question you link gives a technical reason - it is pretty clear on that. — Björn Pollex, Nov 30 '12 at 09:43
@BjörnPollex: In one of your answers to a question you yourself have used char - http://stackoverflow.com/questions/5420317/reading-and-writing-binary-file/5420568#5420568 :) Also, I have quoted the reason from there in my question - "padding", which doesn't seem to matter here, as the output shows. — nightlytrails, Nov 30 '12 at 09:44
Then I did something wrong there :) Thanks for catching that, will fix immediately. — Björn Pollex, Nov 30 '12 at 09:45
@BjörnPollex: be a bit careful though, `ifstream` is `basic_ifstream<char>`, not `basic_ifstream<unsigned char>`. I don't know whether that affects the fix you just made or not, but it isn't as simple as "in C++, stream data is `unsigned char`". The standard streams disagree. — Steve Jessop, Nov 30 '12 at 11:39
One reason is compatibility: just because it works on your system doesn't mean it will work on other systems (there are some very weird systems out there, some with chars of a completely different size and representation). — Olivier Dulac, Nov 30 '12 at 17:41
https://www.securecoding.cert.org/confluence/display/seccode/STR00-C.+Represent+characters+using+an+appropriate+type — nightlytrails, Dec 01 '12 at 17:36

Jens Gustedt · Accepted Answer · 2016-06-08T15:09:38.317

In C the unsigned char data type is the only data type that has all the following three properties simultaneously

it has no padding bits, that is, all storage bits contribute to the value of the data
no bitwise operation starting from a value of that type, when converted back into that type, can produce overflow, trap representations or undefined behavior
it may alias other data types without violating the "aliasing rules", that is that access to the same data through a pointer that is typed differently will be guaranteed to see all modifications

If these are the properties of a "binary" data type you are looking for, you definitely should use unsigned char.

For the second property we need an unsigned type. For unsigned types, all conversions are defined with modulo arithmetic, here modulo UCHAR_MAX+1, which is 256 on the vast majority of architectures. Converting a wider value to unsigned char therefore just corresponds to truncation to the least significant byte.

The two other character types generally don't work the same way. signed char is signed, so conversion of values that don't fit into it is not well defined (the result is implementation-defined). char is not fixed to be signed or unsigned: on a particular platform to which your code is ported it might be signed even if it is unsigned on yours.

Can you explain the second property better or give an example please? — sop, Jun 08 '16 at 09:06
"it may alias other data types without violating the "aliasing rules"" This is same for `char` too. — Calmarius, Jul 27 '19 at 18:31
@Calmarius And if `char` is signed, just adding two `char` values can overflow and result in undefined behavior. — Andrew Henle, Feb 05 '20 at 00:58

score 15 · Answer 2 · answered Nov 30 '12 at 10:46

You'll get most of your problems when comparing the contents of individual bytes:

char c[5];
c[0] = 0xff;
/*blah blah*/
if (c[0] == 0xff)
{
    printf("good\n");
}
else
{
    printf("bad\n");
}

This can print "bad" because, depending on your compiler, c[0] is sign-extended to -1 when it is promoted to int, which is not the same as 0xff (255).

Lundin · Answer 3 · 2012-11-30T10:06:58.653

The plain char type is problematic and shouldn't be used for anything but strings. The main problem with char is that you can't know whether it is signed or unsigned: this is implementation-defined behavior. This makes char different from int etc, int is always guaranteed to be signed.

Although VC gave the warning ... truncation of constant value

It is telling you that you are trying to store int literals inside char variables. This might be related to the signedness: if you try to store an integer with value > 0x7F inside a signed character, unexpected things might happen. Formally, the result of this conversion is implementation-defined in C (or an implementation-defined signal is raised), though in practice you'd just get a weird output if attempting to print the result as an integer value stored inside a (signed) char.

In this specific case, the warning shouldn't matter.

EDIT :

In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification.

In theory, all integer types except unsigned char and signed char are allowed to contain "padding bits", as per C11 6.2.6.2:

"For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter)."

"For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; signed char shall not have any padding bits."

The C standard is intentionally vague and fuzzy, allowing these theoretical padding bits because:

It allows different symbol tables than the standard 8-bit ones.
It allows implementation-defined signedness and weird signed integer formats such as one's complement or "sign and magnitude".
An integer may not necessarily use all bits allocated.

However, in the real world outside the C standard, the following applies:

Symbol tables are almost certainly 8 bits (UTF-8 or ASCII). Some weird exceptions exist, but clean implementations use the standard type wchar_t when implementing symbol tables larger than 8 bits.
Signedness is always two's complement.
An integer always uses all bits allocated.

So there is no real reason to use unsigned char or signed char just to dodge some theoretical scenario in the C standard.

@Lundin, integer data types may have padding **bits** not bytes. And yes, `unsigned char` is the only type that is guaranteed not to have padding bits. — Jens Gustedt, Nov 30 '12 at 09:58
So I suppose to avoid the undefined behavior gotcha - `c[0] = 0xF0;` using an `unsigned char` is a good idea? Also, if char is by default unsigned (as on ARM machines), even the above code is fine, but it is platform dependent in correctness as of now. So again, `unsigned char` should be used for platform independence. — nightlytrails, Nov 30 '12 at 10:00
@nightlytrails Alright, now I understand what you meant. I have updated the post with an explanation. — Lundin, Nov 30 '12 at 10:07
@JensGustedt I misunderstood the question. Anyway, neither `unsigned char` nor `signed char` may apparently contain padding bits, as per C11. — Lundin, Nov 30 '12 at 10:09
@nightlytrails The real problems with the unknown signedness of plain char appear when you attempt to use them in any form of arithmetic, particularly when you use them together with bitwise operators. For example: left shift on a signed char holding a negative value is undefined behavior, while left shift on an unsigned char is well-defined. — Lundin, Nov 30 '12 at 10:14
@Lundin, strictly speaking on most architectures the bit operations don't even operate on character types directly, whenever `int` is wider than the character type, first a conversion to `int` is done, then the operation is performed, and at the end the result is eventually converted back to the character type. — Jens Gustedt, Nov 30 '12 at 10:39
@JensGustedt Most 8 or 16 bit MCUs on the market have 8 bit instruction sets. It is just inconvenient for them to promote chars to ints, even though the C standard enforces it (the integer promotion rules). Such MCUs typically optimize away the whole implicit integer promotion, but while doing so they preserve any unexpected oddities caused by promotion, such as change of signedness. — Lundin, Nov 30 '12 at 10:50

Paolo Brandoli · Answer 4 · 2019-08-27T22:02:49.140

Bytes are usually intended as unsigned 8 bit wide integers.

Now, char doesn't specify the sign of the integer: on some compilers char could be signed, on other it may be unsigned.

If I add a bit shift operation to the code you wrote, the result depends on the implementation: right-shifting a negative value is implementation-defined. The added comparison will also have an unexpected result.

char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
c[0] >>= 1; // If char is signed, will the 7th bit go to 0 or stay the same?

bool isBiggerThan0 = c[0] > 0; // FALSE if char is signed!

printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);

Regarding the warning during the compilation: if the char is signed then you are trying to assign the value 0xf0, which cannot be represented in the signed char (range -128 to +127), so it will be converted to a signed value (-16).

Declaring the char as unsigned will remove the warning, and it is always good to have a clean build without any warnings.

@GabrielDevillers Thanks for spotting the error. I fixed the answer — Paolo Brandoli, Aug 27 '19 at 22:03

score 4 · Answer 5 · answered Nov 30 '12 at 09:45

The signed-ness of the plain char type is implementation defined, so unless you're actually dealing with character data (a string using the platform's character set - usually ASCII), it's usually better to specify the signed-ness explicitly by either using signed char or unsigned char.

For binary data, the best choice is most probably unsigned char, especially if bitwise operations will be performed on the data (specifically bit shifting, which doesn't behave the same for signed types as for unsigned types).

score 2 · Answer 6 · answered Nov 30 '12 at 09:44

I am asking why something which seems to be working as fine with char should be typed unsigned char?

If you do things which are not "correct" in the sense of the standard, you rely on undefined behaviour. Your compiler might do it the way you want today, but you don't know what it does tomorrow. You don't know what GCC does or VC++ 2012. Or even if the behaviour depends on external factors or Debug/Release compiles etc. As soon as you leave the safe path of the standard, you might run into trouble.

answered by Philipp

Does the standard say to use `unsigned char` for binary? — nightlytrails, Nov 30 '12 at 09:47
@nightlytrails, in its own language, yes. `unsigned char` is the only type that is guaranteed not to have padding bits and where none of the bit operations would be subject to overflow and other unpredictable behavior. — Jens Gustedt, Nov 30 '12 at 10:00

score 2 · Answer 7 · answered Nov 30 '12 at 09:46

Well, what do you call "binary data"? This is a bunch of bits, without any meaning assigned to them by that specific part of software that calls them "binary data". What's the closest primitive data type, which conveys the idea of the lack of any specific meaning to any one of these bits? I think unsigned char.

score 2 · Answer 8 · edited Feb 04 '20 at 23:16

Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers?

"really" necessary? No.

It is a very good idea though, and there are many reasons for this.

Your example uses printf, which is not type-safe. That is, printf takes its formatting cues from the format string and not from the data type. You could just as easily have tried:

printf("%s\n", (void*)c);

... and the result would have been the same. If you try the same thing with C++ iostreams, the result will be different (depending on the signed-ness of c).

What reasoning could possibly advocate the use of unsigned char instead of a plain char?

signed specifies that the most significant bit of the data (for an 8-bit char, the 8th bit) represents the sign. Since you obviously do not need that, you should specify that your data is unsigned (the "sign" bit then represents data, not the sign of the other bits).

Well, %s implies a null terminated array of plain `char` in C. This is exactly what I am implying - `char` was just fine instead of an `unsigned char`. — nightlytrails, Nov 30 '12 at 11:15
You are right - it is irrelevant what the data type is for printf, as long as the address points to the right location. As I see it, that is not a reason to use `char` instead of `unsigned char` as much as it is a reason to avoid the printf family of functions in c++. — utnapistim, Nov 30 '12 at 12:02
