Binary data as command line argument

Question

I have a simple C++ program (and a similar one for C) that just prints out the first argument:

#include <iostream>

int main(int argc, char** argv)
{
    if(argc > 1)
        std::cout << ">>" << argv[1] << "<<\n";
}

I can pass binary data (I have tried this in bash) as an argument like this:

$ ./a.out $(printf "1\x0123")
  >>1?23<<

If I try to pass a null there, I get:

$ ./a.out $(printf "1\x0023")
bash: warning: command substitution: ignored null byte in input
>>123<<

Clearly bash(?) does not allow this

But is it possible to send a null as a command line argument this way? Does either C or C++ put any restrictions on this?

Edit: I am not using this in day-to-day C++; this question is just out of curiosity.

Even if there is none, it will consider as string only the part up to the null character. — Eugene Sh., Jun 06 '18 at 16:54
The command line arguments are strings, terminated by a null byte. Trying to pass binary data with embedded nulls fails because of that — the data for each argument is deemed terminated by the first null byte. You have to accept that, and redesign your interface so null bytes are not needed or are specially encoded (and what else then needs encoding). Or, if you're implementing your own o/s, you can go with a non-POSIX approach to handling command line arguments. — Jonathan Leffler, Jun 06 '18 at 16:54
That's not passing binary data. That's passing a string representation of a hex number. `printf()` is not even interpreting it correctly, as `0x123` != `123`. And what you're passing to bash is not a valid anything. — Ken White, Jun 06 '18 at 16:55
It is not a good idea to pass binary data this way. Pass it through standard input or a file. — Slava, Jun 06 '18 at 16:56
@ken, `printf "1\x0123" | od -c` shows 4 characters: `1 001 2 3` — glenn jackman, Jun 06 '18 at 16:57
@JonathanLeffler I think there is no prevention from C side if the OS is deciding to pass some data with zeros as the `argv` to `main`. — Eugene Sh., Jun 06 '18 at 16:59
@KenWhite Seems like SO removed the character when I copy-pasted — tejas, Jun 06 '18 at 16:59
@KenWhite: The poster's output does not reliably show which bytes are printed. The variant with the embedded null byte gets called out; the variant without the null byte probably has an invisible control-A (`'\x01'`) in the string. — Jonathan Leffler, Jun 06 '18 at 16:59
@EugeneSh.: How does that work? [C11 §5.1.2.2.1 Program startup](https://port70.net/~nsz/c/c11/n1570.html#5.1.2.2.1) says: _If the value of `argc` is greater than zero, the array members `argv[0]` through `argv[argc-1]` inclusive shall contain pointers to strings, which are given implementation-defined values by the host environment prior to program startup._ and [C11 §7.1.1](https://port70.net/~nsz/c/c11/n1570.html#7.1.1) defines a string as _A string is a contiguous sequence of characters terminated by and including the first null character._ That doesn't leave much wriggle room. — Jonathan Leffler, Jun 06 '18 at 17:02
@JonathanLeffler Right. So I guess there is a prevention. But what if the hosting environment is putting some defined stuff after the terminating zero? I guess then C compiler will be free to assume it is not there... — Eugene Sh., Jun 06 '18 at 17:04
However I'd expect that the output would be just "1", not "123". Clearly the "\x00" is not passed down the params. However `printf "1\x0023" | od -c` works - prints the zero character. Interesting :) — axalis, Jun 06 '18 at 17:07
Yep, true. So it seems that it is not possible directly with params, but you can read the binary string from stdin (or from a "regular" file indeed). — axalis, Jun 06 '18 at 17:15
You *might* be able to do this sort of thing by calling `execl` or `execv`. (But it probably wouldn't work, if the kernel has occasion to recopy the argument list, which it probably does.) — Steve Summit, Jun 06 '18 at 17:37

Jonathan Leffler · Accepted Answer · 2018-06-06T18:35:35.093

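This answer is written in C. It can also be compiled as C++ and works the same in both, although the variable-length array in the example program below is only a compiler extension in C++ (GCC and Clang accept it). I quote from the C11 standard; there are equivalent definitions in the C++ standards.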

There isn't a good way to pass null bytes to a program's arguments

C11 §5.1.2.2.1 Program startup:
If the value of argc is greater than zero, the array members argv[0] through argv[argc-1] inclusive shall contain pointers to strings, which are given implementation-defined values by the host environment prior to program startup.

C11 §7.1.1 Definitions of terms
A string is a contiguous sequence of characters terminated by and including the first null character.

That means that each argument passed to main() in argv is a null-terminated string. There is no reliable data after the null byte at the end of the string — searching there would be accessing out of bounds of the string.

So, as noted at length in the comments to the question, it is not possible in the ordinary course of events to get null bytes to a program via the argument list because null bytes are interpreted as being the end of each argument.
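
What the callee can portably see of one argument can be sketched in C. This is only an illustration: the helper name `visible_arg_length` is made up here, and it works on a buffer supplied by the caller rather than on a real `argv` entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the answer): given the raw bytes a caller
 * would like to pass as ONE argument, return how many of them the callee
 * can portably see, i.e. the bytes before the first null byte. */
size_t visible_arg_length(const char *raw, size_t raw_size)
{
    assert(raw != NULL);
    /* memchr looks for the first null byte within raw_size bytes. */
    const char *nul = memchr(raw, '\0', raw_size);
    return (nul != NULL) ? (size_t)(nul - raw) : raw_size;
}
```

With the four bytes `31 01 32 33` the helper reports 4, because there is no null byte to end the string early. With `31 00 32 33` it reports 1, because everything after the null is outside the string as the standard defines it.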

By special agreement

That doesn't leave much wriggle room. However, if both the calling/invoking program and the called/invoked program agree on the convention, then, even with the limitations imposed by the standards, you can pass arbitrary binary data, including arbitrary sequences of null bytes, to the invoked program — up to the limits on the length of an argument list imposed by the implementation.

The convention has to be along the lines of:

All arguments (except argv[0], which is ignored, and the last argument, argv[argc-1]) consist of a stream of non-null bytes followed by a null.
If you need adjacent nulls, you have to provide empty arguments on the command line.
If you need trailing nulls, you have to provide empty arguments as the last arguments on the command line.

This could lead to a program such as (null19.c):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void hex_dump(const char *tag, size_t size, const char *buffer);

int main(int argc, char **argv)
{
    if (argc < 2)
    {
        fprintf(stderr, "Usage: %s arg1 [arg2 '' arg4 ...]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

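    /* Total size: each argument contributes its bytes plus its terminating null */
    size_t len_args = 0;
    for (int i = 1; i < argc; i++)
        len_args += strlen(argv[i]) + 1;

    char buffer[len_args];

    /* Concatenate the arguments, keeping each terminating null as data */
    size_t offset = 0;
    for (int i = 1; i < argc; i++)
    {
        size_t arglen = strlen(argv[i]) + 1;
        memmove(buffer + offset, argv[i], arglen);
        offset += arglen;
    }
    /* The terminator of the last argument is not part of the data */
    assert(offset != 0);
    offset--;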

    hex_dump("Argument list", offset, buffer);
    return 0;
}

static inline size_t min_size(size_t x, size_t y) { return (x < y) ? x : y; }

static void hex_dump(const char *tag, size_t size, const char *buffer)
{
    printf("%s (%zu):\n", tag, size);
    size_t offset = 0;
    while (size != 0)
    {
        printf("0x%.4zX:", offset);
        size_t count = min_size(16, size);
        for (size_t i = 0; i < count; i++)
            printf(" %.2X", buffer[offset + i] & 0xFF);
        putchar('\n');
        size -= count;
        offset += count;
    }
}

This could be invoked using:

$ ./null19 '1234' '5678' '' '' '' '' 'def0' ''
Argument list (19):
0x0000: 31 32 33 34 00 35 36 37 38 00 00 00 00 00 64 65
0x0010: 66 30 00
$

The first argument is deemed to consist of 5 bytes — four digits and a null byte. The second is similar. The third through sixth arguments each represent a single null byte (it gets painful if you need large numbers of contiguous null bytes), then there is another string of five bytes (three letters, one digit, one null byte). The last argument is empty but ensures that there is a null byte at the end. If omitted, the output would not include that final terminal null byte.

$ ./null19 '1234' '5678' '' '' '' '' 'def0' 
Argument list (18):
0x0000: 31 32 33 34 00 35 36 37 38 00 00 00 00 00 64 65
0x0010: 66 30
$

This is the same as before except there is no trailing null byte in the data. The two examples in the question are easily handled:

$ ./null19 $(printf "1\x0123")
Argument list (4):
0x0000: 31 01 32 33
$ ./null19 1 23
Argument list (4):
0x0000: 31 00 32 33
$

This works strictly within the standard assuming only that empty strings are recognized as valid arguments. In practice, those arguments are already contiguous in memory so it might be possible on many platforms to avoid the copying phase into the buffer. However, the standard does not stipulate that the argument strings are laid out contiguously in memory.
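
The calling side of the convention, turning a binary buffer into a list of arguments, might look like the sketch below. The function name `split_for_argv` and its interface are assumptions made for illustration. It also assumes the caller's buffer has one extra null byte after the data, so that the last argument is a valid string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical caller-side helper (not from the answer): split `size` bytes
 * of `data` into arguments under the convention described above. Every null
 * byte in the data ends one argument, so adjacent nulls yield empty
 * arguments and a trailing null yields a final empty argument.
 *
 * Precondition: data[size] exists and is '\0'.
 * Returns a NULL-terminated array of pointers INTO data (nothing is copied),
 * or NULL if allocation fails. The caller frees only the array. */
char **split_for_argv(char *data, size_t size, size_t *count)
{
    assert(data != NULL && count != NULL && data[size] == '\0');

    /* One argument per embedded null byte, plus the final segment. */
    size_t nargs = 1;
    for (size_t i = 0; i < size; i++)
        if (data[i] == '\0')
            nargs++;

    char **args = malloc((nargs + 1) * sizeof(*args));
    if (args == NULL)
        return NULL;

    size_t n = 0;
    args[n++] = data;
    for (size_t i = 0; i < size; i++)
        if (data[i] == '\0')
            args[n++] = data + i + 1;   /* next argument starts after the null */
    args[n] = NULL;

    *count = nargs;
    return args;
}
```

For the 19-byte example above this produces the eight arguments `1234`, `5678`, four empty strings, `def0`, and a final empty string. A caller would still have to place a program name in front of the list before handing it to something like `execv`.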

If you need multiple arguments with binary data, you can modify the convention. For example, you could take a control argument of a string which indicates how many subsequent physical arguments make up one logical binary argument.
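
One possible shape for such a control argument is sketched below. The layout is an assumption, not something the convention above fixes: a decimal count N, followed by N physical arguments that are joined with null bytes into one logical binary argument. The function name `join_counted_group` is made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: argv[pos] holds a decimal count N, and the next N
 * physical arguments form one logical binary argument, joined by null bytes.
 * On success, stores a malloc'd buffer in *out and its length in *out_len,
 * and returns the index of the first argument after the group.
 * Returns 0 on a malformed count, too few arguments, or allocation failure. */
int join_counted_group(int argc, char **argv, int pos,
                       char **out, size_t *out_len)
{
    assert(argv != NULL && out != NULL && out_len != NULL);
    if (pos < 1 || pos >= argc)
        return 0;

    char *end;
    long n = strtol(argv[pos], &end, 10);
    if (end == argv[pos] || *end != '\0' || n < 1 || n > argc - pos - 1)
        return 0;

    /* Each physical argument contributes its bytes plus one null byte. */
    size_t total = 0;
    for (long i = 1; i <= n; i++)
        total += strlen(argv[pos + i]) + 1;

    char *buf = malloc(total);
    if (buf == NULL)
        return 0;

    size_t off = 0;
    for (long i = 1; i <= n; i++)
    {
        size_t len = strlen(argv[pos + i]) + 1;
        memcpy(buf + off, argv[pos + i], len);
        off += len;
    }

    *out = buf;
    *out_len = total - 1;   /* the last terminator is not part of the data */
    return pos + (int)n + 1;
}
```

Under this assumed layout, a command line such as `prog 2 1 23 3 a '' ''` would carry two logical arguments: the four bytes `31 00 32 33` and the three bytes `61 00 00`.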

All this relies on the programs interpreting the argument list as agreed. It is not really a general solution.

There are certainly utilities which will interpret an empty option argument as though it were a NUL byte (such as the `-d` options for Gnu `cut` and bash's `read`). But in general, it seems to me that if you are defining a mechanism such as the one you describe, implementing standard escape codes would be a lot cleaner, less error-prone, and not that much more difficult to write. — rici, Jun 06 '18 at 19:17
I agree that this is not a good way to do it — the 'by special arrangement' means that only utilities that know about the scheme can use it. And yes, for the most part, handling some sort of escape scheme would be simpler, whether that's as long-winded as "string of pairs of hex digits" or "use `\037` for octal; use `\x1F` for hexadecimal; use ```\\``` for a backslash (…and what else?…)". The use of backslashes is ubiquitous on Unix, of course, but doesn't necessarily work well with the shell (or, more accurately, requires you to be fully aware of which code is interpreting what and how). — Jonathan Leffler, Jun 06 '18 at 19:23
Backslashes are not special inside single quotes, and c-style escapes are not special inside double quotes, all of which facilitates their use (for example in the printf format strings in the OP). There are confusing cases, but for the most part people don't think too hard before writing `printf "%s\n"`. You do sometimes have to use four backslashes, which I guess speaks to your point. You could always use URL-style %-encoding, arguably even simpler but less ubiquitous. — rici, Jun 06 '18 at 19:40
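
The simplest of the escape schemes mentioned in these comments, a string of pairs of hex digits, can be decoded with something like the following sketch. The function names are made up for illustration, and the error convention (returning `(size_t)-1`) is just one possible choice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical decoder: turn an argument made of hex digit pairs, such as
 * "31003233", into raw bytes. Returns the number of bytes written to out,
 * or (size_t)-1 if the argument has odd length, contains a non-hex
 * character, or does not fit in out_size bytes. */
size_t decode_hex_arg(const char *arg, unsigned char *out, size_t out_size)
{
    assert(arg != NULL && out != NULL);

    size_t len = strlen(arg);
    if (len % 2 != 0 || len / 2 > out_size)
        return (size_t)-1;

    for (size_t i = 0; i < len; i += 2)
    {
        int hi = hex_value(arg[i]);
        int lo = hex_value(arg[i + 1]);
        if (hi < 0 || lo < 0)
            return (size_t)-1;
        out[i / 2] = (unsigned char)(hi * 16 + lo);
    }
    return len / 2;
}
```

An argument of `31003233` decodes to the four bytes `31 00 32 33`. Because the encoded form contains only hex digits, it needs no quoting in the shell and never contains a null byte itself.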
