30

Consider the following simple program:

#include <cstring>
#include <cstdio>
#include <cstdlib>

void replace(char *str, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (str[i] == '/') {
            str[i] = '_';
        }
    }
}

const char *global_str = "the quick brown fox jumps over the lazy dog";

int main(int argc, char **argv) {
  const char *str = argc > 1 ? argv[1] : global_str;
  replace(const_cast<char *>(str), std::strlen(str));
  puts(str);
  return EXIT_SUCCESS;
}

It takes an (optional) string on the command line and prints it, with / characters replaced by _. This replacement functionality is implemented by the c_repl function1. For example, a.out foo/bar prints:

foo_bar

Elementary stuff so far, right?

If you don't specify a string, it conveniently uses the global string the quick brown fox jumps over the lazy dog, which doesn't contain any / characters, and so doesn't undergo any replacement.

Of course, string constants are const char[], so I need to cast away the constness first - that's the const_cast you see. Since the string is never actually modified, I am under the impression this is legal.

gcc and clang compile a binary that has the expected behavior, with or without passing a string on the command line. icc crashes, when you don't provide a string, however:

icc -xcore-avx2 char_replace.cpp && ./a.out
Segmentation fault (core dumped)

The underlying cause is the main loop for c_repl which looks like this:

  400c0c:       vmovdqu ymm2,YMMWORD PTR [rsi]
  400c10:       add    rbx,0x20
  400c14:       vpcmpeqb ymm3,ymm0,ymm2
  400c18:       vpblendvb ymm4,ymm2,ymm1,ymm3
  400c1e:       vmovdqu YMMWORD PTR [rsi],ymm4
  400c22:       add    rsi,0x20
  400c26:       cmp    rbx,rcx
  400c29:       jb     400c0c <main+0xfc>

It is a vectorized loop. The basic idea is that 32 bytes are loaded, and then compared against the / character, forming a mask value with a byte set for each byte that matched, and then the existing string is blended against a vector containing 32 _ characters, effectively replacing only the / characters. Finally, the updated register is written back to the string, with the vmovdqu YMMWORD PTR [rsi],ymm4 instruction.

This final store crashes, because the string is read-only and allocated in the .rodata section of the binary, which is loaded using read-only pages. Of course, the store was a logical "no op", writing back the same characters it read, but the CPU doesn't care!

Is my code legal C++ and therefore I should blame icc for miscompiling this, or I am wading into the UB swamp somewhere?


1 The same crash from the same issue occurs with std::replace on a std::string rather than my "C-like" code, but I wanted to simplify the analysis as much as possible and make it entirely self-contained.

BeeOnRope
  • 51,419
  • 13
  • 149
  • 309
  • 1
    Hm, I don't see any UB here. Hopefully the comment-answers become actual answers... – Baum mit Augen Feb 04 '19 at 22:06
  • 6
    But...the string literal is *not* modified, because it doesn't contain a `/` character, and all modifications are predicated on the presence of a `/` character. This really turns on one's interpretation of "never actually modified". The optimizer assumes it's safe to do a *logical* no-op on the string, but it's actually not in this case. Fascinating question; I'm eager to see what the answers will say. – Cody Gray Feb 04 '19 at 22:07
  • "Is my code legal C++" -- sure, legal in the sense that you can have code that sets a pointer to NULL and then treats it as an array. Legal to compile, undefined to execute. C++ allows you to pass const to a non-const function, or to cast an int to a float, or all kinds of other stack-smashing and exception generating behavior. You're assuming the compiler is smart enough to figure out that it can do nothing in your non-const function, but that isn't guaranteed. It's not a promise of the language to save you from yourself here. – Dave S Feb 04 '19 at 22:10
  • 2
    @DaveS How is this UB? Please answer in the answer section so we can vote on your post properly. – Baum mit Augen Feb 04 '19 at 22:12
  • I argue it's UB because you're passing a const string to a non-const function that includes the assignment operator. We humans know that there is no need to execute the optimized code for the function, the compiler wasn't clever enough to figure that out. – Dave S Feb 04 '19 at 22:16
  • 2
    Is this not the same question the one you're linking to (i.e. [this one](https://stackoverflow.com/q/54504247/2069064))? Same thing right - you have a declared `const` object, that we're passing to a handle to a function that _might_ modify it but doesn't actually? – Barry Feb 04 '19 at 22:17
  • 6
    @DaveS Your claim that the mere presence of an assignment in a dead code path already invokes UB needs justification. – Baum mit Augen Feb 04 '19 at 22:17
  • 1
    I think the question is broader - regardless of constness, is the compiler allowed to write to memory when no write is expected? – rustyx Feb 04 '19 at 22:22
  • @rustyx: In general inventing writes (non-atomic load and then storing back the same value) is a huge no-no for compilers. But that's more obviously a problem where the C++ reads one array and conditionally writes to another. If the C++ source *reads* every character, it would be UB for another thread to be writing it simultaneously. As long as this load/blendv/store doesn't go past the end of pointer+length (and non-atomically read/rewrite some memory that might be a lock or something), this might be legal from a C++11 memory model thread-safety POV. – Peter Cordes Feb 04 '19 at 22:45
  • @DaveS - to be very explicit, when I say "legal" I mean to compile and execute, with and without an additional argument. I think that's implies by the question which talks largely about the runtime behavior, but in any case, there's your clarification. – BeeOnRope Feb 04 '19 at 23:07
  • 2
    @Peter, I had the same thought, which is why I shaped this example around the const string and .rodata section rather than an argument about thread safety. It also has the advantage that the problem is obvious: it crashes always, rather than a more abstract problem related to race conditions which may be hard to illustrate. That said, one may still be able to find a way to modify the example such that there is a memory model problem also. – BeeOnRope Feb 04 '19 at 23:11
  • To echo Barry, shouldn't you just file a bug report after your previous question? – Passer By Feb 05 '19 at 01:02
  • 1
    @PasserBy - well when I wrote that other question I wasn't aware of the behavior of icc: I was interested specifically in the `const_cast` question (it turns out _that_ question might be a duplicate though). Today someone pointed out to me that icc does in fact vectorize this type of code, and so I created this example which crashes: now my question is about this specific code and the validity of the optimization and whether the crash is allowed. Yes, the `const_cast` aspect is part of it, which is why I linked the other question (to stave off all the "you can't do that with a const comments). – BeeOnRope Feb 05 '19 at 03:20
  • This is about icc's behavior specifically in this scenario: maybe there is some other reason their optimization is allowed? About filing a bug though, is that even possible? icc is a closed-source commercial product, and I don't have any kind of support contract. It doesn't seem like there is any [obvious way](https://software.intel.com/en-us/forums/intel-c-compiler/topic/753992) for the public to file bugs. I have limited interest in icc in any case: this was more of a "oh, can the compiler do that, or is it being naughty?" type of thing. – BeeOnRope Feb 05 '19 at 03:22
  • 3
    @PeterCordes - even when the arrays are distinct, icc [still writes to the destination array](https://godbolt.org/z/UaHj4J). This just seems totally broken, not only from the memory model point of view, but what I pass in a `nullptr` for the second or array, or a shorter array or whatever? Just seems like this blend-based vectorization is broken. – BeeOnRope Feb 05 '19 at 03:53
  • 2
    Ok, yes, that's just 100% broken. It's not easy to come up with a case where this breaks something that someone might actually have written in a program intended to be useful, but it's very easy to come up with scenarios where you can legally run this function with a str1 that never matches, and ICC breaks things. Besides `nullptr`, you could pass a pointer to a `std::atomic` or a mutex where a non-atomic read/rewrite breaks things by inventing writes. re: reporting: google for ICC bug report finds their forum, e.g.: https://software.intel.com/en-us/forums/intel-c-compiler/topic/753767 – Peter Cordes Feb 05 '19 at 04:25
  • 4
    For future readers: If you want to let compilers auto-vectorize this way, you can write source like `str2[i] = x ? replacement : str2[i];` that always writes the string. In theory an optimizing compiler can turn it into a conditional branch in the scalar cleanup or whatever to avoid dirtying memory unnecessarily. (Or if targeting an ISA like ARM32 where a predicated store is possible, instead of only ALU select operations. Or x86 with AVX512 masked stores, where this really *would* be safe.) – Peter Cordes Feb 05 '19 at 04:30
  • 4
    Intel likes to speculate too much. – Language Lawyer Feb 05 '19 at 04:50
  • @Peter Indeed both gcc and clang do vectorize the loops if you write it with an unconditional store with conditional data. However I've seen lots of other cases where they don't vectorize even if you try to teach them that the memory is accessed, so it doesn't always work. AVX2 has masked loads and stores but only with 4 and 8 byte granularity (VPMASKMOV). – BeeOnRope Feb 05 '19 at 05:28
  • does something change if you use static or inline on the function? – phön Feb 05 '19 at 08:40
  • @phön - in fact the function is inlined in my example. The assembly excerpt I show is from the inlined body inside main(), not from the standalone free function. The stuff with argv is needed because of this inlining, otherwise the compiler omits the replace code entirely. – BeeOnRope Feb 05 '19 at 12:05
  • Related question: [What does the C++ compiler do to ensure that different but adjacent memory locations are safe to be used on different threads?](https://stackoverflow.com/questions/53376806/what-does-the-c-compiler-do-to-ensure-that-different-but-adjacent-memory-locat/53378346?noredirect=1#comment93738283_53378346) – see my answer to it in particular, which partially addresses, as you put it, "invented writes". (Note that in this case, though, the compiler knows that `c` cannot be a `const_cast`ed `const` object, and some optimizations are hence possible in a single-threaded MM.) – Arne Vogel Feb 05 '19 at 12:24

1 Answers1

25

Your program is well-formed and free of undefined behaviour, as far as I can tell. The C++ abstract machine never actually assigns to a const object. A not-taken if() is sufficient to "hide" / "protect" things that would be UB if they executed. The only thing an if(false) can't save you from is an ill-formed program, e.g. syntax errors or trying to use extensions that don't exist on this compiler or target arch.

Compilers aren't in general allowed to invent writes with if-conversion to branchless code.

Casting away const is legal, as long as you don't actually assign through it. e.g. for passing a pointer to a function that isn't const-correct, and takes a read-only input with a non-const pointer. The answer you linked on Is it allowed to cast away const on a const-defined object as long as it is not actually modified? is correct.


ICC's behaviour here is not evidence for UB in ISO C++ or C. I think your reasoning is sound, and this is well-defined. You've found an ICC bug. If anyone cares, report it on their forums: https://software.intel.com/en-us/forums/intel-c-compiler. Existing bug reports in that section of their forum have been accepted by developers, e.g. this one.


We can construct an example where it auto-vectorizes the same way (with unconditional and non-atomic read/maybe-modify/rewrite) where it's clearly illegal, because the read / rewrite is happening on a 2nd string that the C abstract machine doesn't even read.

Thus, we can't trust ICC's code-gen to tell us anything about when we've caused UB, because it will make crashing code even in clearly legal cases.

Godbolt: ICC19.0.1 -O2 -march=skylake (Older ICC only understood options like -xcore-avx2, but modern ICC understands the same -march as GCC/clang.)

#include <stddef.h>

void replace(const char *str1, char *str2, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (str1[i] == '/') {
            str2[i] = '_';
        }
    }
}

It checks for overlap between str1[0..len-1] and str2[0..len-1], but for large enough len and no overlap it will use this inner loop:

..B1.15:                        # Preds ..B1.15 ..B1.14                //do{
        vmovdqu   ymm2, YMMWORD PTR [rsi+r8]                    #6.13   // load from str2
        vpcmpeqb  ymm3, ymm0, YMMWORD PTR [rdi+r8]              #5.24   // compare vs. str1
        vpblendvb ymm4, ymm2, ymm1, ymm3                        #6.13   // blend
        vmovdqu   YMMWORD PTR [r8+rsi], ymm4                    #6.13   // store to str2
        add       r8, 32                                        #4.5    // i+=32
        cmp       r8, rax                                       #4.5
        jb        ..B1.15       # Prob 82%                      #4.5   // }while(i<len);

For thread-safety, it's well known that inventing write via non-atomic read/rewrite is unsafe.

The C++ abstract machine never touches str2 at all, so that invalidates any arguments for the one-string version about data-race UB being impossible because reading str at the same time another thread is writing it was already UB. Even C++20 std::atomic_ref doesn't change that, because we're reading through a non-atomic pointer.

But even worse than that, str2 can be nullptr. Or pointing to close to the end of an object (which happens to be stored near the end of a page), with str1 containing chars such that no writes past the end of str2 / the page will happen. We could even arrange for only the very last byte (str2[len-1]) to be in a new page, so that it's one-past-the-end of a valid object. It's even legal to construct such a pointer (as long as you don't deref). But it would be legal to pass str2=nullptr; code behind an if() that doesn't run doesn't cause UB.

Or another thread is running the same search/replace function in parallel, with a different key/replacement that will only write different elements of str2. The non-atomic load/store of unmodified values will step on modified values from the other thread. It's definitely allowed, according to the C++11 memory model, for different threads to simultaneously touch different elements of the same array. C++ memory model and race conditions on char arrays. (This is why char must be as large as the smallest unit of memory the target machine can write without a non-atomic RMW. An internal atomic RMW for a byte stores into cache is fine, though, and doesn't stop byte-store instructions from being useful.)

(This example is only legal with the separate str1/str2 version, because reading every element means the threads would be reading array elements the other thread could be in the middle of writing, which is data-race UB.)

As Herb Sutter mentioned in atomic<> Weapons: The C++ Memory Model and Modern Hardware Part 2: Restrictions on compilers and hardware (incl. common bugs); code generation and performance on x86/x64, IA64, POWER, ARM, and more; relaxed atomics; volatile: weeding out non-atomic RMW code-gen has been an ongoing issue for compilers after C++11 was standardized. We're most of the way there, but highly-aggressive and less-mainstream compilers like ICC clearly still have bugs.

(However, I'm pretty confident that Intel compiler devs would consider this a bug.)


Some less-plausible (to see in a real program) examples that this would also break:

Besides nullptr, you could pass a pointer to (an array of) std::atomic<T> or a mutex where a non-atomic read/rewrite breaks things by inventing writes. (char* can alias anything).

Or str2 points to a buffer that you've carved up for dynamic allocation, and the early part of str1 will have some matches, but later parts of str1 won't have any matches, and that part of str2 is being used by other threads. (And for some reason you can't easily calculate a length that stops the loop short).


For future readers: If you want to let compilers auto-vectorize this way:

You can write source like str2[i] = x ? replacement : str2[i]; that always writes the string in the C++ abstract machine. IIRC, that lets gcc/clang vectorize the way ICC does after doing its unsafe if-conversion to blend.

In theory an optimizing compiler can turn it back into a conditional branch in the scalar cleanup or whatever to avoid dirtying memory unnecessarily. (Or if targeting an ISA like ARM32 where a predicated store is possible, instead of only ALU select operations like x86 cmov, PowerPC isel, or AArch64 csel. ARM32 predicated instructions are architecturally a NOP if the predicate is false).

Or if an x86 compiler chose to use AVX512 masked stores, that would also make it safe to vectorize the way ICC does: masked stores do fault suppression, and never actually store to elements where the mask is false. (When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?).

vpcmpeqb k1, zmm0, [rdi]   ; compare from memory into mask
vmovdqu8 [rsi]{k1}, zmm1   ; masked store that only writes elements where the mask is true

ICC19 actually does basically this (but with indexed addressing modes) with -march=skylake-avx512. But with ymm vectors because 512-bit lowers max turbo too much to be worth it unless your whole program is heavily using AVX512, on Skylake Xeons anyway.

So I think ICC19 is safe when vectorizing this with AVX512, but not AVX2. Unless there are problems in its cleanup code where it does something more complicated with vpcmpuq and kshift / kor, a zero-masked load, and a masked compare into another mask reg.


AVX1 has masked stores (vmaskmovps/pd) with fault-suppression and everything, but until AVX512BW there's no granularity narrower than 32 bits. The AVX2 integer versions are only available in dword/qword granularity, vpmaskmovd/q.

Peter Cordes
  • 245,674
  • 35
  • 423
  • 606
  • 1
    I don't think you even need to use the char aliasing `std::atomic` objects to come up with a memory model problem: imagine the scenario where some thread T2 has properly synchronized exclusive access to str2 (eg because it was published from T1 to T2 via some mechanism), and is doing writes. Meanwhile T1 is running the replace function with str2, but this is allowed because you've arranged that no accessed occur: but the invented writes clobber stuff T2 is doing! – BeeOnRope Feb 05 '19 at 06:08
  • Alternately consider the case where T1 and T2 share some char array, but by convention T1 only ever accesses odd elements and T2 even. This is allowed because array elements are separate objects per the memory model (eg this means byte writes can't use a RMW of a larger word). Now either or both of T1/2 use the shared array as str2 organizing via str1 to only access "their" elements of str2. Seems legal. It will blow up on occasion tho because the merge is effectively a wide RMW which will clobber adjacent elements. No std::atomic aliasing through a char needed! – BeeOnRope Feb 05 '19 at 06:12
  • @BeeOnRope: oh yes, duh, another thread could simply be running the same search/replace with a key that's known to write a disjoint set of elements. Forgot about that case. – Peter Cordes Feb 05 '19 at 07:14
  • This a good answer, but I'm having trouble accepting it because I feel it lacks a clear answer to the headline question. Maybe a couple sentences at the start to answer the question? The first sentence is a bit confusing to me, maybe it could be clarified. I guess you are saying something like "The way ICC compiles this code is not evidence for UB *in your code* in C++...", right? – BeeOnRope Feb 18 '19 at 16:10
  • @BeeOnRope: good point. Added a section at the start. It might possibly be an overstatement that an `if(false)` can save you from everything except ill-formed code, but I can't think of a counterexample. – Peter Cordes Feb 18 '19 at 18:39