9

I wrote this simple Java program:

package com.salil.threads;

public class IncrementClass {

    static volatile int j = 0;
    static int i = 0;

    public static void main(String args[]) {

        for(int a=0;a<1000000;a++);
        i++;
        j++;            
    }       
}

This generates the following disassembly for i++ and j++ (the rest of the disassembled code is omitted):

  0x0000000002961a6c: 49ba98e8d0d507000000 mov       r10,7d5d0e898h
                                                ;   {oop(a 'java/lang/Class' = 'com/salil/threads/IncrementClass')}
  0x0000000002961a76: 41ff4274            inc       dword ptr [r10+74h]
                                                ;*if_icmpge
                                                ; - com.salil.threads.IncrementClass::main@5 (line 10)
  0x0000000002961a7a: 458b5a70            mov       r11d,dword ptr [r10+70h]
  0x0000000002961a7e: 41ffc3              inc       r11d
  0x0000000002961a81: 45895a70            mov       dword ptr [r10+70h],r11d
  0x0000000002961a85: f083042400          lock add  dword ptr [rsp],0h
                                                ;*putstatic j
                                                ; - com.salil.threads.IncrementClass::main@27 (line 14)

This is what I understand about the following assembly code:

  • mov r10,7d5d0e898h : moves the pointer to IncrementClass.class into register r10
  • inc dword ptr [r10+74h] : increments the 4-byte value at the address [r10 + 74h] (i.e. i)
  • mov r11d,dword ptr [r10+70h] : moves the 4-byte value at the address [r10 + 70h] into register r11d (i.e. moves the value of j into r11d)
  • inc r11d : increments r11d
  • mov dword ptr [r10+70h],r11d : writes the value of r11d back to [r10 + 70h] so it is visible to other threads
  • lock add dword ptr [rsp],0h : locks the memory address pointed to by the stack pointer rsp and adds 0 to it

JMM states that before each volatile read there must be a load memory barrier and after every volatile write there must be a store barrier. My question is:

  1. Why isn't there a load barrier before the read of j into r11d?
  2. How does the lock add to [rsp] ensure the value of j written from r11d is propagated back to main memory? All I have read in the Intel specs is that lock gives the CPU an exclusive lock on the specified memory address for the duration of the operation.
Salil Surendran
  • That code is super-bad. `lock inc dword [r10+70h]` would do everything that load/inc/store/full-barrier does, and more (i.e. actually be atomic). It would be at least as fast, and many fewer code bytes. `lock add [rsp], 0` is a full-barrier because every `lock`ed instruction is. There's debate about whether MFENCE or an otherwise no-op locked insn to stack memory (which should be in the E state in L1 already) is better. MFENCE has worse throughput, but fewer uops so maybe less impact on surrounding instructions when a chain of MFENCE isn't *all* you're doing. – Peter Cordes Feb 09 '16 at 15:08
  • `mov r10, imm64` is also suspicious. That's inside the loop??? Is this *optimized* code from a JIT? Is `inc r11d` the loop counter, or is that at least kept in a register? – Peter Cordes Feb 09 '16 at 15:10
  • @SalilSurendran I know this is old but shouldn't the statement be *after each volatile read there must be a load memory barrier*? after the read, not before – Eugene Dec 22 '16 at 11:07

3 Answers

7

Intel x86 processors have a strong memory model.

Therefore the StoreStore, LoadLoad, and LoadStore barriers are no-ops on x86. The exception is StoreLoad, which can be realized via mfence, cpuid, or a locked instruction; that is exactly what you can see in your assembly. The other barriers only restrict compiler optimizations and transformations so that they don't break the Java memory model spec.

Since you ran on an Intel processor, I am assuming it is x86.
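For reference, since Java 9 these barrier categories are exposed directly at the Java level through VarHandle's explicit fence methods. A minimal sketch of the mapping (the class name FenceDemo is just illustrative; the per-fence x86 costs in the comments follow the cookbook linked below):

```java
import java.lang.invoke.VarHandle;

public class FenceDemo {
    public static void main(String[] args) {
        VarHandle.loadLoadFence();   // LoadLoad   - no-op on x86, only constrains the compiler
        VarHandle.storeStoreFence(); // StoreStore - no-op on x86
        VarHandle.acquireFence();    // LoadLoad|LoadStore   - no-op on x86
        VarHandle.releaseFence();    // LoadStore|StoreStore - no-op on x86
        VarHandle.fullFence();       // includes StoreLoad - the only one that costs an
                                     // actual instruction on x86 (lock add [rsp],0 or mfence)
        System.out.println("all fences issued");
    }
}
```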

Please read

  1. http://gee.cs.oswego.edu/dl/jmm/cookbook.html for reference.

  2. http://psy-lob-saw.blogspot.com/2013/08/memory-barriers-are-not-free.html

  3. http://jsr166-concurrency.10961.n7.nabble.com/x86-NOOP-memory-barriers-td9991.html

lock is not an instruction but an instruction prefix (it behaves as a StoreLoad barrier).

  1. What does the "lock" instruction mean in x86 assembly?
  2. Why we need lock prefix before CMPXCHG
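A locked read-modify-write instruction would also make the increment itself atomic, not just act as a barrier. The Java-level way to get that is AtomicInteger: on HotSpot/x86, getAndIncrement is typically JIT-compiled to a single lock xadd, so the lock prefix supplies both the atomicity and the StoreLoad barrier in one instruction. A sketch (class name illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LockedIncrement {
    public static void main(String[] args) {
        AtomicInteger j = new AtomicInteger(0);
        // On HotSpot/x86 this typically compiles to `lock xadd`: one instruction
        // that is atomic AND acts as a full (StoreLoad) barrier, so no separate
        // `lock add [rsp],0` is needed.
        int before = j.getAndIncrement();
        System.out.println(before + " " + j.get()); // prints: 0 1
    }
}
```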
veritas
  • I had read most of the articles above before posting. Why is i++ not atomic or thread safe? If you look at the instruction incrementing i, "inc dword ptr [r10+74h]", this should directly write to memory and every other thread should be able to see this value. From what I understand, when the CPU writes to memory as above the value is cached in the cache line and doesn't go all the way to memory, so an explicit instruction is needed for it to be written to memory. Which I believe is the LOCK statement, but how does a LOCK on the stack pointer ensure the value in the cache gets written to memory? – Salil Surendran Jan 05 '15 at 09:16
  • please have a look at this answer: What does the "lock" instruction mean in x86 assembly?. It's very clear. The inc instruction is not an atomic read-modify-write operation on its own (it just increments the value); if prefixed by lock it surely would have been. But the compiler writers realized the StoreLoad barrier via a lock add of 0 to get that effect. Read the answer, it's pretty clear – veritas Jan 05 '15 at 18:03
  • 1
    Ok. I believe that the any instruction prefixed by a LOCK statement acts as a StoreLoad barrier since it prevents reordering of loads before stores. Cache coherence mechanism of x86 takes care of the fact that all CPUs see this value that has been written to memory. Is my understanding correct? – Salil Surendran Jan 08 '15 at 20:50
  • yes see for C++ similar mapping http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html – veritas Jan 08 '15 at 20:59
  • @veritas does this comes from the fact that Intel has this rule: *reads are not re-ordered with any reads*? so a volatile read (or any other) will not be re-ordered – Eugene Dec 22 '16 at 11:10
1

The volatile keyword in Java only guarantees that thread-local copies and caches are skipped and that the value is loaded directly from, or written directly to, main memory. However, it contains no locking mechanism. Thus a single read of a volatile, or a single write to it, is atomic, but a read followed by a write, like your

j++

is NOT atomic, because some other thread can modify the value of j between the read and the write back to main memory. To achieve an atomic increment you need a CAS operation, which is wrapped in the Atomic classes in Java, like AtomicInteger etc. Alternatively, if you prefer low-level programming, you can use the atomic methods in the Unsafe class, e.g. Unsafe.compareAndSwapInt.
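The lost-update problem described above is easy to demonstrate. In this sketch (class name and iteration counts are arbitrary), two threads bump both a volatile int and an AtomicInteger; the atomic counter always reaches the full total, while the volatile one usually comes up short:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LostUpdateDemo {
    static volatile int j = 0;                  // visible, but j++ is still load-inc-store
    static final AtomicInteger safe = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int n = 0; n < 100_000; n++) {
                j++;                            // another thread may write j between our load and store
                safe.incrementAndGet();         // CAS / lock xadd: no update is ever lost
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(safe.get());         // always 200000
        System.out.println(j <= 200_000);       // true; usually j < 200000 under contention
    }
}
```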

Alex Suo
  • The argument about "thread local copies and caches" is backed by [Wikipedia: volatile (computer programming)](http://en.wikipedia.org/wiki/Volatile_%28computer_programming%29). Tracing the original meaning back to the `C` language and interrupt handlers can explain also the semantics borrowed and implemented in the `Java` language – xmojmr Jan 05 '15 at 06:10
  • 1
    "volatile keyword in Java only guarantee that the thread local copies and caches would be skipped and the value would be loaded directly from main memory or write to main memory" - the jmm doesn't say anything like this and no production level JVM implements volatile like that. The reason there are no explicit barriers on x86 to be seen is that the used instructions already give the necessary visibility guarantees. – Voo Jan 05 '15 at 06:18
  • 1
    @Voo http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile Maybe I am not accurate in terminology and description but this is the same semantics. – Alex Suo Jan 05 '15 at 06:36
  • 1
    The JMM only talks visibility guarantees and reordering limitations but isn't concerned how those are implemented. On x86 with its strong memory model and cache coherency writing back to memory is absolutely not necessary, would have horrible performance and is generally not done. Java's volatile and C's volatile have also nothing in common apart from the name (C's version is completely useless for multithreaded programming, while Java's doesn't help with ISRs or hardware). See the [JSR-133 cookbook](http://gee.cs.oswego.edu/dl/jmm/cookbook.html) for information how it's generally implemented. – Voo Jan 05 '15 at 07:31
  • I understand that the above operation is not atomic. I am asking how do the assembly instructions bypass the cache mechanism and how does the LOCK statement ensure that the value is written back to memory and not kept in cache. Please read my comment to the previous answer. – Salil Surendran Jan 05 '15 at 09:22
  • 1
    @Salil Wrong assumption, the cache isn't bypassed, you can be pretty sure the value isn't written back to memory (for the same reason nobody uses write-through caches). And the reason the `LOCK` does give you the necessary barriers is because Intel specified the ISA in such a way (lots of ways to do so, snooping for one). – Voo Jan 05 '15 at 15:44
0

The barrier may be optimized out by your JIT compiler since your program is single-threaded (there is only one thread, the main thread), just as a lock can be elided in a single-threaded environment. This optimization is independent of the processor architecture.

user2351818
  • `lock add [rsp], 0` is a full barrier, same as `MFENCE`. It's weird that the increment of `[r10+74]` isn't atomic, though. – Peter Cordes Feb 09 '16 at 15:11