
I am trying to solve this assignment, and I am seeing that my multi-threaded solution performs worse than my single-threaded solution.

Assignment:

Create a utility that will perform a simple XOR cryptographic transform on a given set of data - an XOR stream encryptor. The encryption key will be provided by the binary data in an arbitrarily-sized external file. The size of the key in bytes will dictate the block size. The plaintext data will be given on stdin, where the utility will then break it into block-sized sections, XOR it against the key, and write the cipher text to stdout. After each block is processed, the key should be rotated to the left by one bit to make a new key. This means that the key will repeat every N blocks, where N is the number of bits in the key. The plaintext data need not be a multiple of the block size in length, nor should it be assumed to be ASCII or Unicode text. It is also valid for the plaintext to be extremely large, far exceeding the available memory+swap space for the system.

In addition to correctly performing the transform, the utility should be able to take advantage of multi-core or multi-processor machines by implementing a multi-threading scheme that processes multiple blocks of plaintext in parallel. The number of threads should be specified as a command-line argument and, regardless of how many are used, the output cipher text should remain the same. All error and/or status information should be printed on stderr.

The utility should be written in C or C++ and should build on a UNIX platform. Please make sure that your submission compiles and runs correctly with a variety of different inputs.

Required command-line options:

encryptUtil [-n #] [-k keyfile]

    -n #         Number of threads to create
    -k keyfile   Path to file containing key

Example input/output:

$ hexdump -C keyfile
00000000  f0 f0                                             |..|

$ hexdump -C plaintext
00000000  01 02 03 04 11 12 13 14                           |........|

$ cat plaintext | encryptUtil -n 1 -k keyfile > ciphertext

$ hexdump -C ciphertext
00000000  f1 f2 e2 e5 d2 d1 94 93                           |........|

What would be the most efficient way to go about this?

My approach:

  1. Generate all keys ahead of time: Given that the input size can be very large, rotating the key for every XOR block operation can be expensive, so I decided to generate all rotations of the key (a "key chain") ahead of time and index into the key chain for each XOR block operation (a minimal sketch appears below, after this description).

  2. Read data in batches of thread_count x block_size: Given that the input data can be very large, it is better to read the data in batches and process them. Partitioning the data also lends itself well to porting the solution to technologies like OpenCL that use multiple processing units.

  3. Pipelining: Given that this program has two I/O operations (reading input and writing output), pipelining seemed like a good solution.

We read batches of data of size (thread_count x block_size) in the main thread and send each batch to a dispatcher, which gets a LOCAL COPY of the input data to process. The main thread is then free to read more data while the dispatcher processes the input plaintext from its local copy.

The dispatcher spawns N worker threads which actually do the XOR operation, indexing into the key chain and generating their partial outputs. When all the workers have generated their partial outputs, they signal the dispatcher, which writes the data to stdout.

All this happens while the main thread prepares the next batch of data to be processed.
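Here is a minimal sketch of the key-chain precompute from step 1 (the names and layout are illustrative, not the exact code in my repo):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Rotate an arbitrary-length key left by one bit: each output byte keeps its
    // own bits shifted left, and takes the top bit of the next byte (wrapping).
    static void rotl1(const uint8_t *src, uint8_t *dst, size_t len) {
        for (size_t i = 0; i < len; ++i)
            dst[i] = (uint8_t)((src[i] << 1) | (src[(i + 1) % len] >> 7));
    }

    // Build the key chain: key_len*8 rotations, each derived from the previous one.
    static std::vector<uint8_t> build_key_chain(const uint8_t *key, size_t key_len) {
        size_t key_bits = key_len * 8;
        std::vector<uint8_t> chain(key_bits * key_len);
        std::memcpy(chain.data(), key, key_len);
        for (size_t k = 1; k < key_bits; ++k)
            rotl1(&chain[(k - 1) * key_len], &chain[k * key_len], key_len);
        return chain;
    }

    // Block number n then XORs against &chain[(n % key_bits) * key_len].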

The approach looked logical, but it backfired :( from a performance point of view, at least for this specific problem. Because reading from stdin and writing to stdout are fairly fast, I am not able to see the benefit of the pipeline. It looks like the data movement and the synchronization overhead cause the program to run slower than a single-threaded program, which I am also including as a submitted solution (it also uses pre-generated keys).

My solution is at:

  • Repo (contains the problem description and the two solutions): https://github.com/vjtron/encryptUtil
  • Multi-threaded solution: https://github.com/vjtron/encryptUtil/tree/master/source/encryptUtil_pthread

vjtron
2 Answers


The issue here is going to be minimising memory copies. One core of a modern Intel (or AMD) CPU can saturate main-memory bandwidth. e.g. Haswell: ~26GB/s max theoretical main memory bandwidth. 26GB/s / 8B = 3.25GHz clock speed minimum to saturate DRAM, in code that on average touches 8B of new memory per clock (repeated access to the same 8B hits in L1 cache). This is fairly easy to achieve with just scalar code, esp. if you're writing, not just reading. (By comparison, each Haswell core can sustain two (up to) 32B reads and one (up to) 32B write per cycle to L1 D$. Each core has its own 32k L1 and 256k L2$.)

If your solution to multi-threading involves an extra memcpy, you're completely shooting yourself in the foot. To have any hope, you need each thread to do its own read from the file descriptor. A read system call makes the kernel memcpy file data into your buffer. This happens on the CPU that ran the system call, so the data should be in cache afterwards.

Probably a ~16k buffer size for reads would be good, as that's half of L1D$ (on Intel designs. AMD uses 16k L1 D$, so maybe even smaller buffers would be good there, since L2 is large but slower than Intel's L2). (pronounce "$" as "cache"). However, we can still keep up with DRAM with L2 bandwidth from multiple cores, and larger buffers would reduce synchronization / system call overhead. So perhaps 128k would be a better choice.

This is not how you'd normally go about doing I/O in a multithreaded program. I think normally you'd have an I/O thread, and worker threads would work on buffers read in by the I/O thread.

(See below for another design I thought of that would still use an I/O thread, but get data to worker threads more efficiently than read + memcpy.)

We're trying to maximize cache benefits by having each thread be the only thread to touch its block of memory. It's possible to imagine a situation where you wanted to multi-thread something that was a little bit more computationally intensive; not quite memory-bound on a single core, but easily memory-bound on multiple cores. This makes it more fun than just multi-threading with the goal of not being slower than a simple single-threaded implementation!

Since you apparently have to read your input from stdin, instead of by memory-mapping a file, you need to get each thread to do its own reads into its own buffer, so the data will be in L1/L2$ of the core running that thread. You have to get output blocks written in order, and for that you need to know which thread has which chunk of data.

I'm thinking it might be ideal to just have every thread read stdin, and then figure out which thread got which block to order them for output. The problem is, read(2) doesn't return a file position or anything. Using lseek(2) to query the position and then read(2) would introduce a race condition, and so would an atomic counter (where you get the old or new value as part of atomically adding 8k to it). Reading from an unseekable stream (which doesn't have a file position), you'd need a counter anyway.

You're limited by memory bandwidth anyway, so it's not a problem for threads to block each other to some degree. I have an idea that requires a lock only for input, with output ordering thanks to an atomic counter:

  • input lock and position counter (both in the same cache line)
  • an output position counter (atomic). (In a different cache line from the input lock, to avoid false sharing. Although threads do need to acquire the input lock right after they increment the output pos counter.)

each thread:

while (42) {
     acquire(input.lock);
     size_t blockpos = input.pos;
     ssize_t count = read(STDIN_FILENO, buf, blksz);
     if (count != blksz) {
         // special case, maybe try to fill the rest of the buffer if it wasn't EOF?
     }
     input.pos += count;
     release(input.lock);

     process(buf, blockpos);  // needs the position to know what key rotation to start with

     spin_until(output.pos == blockpos);

     // acquire(output.lock);  // no other thread will have the same blockpos
     write(STDOUT_FILENO, buf, count);
     store_fence();   // not sure this is needed; the kernel probably makes sure that write ordering is globally visible.
     // So the output.pos increment couldn't be globally visible before write's change to the file position.
     output.pos += count;  // make sure this is an atomic increment
     // release(output.lock);
}

I'm not sure how best to implement spin_until(output.pos == blockpos), i.e. how to get a thread to wait for the atomic output-position counter to match the input block it has. (I know the theory on multi-threading, but haven't had much implementation experience.)
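One way to express that wait in C++11, assuming output.pos is a std::atomic<size_t> (the names here are mine, and the yield is just one choice; a pause instruction or a futex-style wait would also work):

    #include <atomic>
    #include <cstddef>
    #include <thread>

    std::atomic<size_t> output_pos{0};       // shared output position counter

    // Spin until it's this thread's turn to write its block.
    static void spin_until_equal(size_t blockpos) {
        while (output_pos.load(std::memory_order_acquire) != blockpos)
            std::this_thread::yield();
    }

    // After write() returns, publish the new position. Only the thread whose
    // turn it is updates the counter, so a plain release store is enough:
    //   output_pos.store(blockpos + count, std::memory_order_release);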

If you were reading from a seekable file (but couldn't mmap it), you could use the pread(2) system call. It's like read(2), but with an off_t offset argument. Every thread could get the Nth block, where N is the thread count. You'd need locking to control ordering on write. Unless you just use pwrite, then maybe every 16 blocks, make sure threads haven't gotten out of step with each other, because sequential I/O is faster than random, even on SSDs. You'd have to benchmark this: you might be making extra work for the kernel I/O scheduler. Esp. if you weren't rewriting a file in place, the file would temporarily have holes when you write beyond EOF instead of at EOF. The actual I/O scheduler probably doesn't get involved until a while later, when the kernel eventually flushes the dirty pages you wrote a while ago. (Which will be contiguous, because you don't let your threads get too far apart).
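For the seekable-file case, a rough sketch of that per-thread pread(2)/pwrite(2) loop (the function name and the elided XOR step are placeholders):

    #include <unistd.h>
    #include <cstdint>

    // Thread t of nthreads handles blocks t, t + nthreads, t + 2*nthreads, ...
    static void pread_worker(int infd, int outfd, size_t blksz,
                             unsigned t, unsigned nthreads, uint8_t *buf)
    {
        for (uint64_t blk = t; ; blk += nthreads) {
            off_t off = (off_t)(blk * blksz);
            ssize_t n = pread(infd, buf, blksz, off);
            if (n <= 0)
                break;                            // EOF or error
            // XOR buf[0 .. n) against key rotation (blk % key_bits) here
            pwrite(outfd, buf, (size_t)n, off);   // write at the same offset
        }
    }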


Alternative: still an I/O thread but with less copying

This still involves the data bouncing around between caches of different CPU cores, but usually that's not the bottleneck on problems that are CPU intensive enough to be worth parallelizing.

You could still implement it with one I/O thread, but reduce copying by having the I/O thread read into buffers for the worker threads directly. So each worker thread would have two buffers of maybe 64k each.

  • Worker threads: When the worker thread sees a flag marking one of the two buffers as ready for processing, it does a load-fence, processes it, does a store-fence, and then clears the flag. i.e. load flags with a load(std::memory_order_acquire), and write flags with store(std::memory_order_release). (StoreStore, LoadLoad, and LoadStore barriers are implicit (zero instructions) on x86, but you still need to put it in the code to stop compiler reordering. Not to mention, so your code works on other architectures.)

  • I/O thread: read into an empty buffer for a worker thread, store the stream position (so the worker knows what key rotation to start at), then set a flag with a store(std::memory_order_release). Make sure to check the flags with std::memory_order_acquire when looking for empty buffers.

I think you can get away without using a StoreLoad barrier anywhere, because we're not taking a lock and then reading some protected data. Instead, we're reading a flag and then reading / rewriting data. So the StoreLoad case doesn't come into it. Even if we weren't rewriting, but rather just writing, we'd only need a LoadStore barrier along with our LoadLoad barrier.

In the same order you read buffers, write them when their flags are cleared. Ideally, write a buffer and then re-fill it right away, so it's still in cache on the I/O thread's CPU.
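A minimal sketch of that two-buffer handoff for the worker side, assuming a flag per buffer (the struct layout, buffer size, and the elided XOR are placeholders):

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    constexpr size_t BUFSZ = 64 * 1024;

    struct SharedBuf {
        uint8_t  data[BUFSZ];
        size_t   len;                        // bytes valid in data
        uint64_t stream_pos;                 // tells the worker which key rotation to start at
        std::atomic<bool> ready{false};      // false: owned by I/O thread, true: owned by worker
    };

    // Worker side: alternate between its two buffers.
    static void worker_loop(SharedBuf bufs[2]) {
        for (int i = 0; ; i ^= 1) {
            SharedBuf &b = bufs[i];
            while (!b.ready.load(std::memory_order_acquire))
                ;                            // spin (or yield / futex wait)
            if (b.len == 0)
                break;                       // I/O thread signals EOF with an empty buffer
            // ... XOR b.data[0 .. b.len) against the key rotation implied by b.stream_pos ...
            b.ready.store(false, std::memory_order_release);   // hand back to the I/O thread
        }
    }

    // I/O thread side (per buffer): fill b.data, set b.len and b.stream_pos, then
    // b.ready.store(true, std::memory_order_release). Write buffers out, in order,
    // once their ready flags have gone back to false.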

Data will move:

  1. disk -> DRAM (DMA, probably doesn't enter cache at all). (Skip this step if the file is in the disk cache.)
  2. DRAM(pagecache) -> L3, and L2/L1 of whatever core the I/O thread is running on.

    2.1 L1->registers : read system call reading from pagecache in the I/O thread

    2.2 registers -> another part of L1/L2 (modified (M state)). read writing to our buffer in the I/O thread.

    2.x. This is where your code does an extra copy, L2->registers->L2 in the I/O thread.

  3. L2(IO-CPU) -> L3 when the worker thread's CPU needs those cache lines, and coherency traffic finds another core had them in E or M state.

  4. L3 -> L2/L1 and registers of the worker thread's CPU
  5. modified (M) in L1/L2 of the worker thread CPU, when it writes the modified buffer.
  6. flushed to L3 by coherency traffic when the I/O thread's write system call reads it
  7. L3 -> L2/L1 of the I/O thread. (write system call)
  8. L2/L1 -> registers -> L1/L2 I/O thread CPU. (The write ops of the write system call, copying our buffer to the pagecache.)
  9. L2 -> L3 and DRAM (pagecache) when cache pressure eventually evicts the cache lines holding already-written output data.
  10. DRAM(pagecache) -> disk when the kernel flushes dirty output-file pages

If we're lucky, and keep our buffers small, most of the time the extra copies of the data (i.e. our worker-thread buffers) don't make it to DRAM, just L3. Data is transferred between cores by cache coherency traffic. This is one of the major benefits of the large last-level-cache on recent Intel designs. It's a cache-coherency buffer that prevents round-trips to main memory. This might not be mentioned in Ulrich Drepper's What Every Programmer Should Know About Memory, since it's from 2007.

I forget if cores can transfer cache lines directly from another core's L2, instead of writing to L3, then reading from L3. I think maybe, but bouncing cache lines between cores is still slow.

Your extra memcpy is just one extra line in my list, but it uses up cache space, and is likely to evict data back to main memory. It may also use extra L3 shared-bandwidth just for the copying. I think the main problem here was your buffers were probably too big, and ended up hitting main memory. (I didn't check your git repo.)


Key rotation

If the key is 8, 16, 32, or 64bits, then rotating it is extremely cheap (on x86). Rotating a register by 1 is a single-cycle instruction. See https://stackoverflow.com/a/31488147/224132 for how to code a rotate in C that actually compiles to the machine instruction without any C undefined-behaviour (or extra instructions generated from the code that avoids the UB). This is only a good idea for 64b keys, or maybe 32. Others are too short, since you need a different rotation of the small key for each part of a 64b value.
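For a rotate by exactly one bit there's no UB to dodge in the first place; something like this compiles to a single rol instruction with gcc/clang (sketch for a 64-bit key):

    #include <cstdint>

    // Rotate a 64-bit key left by one bit. Shift counts are 1 and 63, so there
    // is no undefined behaviour, and gcc/clang recognize the idiom as a rotate.
    static inline uint64_t rotl1_u64(uint64_t k) {
        return (k << 1) | (k >> 63);
    }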

If the key isn't one of those sizes, then it will be more expensive. Your solution of unrolling the key-rotate until it repeats is a good way to go, otherwise you're at the mercy of the compiler to generate optimal rotate and rotate-with-carry instructions.

Even if it's an even 128b, rotating that would probably require testing the high bit of the low 64, then rotate-with-carry the high 64, then rotate-with-carry the low 64. RCL r,1 is fairly efficient, only ~3x the cost of a ROL r, 1 (which is similar to a single add). If it's some irregular size, like 11 bytes, a pair of 64b and 32b rotate-with-carry instructions won't do the trick.

If you do fully unroll the key repetition, then up to 15 * 16B will fit in vector registers (or 15 * 32B with AVX). You need one reg free as a scratch. Hopefully a good compiler will realize all this and vectorize XORing a key array onto your buffer. Although as long as the key is that small, it'll still fit in L1 cache, and those extra L1 reads (and the instructions to do them) won't be a problem if you're multithreading, and limited by main memory.
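For what it's worth, the XOR itself can just be the obvious byte loop; gcc and clang will typically auto-vectorize it at -O3 (sketch, with the key already rotated for this block, and the key length equal to the block size as the assignment specifies):

    #include <cstddef>
    #include <cstdint>

    // XOR one block against one (pre-rotated) copy of the key, byte by byte.
    static void xor_block(uint8_t *buf, const uint8_t *key, size_t blksz) {
        for (size_t i = 0; i < blksz; ++i)
            buf[i] ^= key[i];
    }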

update: EOF has a great suggestion: Just unroll the key rotations until you have a whole number of bytes to worry about. If you have a buffer with 8 keys in a row, you should be able to get any rotation of the key you need, from somewhere in the buffer.
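One way to flesh that out (an illustrative sketch, not EOF's code): store the key rotated by 0..7 bits, each rotation twice back to back, so the key rotated by any n bits is just a key_len-byte window in row n % 8 starting at byte n / 8:

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // 8 rows, one per bit rotation 0..7; each row holds two copies of that
    // rotation so a window can start at any byte offset without wrapping.
    static std::vector<uint8_t> build_key_table(const uint8_t *key, size_t key_len) {
        std::vector<uint8_t> table(8 * 2 * key_len);
        for (unsigned b = 0; b < 8; ++b) {
            uint8_t *row = &table[b * 2 * key_len];
            for (size_t i = 0; i < key_len; ++i)
                row[i] = (uint8_t)((key[i] << b) | (key[(i + 1) % key_len] >> (8 - b)));
            std::memcpy(row + key_len, row, key_len);   // second copy for wrap-free indexing
        }
        return table;
    }

    // Key rotated left by n bits (0 <= n < key_len*8):
    //   const uint8_t *k = &table[(n % 8) * 2 * key_len + n / 8];
    // k[0 .. key_len) is the rotated key, and the whole table is only 16*key_len
    // bytes instead of key_bits*key_len for the fully unrolled chain.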

Peter Cordes
  • I'd recommend against caching all key rotations, since that'd take `key_bits*key_bits` of memory. I'd rather cache 8 copies and do the byte-shuffling as needed. – EOF Aug 16 '15 at 13:10
  • That's an *excellent* suggestion. I hadn't really given the very-large-key problem much thought, but this is potentially useful for a key that fits in registers, as well as the very-large-key case. (`palignr` / `shrd` can shift bytes between registers.) On a system with fast unaligned memory access (like x86), you can have 9 copies, and sequentially access 8 * `key_bits` at whatever byte offset you need. Or, do the odd part of the size up to 16B alignment, and then do wide XORs between the key and buffer. – Peter Cordes Aug 16 '15 at 13:56

The algorithm you apply to each block is so simple that the cost of sequential I/O will dominate; running the XOR loops in multiple threads will not make the program any faster. If you were reading the input from a file it could be beneficial to use threads to read, encrypt, and write different blocks concurrently.

Joni
  • *If you were reading the input from a file it could be beneficial to use threads to read, encrypt, and write different blocks concurrently.* Tell that to the (single) drive head ! – High Performance Mark Aug 16 '15 at 10:00
  • Not all files are stored on rotating disks. – Joni Aug 16 '15 at 10:01
  • Exactly my thoughts. The memcpy part of reading it from stdin is the dominant cost, even if it's not coming from disk. Getting the data into L1 cache of another CPU core will just slow things down. Threading will only help (or at least not slow things down) if you can memory-map your input somehow. – Peter Cordes Aug 16 '15 at 11:08