
I'm looking at some of Prof. Don Knuth's code, written in CWEB that is converted to C. A specific example is dlx1.w, available from Knuth's website

At one stage, the .len member of the struct nd[cc] is decremented, and it is done in a clunky way:

  o,t=nd[cc].len-1;
  o,nd[cc].len=t;

(This is a Knuth-specific question, so maybe you already know that "o," is a preprocessor macro for incrementing "mems", which is a running total of effort expended, as measured by accesses to 64-bit words.) The value remaining in "t" is definitely not used for anything else. (The example here is on line 665 of dlx1.w, or line 193 of dlx1.c after ctangle.)
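
(For concreteness, my understanding is that the definitions are roughly the following -- I'm paraphrasing the idea, not quoting dlx1.w verbatim:

  typedef unsigned long long ullng;
  ullng mems;           /* running total of 64-bit memory accesses */
  #define o mems++      /* charge one access */
  #define oo mems += 2  /* charge two accesses */

so that "o,t=nd[cc].len-1;" expands to "mems++,t=nd[cc].len-1;" -- the comma operator bumps the counter and then does the real work.)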

My question is: why does Knuth write it this way, rather than

nd[cc].len--;

which he does actually use elsewhere (line 551 of dlx1.w):

oo,nd[k].len--,nd[k].aux=i-1;

(And "oo" is a similar macro for incrementing "mems" twice -- but there is some subtlety here, because .len and .aux are stored in the same 64-bit word. To assign values to S.len and S.aux, only one increment to mems would normally be counted.)

My only theory is that a decrement consists of two memory accesses: first to look up, then to assign. (Is that correct?) And this way of writing it is a reminder of the two steps. This would be unusually verbose of Knuth, but maybe it is an instinctive aide-memoire rather than didacticism.

For what it's worth, I've searched in CWEB documentation without finding an answer. My question probably relates more to Knuth's standard practices, which I am picking up bit by bit. I'd be interested in any resources where these practices are laid out (and maybe critiqued) as a block -- but for now, let's focus on why Knuth writes it this way.

Ed Wynn
  • Everything Knuth does is clunky and optimized for confusion. – Boann Dec 31 '18 at 16:39
  • @Boann I'm curious what makes you say that; could you elaborate? Personally I've always found Knuth's writing clear and delightful, starting with my first encounter (_Concrete Mathematics_) down to a few of his papers and parts of _The Art of Computer Programming._ (His programming style does seem to have diverged from the mainstream in the 1970s, i.e. he's found different solutions than most others, but I wonder what exactly you're referring to.) – ShreevatsaR Jan 01 '19 at 02:48
  • @ShreevatsaR I'm surprised to hear someone say that! I struggled to make any headway with TAOCP, and gave up and tossed the book. To me, Knuth is someone who really wishes programming was less language and more abstract recreational mathematics, so he pretends it's that way, despite that being (to me) completely confusing and impractical. Look at those absurd variable names in the question; `o`, `t`, `nd`, `cc`, `aux`. It's incomprehensible. – Boann Jan 01 '19 at 15:43
  • @Boann Well your experience is yours and I can't dispute it but IMO Knuth is the most “real-world” of the mathematicians/algorithmists: the one most explicitly *not* pretending that abstraction is real, but analyzing what might happen with real programs on real computers (e.g. he does not analyze only Big-O asymptotics, but down to the constant factor). Of course TAOCP is about mathematical analysis of algorithms (see preface), but IMO it's more “concrete” than any of its successors/alternatives (e.g. which of them include assembly programs to study effects of cache, RAM size, pipelining etc?) – ShreevatsaR Jan 03 '19 at 13:13

2 Answers


A preliminary remark: with Knuth-style literate programming (i.e. when reading WEB or CWEB programs), the “real” program, as conceived by Knuth, is neither the “source” .w file nor the generated (tangled) .c file, but the typeset (woven) output. The source .w file is best thought of as a means of producing it (and of course also the .c source that's fed to the compiler). (If you don't have cweave and TeX handy: I've typeset some of these programs here; this program, DLX1, is here.)

So in this case, I'd describe the location in the code as module 25 of DLX1, or subroutine "cover":

[screenshot of the typeset module 25 of DLX1 (the cover routine)]

Anyway, to return to the actual question: note that this program (DLX1) is one of the programs written for The Art of Computer Programming. Because reporting the time taken by a program in “seconds” or “minutes” becomes meaningless from year to year, he reports how long a program took as a number of “mems” plus “oops”, dominated by the “mems”, i.e. the number of memory accesses to 64-bit words (usually). So the book contains statements like “this program finds the answer to this problem in 3.5 gigamems of running time”. Further, such statements are intended to be fundamentally about the program/algorithm itself, not about the specific code generated by a specific version of a compiler for particular hardware. (Ideally, when the details are very important, he writes the program in MMIX or MMIXAL and analyses its operations on the MMIX hardware, but this is rare.) Counting the mems (to be reported as above) is the purpose of inserting the o and oo instructions into the program. Note that it's more important to get this right for the “inner loop” instructions that are executed many times, such as everything in the subroutine cover in this case.
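
To make the bookkeeping concrete, here is a minimal standalone sketch of the discipline (a toy of my own, not code from DLX1): every load or store of a 64-bit word in the data structure under study gets an o (or oo) charge, locals and loop counters are treated as "registers" and not charged, and the total is reported at the end.

  #include <stdio.h>

  typedef unsigned long long ullng;
  ullng mems;               /* running total of 64-bit memory accesses */
  #define o mems++
  #define oo mems += 2

  long long a[1000];        /* the "interesting" data: one 64-bit word per entry */

  int main(void) {
    long long t;
    for (int i = 0; i < 1000; i++)
      o, a[i] = i;          /* one store each: 1000 mems */
    for (int i = 0; i < 1000; i++) {
      o, t = a[i] - 1;      /* load  */
      o, a[i] = t;          /* store */
    }
    printf("total effort: %llu mems\n", mems);  /* prints 3000 */
    return 0;
  }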

This is elaborated in Section 1.3.1′ (part of Fascicle 1):

Timing. […] The running time of a program depends not only on the clock rate but also on the number of functional units that can be active simultaneously and the degree to which they are pipelined; it depends on the techniques used to prefetch instructions before they are executed; it depends on the size of the random-access memory that is used to give the illusion of 2^64 virtual bytes; and it depends on the sizes and allocation strategies of caches and other buffers, etc., etc.

For practical purposes, the running time of an MMIX program can often be estimated satisfactorily by assigning a fixed cost to each operation, based on the approximate running time that would be obtained on a high-performance machine with lots of main memory; so that’s what we will do. Each operation will be assumed to take an integer number of υ, where υ (pronounced “oops”) is a unit that represents the clock cycle time in a pipelined implementation. Although the value of υ decreases as technology improves, we always keep up with the latest advances because we measure time in units of υ, not in nanoseconds. The running time in our estimates will also be assumed to depend on the number of memory references or mems that a program uses; this is the number of load and store instructions. For example, we will assume that each LDO (load octa) instruction costs µ + υ, where µ is the average cost of a memory reference. The total running time of a program might be reported as, say, 35µ+ 1000υ, meaning “35 mems plus 1000 oops.” The ratio µ/υ has been increasing steadily for many years; nobody knows for sure whether this trend will continue, but experience has shown that µ and υ deserve to be considered independently.

And he does of course understand the difference from reality:

Even though we will often use the assumptions of Table 1 for seat-of-the-pants estimates of running time, we must remember that the actual running time might be quite sensitive to the ordering of instructions. For example, integer division might cost only one cycle if we can find 60 other things to do between the time we issue the command and the time we need the result. Several LDB (load byte) instructions might need to reference memory only once, if they refer to the same octabyte. Yet the result of a load command is usually not ready for use in the immediately following instruction. Experience has shown that some algorithms work well with cache memory, and others do not; therefore µ is not really constant. Even the location of instructions in memory can have a significant effect on performance, because some instructions can be fetched together with others. […] Only the meta-simulator can be trusted to give reliable information about a program’s actual behavior in practice; but such results can be difficult to interpret, because infinitely many configurations are possible. That’s why we often resort to the much simpler estimates of Table 1.

Finally, we can use Godbolt's Compiler Explorer to look at what a typical compiler generates for this code. (Ideally we'd look at MMIX instructions, but as we can't do that, let's settle for the default there, which seems to be x86-64 gcc 8.2.) I removed all the o's and oo's.

For the version of the code with:

  /*o*/ t = nd[cc].len - 1;
  /*o*/ nd[cc].len = t;

the generated code for the first line is:

  movsx rax, r13d
  sal rax, 4
  add rax, OFFSET FLAT:nd+8
  mov eax, DWORD PTR [rax]
  lea r14d, [rax-1]

and for the second line is:

  movsx rax, r13d
  sal rax, 4
  add rax, OFFSET FLAT:nd+8
  mov DWORD PTR [rax], r14d

For the version of the code with:

  /*o ?*/ nd[cc].len --;

the generated code is:

  movsx rax, r13d
  sal rax, 4
  add rax, OFFSET FLAT:nd+8
  mov eax, DWORD PTR [rax]
  lea edx, [rax-1]
  movsx rax, r13d
  sal rax, 4
  add rax, OFFSET FLAT:nd+8
  mov DWORD PTR [rax], edx

which as you can see (even without knowing much about x86-64 assembly) is simply the concatenation of the code generated in the former case (except that it uses register edx instead of r14d), so it's not as if writing the decrement on one line has saved you any mems. In particular, it would not be correct to count it as a single mem, especially in something like cover that is called a huge number of times in this algorithm (dancing links for exact cover).

So the version as written by Knuth is correct, for its goal of counting the number of mems. He could also write oo,nd[cc].len--; (counting two mems), as you observed, but perhaps that would look like a bug at first glance. (BTW, in your example from the question, oo,nd[k].len--,nd[k].aux=i-1;, the two mems come from the load and the store in the decrement, not from two stores.)
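
Just to spell out the accounting with a sketch (using the macro definitions as I understand them, and a cut-down struct for illustration only): both spellings charge the same two mems, one for the load and one for the store; the two-line version merely makes the two accesses visible as separate steps.

  typedef struct { int len, aux; } node;  /* cut-down illustration, not the real node */
  unsigned long long mems;
  #define o mems++
  #define oo mems += 2
  node nd[100];

  void demo(int cc) {
    int t;
    o, t = nd[cc].len - 1;  /* mem 1: load nd[cc].len */
    o, nd[cc].len = t;      /* mem 2: store it back */

    oo, nd[cc].len--;       /* same two mems, charged up front */
  }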

ShreevatsaR
  • Good answer! -- thanks for putting in the effort. I still don't know why Knuth didn't write "oo,nd[cc].len--;". (He could assume that readers will catch up with the idea that decrement costs two mems.) I accept that this is a small stylistic question. (I wasn't sure that it was so small when I first asked.) – Ed Wynn Dec 30 '18 at 20:01
  • @EdWynn Yeah… Another secret: sometimes literate programming is just a method of organizing code, and does not necessarily mean that a program has had a lot of “spit and polish” applied, i.e. literate programs can still be hurriedly written, have bugs, etc. There are various things possible here, e.g. when he wrote it he was thinking of something else, maybe expected to add more instructions in between, maybe wanted to prevent the compiler from optimizing away the store that happens immediately… probably doesn't matter :-) – ShreevatsaR Dec 30 '18 at 23:22

This whole practice seems to be based on a mistaken idea/model of how C works, namely that there is some correspondence between the work performed by the abstract machine and the work performed by the actual program as executed (i.e. the "C is portable assembler" fallacy). I don't think we can say much more about why that exact code fragment appears, except that it happens to be an unusual idiom for counting the loads and stores of the abstract machine separately.

R.. GitHub STOP HELPING ICE
  • One motivation for my question was to check that the "clunky" method does not produce a different effect, maybe for some weird edge case that I cannot envisage. (Both S.len and t have type "int", so there is limited room for edge cases.) You have focused on the mems-counting, which implies that the effect is just plain decrement. OK, good, I can relax about the effects. – Ed Wynn Dec 30 '18 at 18:33
    The memcount does not need to be based on a correct model to be useful. It can be pragmatically a proxy measure for "work required", in a way that is architecture-/compiler-/system-independent. As such, it is very useful, and I don't know of a better measure. I recently compared memcounts to "CPU time" on some test cases: the relationship was linear. ("CPU time" means even less now than it ever did, so I effectively measured "clock time with no other user processes".) With branching and prefetching and caches, any real measure is highly system-dependent, thus hard to reproduce. – Ed Wynn Dec 30 '18 at 18:47