X64 ASM: a 256-bit MOV to register possible?

Question

I have a 256-bit data structure (an array of four int64 values) that I need to load and perform integer addition on in x64 assembly. Instead of doing 4 MOVs, I'm trying to load all 4 values with a single instruction. Is this possible? I thought I could do it with MOVDQA, but that apparently only loads into the XMM registers, and from there my only option seems to be a floating-point add, which is not what I need.

Edit: Current routine is:

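    ; load the four 64-bit limbs of the first operand (least significant first)
    mov rax, [rcx]
    mov r8, 8[rcx]
    mov r9, 16[rcx]
    mov r10, 24[rcx]
    ; add the second operand, propagating the carry from limb to limb
    add rax, [rdx]
    adc r8, 8[rdx]
    adc r9, 16[rdx]
    adc r10, 24[rdx]
    jc  adjust_modular
    ; store the 256-bit sum back over the first operand
    mov [rcx], rax
    mov 8[rcx], r8
    mov 16[rcx], r9
    mov 24[rcx], r10
    adjust_modular: (....)
    ret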

https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX — Hans Passant, Nov 19 '20 at 23:10
https://www.felixcloutier.com/x86/movdqu:vmovdqu8:vmovdqu16:vmovdqu32:vmovdqu64 shows the `vmovdqu ymm, mem` form. See https://www.felixcloutier.com/x86/paddb:paddw:paddd:paddq for `vpaddq`. IDK why you think only FP add would be possible in XMM or YMM registers. If you didn't find this in the manual, have a look at Agner Fog's optimizing assembly guide https://agner.org/optimize/ — Peter Cordes, Nov 19 '20 at 23:28
If you want the result to end up in 4 general purpose registers, doing 4 loads is pretty much your best option. — fuz, Nov 19 '20 at 23:40
"I'm trying to load all 4 with a single instruction." Why? Just for efficiency, or are you trying to get some sort of atomicity? — Nate Eldredge, Nov 20 '20 at 00:47
@peterCordes; I'll check those articles and see if I can get it to work, thx. @NateEldredge, I'm just trying to make the routine as efficient as possible; I don't need atomicity. — Endymio, Nov 20 '20 at 02:27
@Endymio It is unlikely that the load itself is the bottleneck. If you post the routine in question, I might be able to give you some better optimisation tips. — fuz, Nov 20 '20 at 09:14
@fuz, thanks. I added the current routine; it's a very basic bignum add carry chain. (I left out the adjustment for modular arithmetic as it's not really relevant atm). My last assembly experience was in the days of the 386, so I may have made some very basic errors. — Endymio, Nov 20 '20 at 15:01
@Endymio If you are doing big number arithmetic with carry, there is likely no way to do this any more efficiently than you already do. As you have a carry chain, there is a linear dependency that cannot be eliminated. SIMD will not help for this. — fuz, Nov 20 '20 at 15:10
@fuz, I was resigned to doing 4 adds. It just seemed like the MOVs could be condensed down more than that. If not, and you post that as an answer, I'll accept it. — Endymio, Nov 20 '20 at 16:17
If you do one `vmovdqu` load into a YMM register, then you need 4 ALU instructions to get each 64-bit element back into an integer register. That's not better. — Peter Cordes, Nov 20 '20 at 20:06
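
The point made in the comments, that a packed 64-bit add such as `vpaddq` is available but does not help with a carry chain, can be illustrated with a small portable C model. This is only a sketch under stated assumptions: the function names `add256_carry` and `add256_lanes` are hypothetical, the limbs are assumed to be stored least significant first (as in the routine above), and the modular adjustment is left out just as it is in the question.

```c
#include <stdint.h>

/* Carry-chain add: a += b over four 64-bit limbs, least significant first.
   Models ADD followed by three ADCs and returns the carry out of the top
   limb. (Hypothetical helper name; stores the sum unconditionally.) */
static unsigned add256_carry(uint64_t a[4], const uint64_t b[4])
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s = a[i] + b[i];   /* wraps modulo 2^64 */
        unsigned c1 = s < a[i];     /* carry out of a[i] + b[i] */
        uint64_t t = s + carry;     /* add the carry from the previous limb */
        unsigned c2 = t < s;        /* carry out of adding the carry-in */
        a[i] = t;
        carry = c1 | c2;            /* at most one of c1, c2 can be set */
    }
    return carry;
}

/* Lane-wise add: models what a packed 64-bit integer add computes.
   Each lane wraps on its own and no carry moves between lanes.
   (Hypothetical helper name.) */
static void add256_lanes(uint64_t a[4], const uint64_t b[4])
{
    for (int i = 0; i < 4; i++)
        a[i] += b[i];
}
```

For example, adding `{1, 0, 0, 0}` to `{UINT64_MAX, 0, 0, 0}` gives `{0, 1, 0, 0}` with `add256_carry` but `{0, 0, 0, 0}` with `add256_lanes`: the carry out of the lowest limb is simply lost in the lane-wise version. Each iteration of the carry-chain loop needs the carry from the previous one, which is the linear dependency fuz describes.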
