
I'm having trouble finding the NEON intrinsic I need. I have a 128-bit value as an int64x2_t, and I need to copy the low 64 bits to the high 64 bits. On occasion I also need to copy the high 64 bits to the low 64 bits.

NEON has a lane dup, but it takes an int64x1_t and returns an int64x1_t.

int64x1_t   vdup_lane_s64(int64x1_t vec, __constrange(0,0) int lane);

The range also seems off since it seems like I should be able to select 1 or 2. (Maybe this is a misunderstanding on my part).

How do I copy the low 64 bits to the high 64 bits in an int64x2_t?


I'm not using the (high >> x) | (low << x) pattern suggested below. First, it's undefined behavior in C/C++ when x is 0. Second, the value should be in a NEON SIMD register, so I don't want to accidentally round-trip it through general-purpose registers. Third, GCC is not generating the code I hoped for, so I don't want to give GCC the opportunity to get slower.

jww
  • How would you do it using any standard types? Let's say you want to swap the high and low 16 bits of a 32-bit integer, how would you do that? What bitwise operations would you use? Are there similar intrinsic functions for `int64x2_t` and `int64x1_t`? – Some programmer dude May 10 '16 at 13:11
  • `int64x2_t vdupq_n_s64(int64_t value); // VMOV d0,r0,r0` – user3528438 May 10 '16 at 13:15
  • @JoachimPileborg - With bitops in C with `int64x2_t` (effectively a pointer) and a function like `vdup_lane_s64`, I would perform the cast to squash the compiler error. It's not clear to me if it's safe to do here. Hence the reason I'm looking for the "NEON way" of doing things. – jww May 10 '16 at 13:24
  • You would do `(high >> x) | (low << x)` regardless of types. – Lundin May 10 '16 at 13:45
  • @Lundin - if my variable is in a SIMD register, then I should probably use the intrinsic to ensure it's not moved from the SIMD coprocessor to general-purpose registers for the shift. It's the reason I'm looking for something lane related. – jww May 10 '16 at 14:04
  • Sounds like you need to write the code in assembler then, rather than C. – Lundin May 10 '16 at 14:06
  • `uint64x2_t vmovq_n_u64 (uint64_t)`+`uint64x1_t vget_low_u64 (uint64x2_t)` ? – EOF May 10 '16 at 14:09

1 Answer

There are (at least) two ways you can write it.

int64x2_t f(int64x1_t v)
{
    return vdupq_lane_s64(v, 0);
    // or
    // return vcombine_s64(v, v); // poor code with GCC
}

The input of vdupq_lane_s64 is a 64-bit vector, but the result is a 128-bit vector.
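
Since the question starts from an int64x2_t rather than an int64x1_t, one possible way to apply this is to take the wanted half out with vget_low_s64 or vget_high_s64 first and then duplicate it. The sketch below is an illustration, not something taken from the posts above: the names dup_low_s64, dup_high_s64, swap_halves_s64 and apply_lane_op are hypothetical, and the vextq_s64 swap is an extra variant the answer does not mention. The plain-C branch in apply_lane_op is only a stand-in so the sketch also builds on non-ARM hosts.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* {lo, hi} -> {lo, lo}: take the low 64-bit half, duplicate it into both lanes. */
static inline int64x2_t dup_low_s64(int64x2_t v)
{
    return vdupq_lane_s64(vget_low_s64(v), 0);
}

/* {lo, hi} -> {hi, hi}: take the high 64-bit half, duplicate it into both lanes. */
static inline int64x2_t dup_high_s64(int64x2_t v)
{
    return vdupq_lane_s64(vget_high_s64(v), 0);
}

/* {lo, hi} -> {hi, lo}: extract two lanes starting at lane 1 of the pair (v, v). */
static inline int64x2_t swap_halves_s64(int64x2_t v)
{
    return vextq_s64(v, v, 1);
}
#endif

enum lane_op { DUP_LOW, DUP_HIGH, SWAP_HALVES };

/* Hypothetical checking helper: it goes through memory only so that the
   result can be inspected. Real code would keep the int64x2_t in a register. */
static void apply_lane_op(enum lane_op op, const int64_t in[2], int64_t out[2])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int64x2_t v = vld1q_s64(in);
    switch (op) {
    case DUP_LOW:  v = dup_low_s64(v);     break;
    case DUP_HIGH: v = dup_high_s64(v);    break;
    default:       v = swap_halves_s64(v); break;
    }
    vst1q_s64(out, v);
#else
    /* Plain-C stand-in with the same lane semantics (assumption: non-ARM host). */
    const int64_t lo = in[0], hi = in[1];
    switch (op) {
    case DUP_LOW:  out[0] = lo; out[1] = lo; break;
    case DUP_HIGH: out[0] = hi; out[1] = hi; break;
    default:       out[0] = hi; out[1] = lo; break;
    }
#endif
}
```

With in = {1, 2}, DUP_LOW should give {1, 1}, DUP_HIGH {2, 2}, and SWAP_HALVES {2, 1}. Whether the compiler turns each helper into a single instruction depends on the compiler and target, so it is worth checking the generated assembly.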

Charles Baylis
  • ***"poor code with GCC..."*** - exactly! Clang does a much better job with NEON intrinsics. – jww May 10 '16 at 15:41
  • If you have simple examples like this that reproduce easily and give poor code generation, a bug report against GCC would be helpful. – James Greenhalgh May 11 '16 at 05:15