
Is it possible to perform half-precision floating-point arithmetic on Intel chips?

I know how to load/store/convert half-precision floating-point numbers [1] but I do not know how to add/multiply them without converting to single-precision floating-point numbers.

[1] https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats

Peter Cordes
Kadir

2 Answers


related: https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture - has some info about BFloat16 in Cooper Lake and Sapphire Rapids, and some non-Intel info.


Is it possible to perform half-precision floating-point arithmetic on Intel chips?

Yes, apparently the on-chip GPU in Skylake and later has hardware support for FP16 and FP64, as well as FP32. With new enough drivers you can use it via OpenCL.

On earlier chips you get about the same throughput for FP16 vs. FP32 (probably just converting on the fly for nearly free), but on SKL / KBL chips you get about double the throughput of FP32 for GPGPU Mandelbrot (note the log-scale on the Mpix/s axis of the chart in that link).

The gain in FP64 (double) performance was huge, too.


But on the IA cores (Intel-Architecture) no; even with AVX512 there's no hardware support for anything but converting them to single-precision. This saves memory bandwidth and can certainly give you a speedup if your code bottlenecks on memory. But it doesn't gain in peak FLOPS for code that's not bottlenecked on memory.

You could of course implement software floating point, possibly even in SIMD registers, so technically the answer is still "yes" to the question you asked, but it won't be faster than using the F16C VCVTPH2PS / VCVTPS2PH instructions + packed-single vmulps / vfmadd132ps HW support.

Use HW-supported SIMD conversion to/from float / __m256 in x86 code to trade extra ALU conversion work for reduced memory bandwidth and cache footprint. But if cache-blocking (e.g. for well-tuned dense matmul) or very high computational intensity means you're not memory bottlenecked, then just use float and save on ALU operations.


Upcoming: bfloat16 (Brain Float) and AVX512 BF16

A new 16-bit FP format with the same exponent range as IEEE binary32 has been developed for neural-network use-cases. Compared to IEEE binary16 (the format the x86 F16C conversion instructions use), it has much less significand precision, but apparently neural-network code cares more about dynamic range from a large exponent range. This also lets bfloat16 hardware skip support for subnormals entirely.

Some upcoming Intel x86 CPU cores will have HW support for this format. The main use-case is still dedicated neural-network accelerators (Nervana) and GPGPU-type devices, but HW-supported conversion at least is very useful.

https://en.wikichip.org/wiki/brain_floating-point_format has more details, specifically that Cooper Lake Xeon and Core X CPUs are expected to support AVX512 BF16.

I haven't seen it mentioned for Ice Lake (Sunny Cove microarch). That could go either way; I wouldn't care to guess.

Intel® Architecture Instruction Set Extensions and Future Features Programming Reference revision -036 in April 2019 added details about BF16, including that it's slated for "Future, Cooper Lake". Once it's released, the documentation for the instructions will move to the main vol.2 ISA ref manual (and the pdf->HTML scrape at https://www.felixcloutier.com/x86/index.html).

https://github.com/HJLebbink/asm-dude/wiki has instructions from vol.2 and the future-extensions manual, so you can already find it there.

There are only 3 instructions: conversion to/from float, and a BF16 multiply + pairwise-accumulate into float. (First horizontal step of a dot-product.) So AVX512 BF16 does finally provide true computation for 16-bit floating point, but only in this very limited form that converts the result to float.

The BF16 instructions also ignore MXCSR, always using the default rounding mode and DAZ/FTZ, and never setting any exception flags.

The two-source instructions don't support memory fault-suppression (when using masking with a memory source operand), presumably because the masking is per destination element, and there's a different number of source elements. The single-source conversion to BF16 apparently can suppress memory faults, because the same mask can apply to the 32-bit source elements as to the 16-bit destination elements.

  • VCVTNE2PS2BF16 [xyz]mm1{k1}{z}, [xyz]mm2, [xyz]mm3/m512/m32bcst
    ConVerT (No Exceptions) 2 registers of Packed Single 2(to) BF16.
    __m512bh _mm512_cvtne2ps_pbh (__m512, __m512);

  • VCVTNEPS2BF16 [xy]mm1{k1}{z}, [xyz]mm2/m512/m32bcst
    ConVerT (No Exceptions) Packed Single 2(to) BF16. (Narrowing, so the destination is half the width of the source; this is the single-source conversion that can suppress memory faults.)
    __m256bh _mm512_cvtneps_pbh (__m512);

  • VDPBF16PS [xyz]mm1{k1}{z}, [xyz]mm2, [xyz]mm3/m512/m32bcst
    Dot Product of BF16 Pairs Accumulated into Packed Single Precision
    __m512 _mm512_dpbf16_ps(__m512, __m512bh, __m512bh); (Notice that even the unmasked version has a 3rd input for the destination accumulator, like an FMA).

    # the key part of the Operation section:
    t ← src2.dword[ i ]   (or src2.dword[0] for a broadcast memory source)
    srcdest.fp32[ i ] += make_fp32( src1.bfloat16[2*i+1] ) * make_fp32( t.bfloat16[1] )
    srcdest.fp32[ i ] += make_fp32( src1.bfloat16[2*i+0] ) * make_fp32( t.bfloat16[0] )


So we still don't get native 16-bit FP math that you can use for arbitrary things while keeping your data in 16-bit format for 32 elements per vector. Only FMA into 32-bit accumulators.


BTW, there are other real-number formats that aren't based on the IEEE-754 structure of fixed-width fields for sign/exponent/significand. One that's gaining popularity is Posit. https://en.wikipedia.org/wiki/Unum_(number_format), Beating Floating Point at its Own Game: Posit Arithmetic, and https://posithub.org/about

Instead of spending the whole significand coding space on NaNs, they use it for tapered / gradual overflow, supporting larger range. (And removing NaN simplifies the HW). IEEE floats only support gradual underflow (with subnormals), with hard overflow to +-Inf. (Which is usually an error/problem in real numerical simulations, not much different from NaN.)

The Posit encoding is sort of a variable width exponent, leaving more precision near 1.0. The goal is to allow using 32-bit or 16-bit precision in more cases (instead of 64 or 32) while still getting useful results for scientific computing / HPC, such as climate modeling. Double the work per SIMD vector, and half the memory bandwidth.

There have been some paper designs for Posit FPU hardware, but it's still early days; I think only FPGA implementations have really been built. Some Intel CPUs will come with on-board FPGAs (or maybe that's already a thing).

As of mid-2019 I haven't read about any Posit execution units as part of a commercial CPU design, and google didn't find anything.

Peter Cordes
  • Zooming into the Mandelbrot set with half-precision is not going to go very deep. Using perturbation, the limitation moves from the significand to the exponent. The minimum exponent of half-precision is 2^-14, so you could zoom to about 10^-5 at twice the speed of single precision, which can zoom to about 10^-38 with perturbation. Double goes down to 10^-324, and x87 long double down to 10^−4951. That's the only case I know of where x87 is still useful. Double-double and quad precision do not help because they don't change the exponent range. – Z boson Apr 30 '18 at 09:02
  • @Zboson: GPU mandelbrot is presumably not about zooming or being useful, but rather just a well-known and simple problem with very high computational intensity / low memory bandwidth. (And a data dependency chain which could limit ILP). That page had some other benchmarks, too, but I like Mandelbrot. – Peter Cordes Apr 30 '18 at 09:06
  • Peter, just in case you know, is there a performance benefit in loading/storing half floats to/from AVX units, while still processing in full float precision, assuming large matrix multiplication as the most common example? To a first-order approximation this seems beneficial, as it essentially halves cache use and memory bandwidth. If you feel that it's worth a full answer in itself, not a short update, I'd be happy to post a separate Q. – kkm Mar 25 '19 at 21:18
  • @kkm: With proper cache-blocking (aka loop tiling), dense matmul isn't memory bound. It's ALU bound, and spending uops on f16 conversion would take cycles on the FMA ports. (And / or front-end bandwidth would be a problem, too, if you can't use a memory-source operand for FMA). In a badly optimized matmul that loads input data into L2 or L1d cache more than once, f16 might be an improvement. But with O(n^3) ALU work over O(n^2) data, it's generally possible to keep memory bandwidth down to O(n^2). – Peter Cordes Mar 25 '19 at 21:22
  • Thank you! I admit most of this stuff flies way over my head (this is why I am using MKL for blas :) ), but the takeaway is the answer to my question is no. – kkm Mar 26 '19 at 01:31
  • 2
    @PeterCordes: Interesting. The [Anandtech article](https://www.anandtech.com/show/14179/intel-manual-updates-bfloat16-for-cooper-lake-xeon-scalable-only), and the [Intel document](https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf), suggest that BF16 only has conversion instructions and dot products. – wim Jun 28 '19 at 09:30
  • @wim: Thanks, fixed. I hadn't looked up the manual yet and was being overly optimistic. :/ – Peter Cordes Jun 29 '19 at 22:05
  • I'm not familiar with neural network algorithms, but probably it is possible to express the majority of such computations in terms of `VDPBF16PS`, without needing too much adds or muls. Nice addition. – wim Jun 30 '19 at 08:13

If you are using all cores, I would think that in many cases you are still memory-bandwidth bound, so half-precision floating point would be a win.

user3290232