Documentation ¶
Overview ¶
Package bfloat16 is a trivial implementation of the bfloat16 type, based on https://github.com/x448/float16 and the pending issue https://github.com/x448/float16/issues/22.
In principle this is a temporary solution, but it will only be deprecated once a definitive solution is available.
Index ¶
Constants ¶
const SmallestNonzero = BFloat16(0x0001) // 9.1835e-41 (effectively 0x1p-126 * 0x1p-7)
SmallestNonzero is the smallest nonzero denormal value for bfloat16 (9.1835e-41). It's the bfloat16 equivalent of math.SmallestNonzeroFloat32 and math.SmallestNonzeroFloat64. For context, math.SmallestNonzeroFloat32 uses the formula 1 / 2**(127 - 1 + 23) to produce the smallest denormal value for float32 (1.401298464324817070923729583289916131280e-45). The equivalent formula for bfloat16, which keeps float32's exponent range but only 7 mantissa bits, is 1 / 2**(127 - 1 + 7). We use BFloat16(0x0001) so it compiles as a const.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type BFloat16 ¶
type BFloat16 uint16
The bfloat16 (brain floating point)[1][2] format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. It is a shortened (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32), with the intent of accelerating machine learning and near-sensor computing.
bfloat16 and patents:
- https://en.wikipedia.org/wiki/Tensor_Processing_Unit#Lawsuit
- https://www.reddit.com/r/MachineLearning/comments/193zpyi/d_does_patent_lawsuit_against_googles_tpu_imperil/
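Because bfloat16 is a truncated binary32, float32→bfloat16 conversion amounts to keeping the high 16 bits, optionally with rounding. The helper below is a hypothetical sketch of the common round-to-nearest-even bit trick; it is not necessarily how this package's FromFloat32 is implemented, and it omits the NaN handling a real conversion needs:

```go
package main

import (
	"fmt"
	"math"
)

// BFloat16 mirrors the package's type: the high 16 bits of a float32.
type BFloat16 uint16

// fromFloat32RNE is a hypothetical sketch of float32-to-bfloat16
// conversion with round-to-nearest-even on the 16 discarded bits.
// Caveat: a NaN whose mantissa bits carry over could round to Inf,
// so real code must special-case NaN first (omitted here for brevity).
func fromFloat32RNE(f float32) BFloat16 {
	b := math.Float32bits(f)
	// Rounding bias: 0x7FFF, plus the lowest kept bit to break ties to even.
	b += 0x7FFF + ((b >> 16) & 1)
	return BFloat16(b >> 16)
}

func main() {
	fmt.Printf("%#x\n", uint16(fromFloat32RNE(1.0)))  // 0x3f80
	fmt.Printf("%#x\n", uint16(fromFloat32RNE(-1.0))) // 0xbf80
}
```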
func FromFloat32 ¶
FromFloat32 converts a float32 to a BFloat16.
func FromFloat64 ¶
FromFloat64 converts a float64 to a BFloat16.
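One plausible way to implement such a conversion is to narrow the float64 to float32 first and then truncate to the high 16 bits. This is only a sketch under that assumption, not this package's actual FromFloat64; note that going through float32 can in principle double-round:

```go
package main

import (
	"fmt"
	"math"
)

// BFloat16 mirrors the package's type: the high 16 bits of a float32.
type BFloat16 uint16

// fromFloat64 is a hypothetical sketch: narrow to float32, then keep
// the top 16 bits (truncation, i.e. round toward zero on the mantissa).
func fromFloat64(f float64) BFloat16 {
	return BFloat16(math.Float32bits(float32(f)) >> 16)
}

func main() {
	// 1.5 is exactly representable in bfloat16, so truncation is exact.
	fmt.Printf("%#x\n", uint16(fromFloat64(1.5))) // 0x3fc0
}
```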