Documentation ¶
Overview ¶
Package bfloat16 is a trivial implementation of the bfloat16 type, based on https://github.com/x448/float16 and the pending issue https://github.com/x448/float16/issues/22.
In principle this is a temporary solution, but it will only be deprecated once a definitive solution is available.
Index ¶
Constants ¶
const SmallestNonzero = BFloat16(0x0001) // 9.1835e-41 (effectively 0x1p-126 * 0x1p-7)
SmallestNonzero is the smallest nonzero denormal value for bfloat16 (9.1835e-41). It's the bfloat16 equivalent of math.SmallestNonzeroFloat32 and math.SmallestNonzeroFloat64. For context, math.SmallestNonzeroFloat32 uses the formula 1 / 2**(127 - 1 + 23) to produce the smallest denormal value for float32 (1.401298464324817070923729583289916131280e-45). The equivalent formula for bfloat16, which keeps float32's exponent range but only 7 mantissa bits, is 1 / 2**(127 - 1 + 7). We use BFloat16(0x0001) so it compiles as a const.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type BFloat16 ¶
type BFloat16 uint16
The bfloat16 (brain floating point)[1][2] format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. It is a shortened (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32), with the intent of accelerating machine learning and near-sensor computing.
bfloat16 and patents:
- https://en.wikipedia.org/wiki/Tensor_Processing_Unit#Lawsuit
- https://www.reddit.com/r/MachineLearning/comments/193zpyi/d_does_patent_lawsuit_against_googles_tpu_imperil/
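Because bfloat16 is a truncated binary32, float32→bfloat16 conversion amounts to keeping the high 16 bits, optionally with rounding. The helper below is a hypothetical sketch of the common round-to-nearest-even bit trick; it is not necessarily how this package's FromFloat32 is implemented, and it omits the NaN handling a real conversion needs:

```go
package main

import (
	"fmt"
	"math"
)

// BFloat16 mirrors the package's type: the high 16 bits of a float32.
type BFloat16 uint16

// fromFloat32RNE is a hypothetical sketch of float32-to-bfloat16
// conversion with round-to-nearest-even on the 16 discarded bits.
// Caveat: a NaN whose mantissa bits carry over could round to Inf,
// so real code must special-case NaN first (omitted here for brevity).
func fromFloat32RNE(f float32) BFloat16 {
	b := math.Float32bits(f)
	// Rounding bias: 0x7FFF, plus the lowest kept bit to break ties to even.
	b += 0x7FFF + ((b >> 16) & 1)
	return BFloat16(b >> 16)
}

func main() {
	fmt.Printf("%#x\n", uint16(fromFloat32RNE(1.0)))  // 0x3f80
	fmt.Printf("%#x\n", uint16(fromFloat32RNE(-1.0))) // 0xbf80
}
```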
func FromFloat32 ¶
FromFloat32 converts a float32 to a BFloat16.
func FromFloat64 ¶
FromFloat64 converts a float64 to a BFloat16.
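One plausible way to implement such a conversion is to narrow the float64 to float32 first and then truncate to the high 16 bits. This is only a sketch under that assumption, not this package's actual FromFloat64; note that going through float32 can in principle double-round:

```go
package main

import (
	"fmt"
	"math"
)

// BFloat16 mirrors the package's type: the high 16 bits of a float32.
type BFloat16 uint16

// fromFloat64 is a hypothetical sketch: narrow to float32, then keep
// the top 16 bits (truncation, i.e. round toward zero on the mantissa).
func fromFloat64(f float64) BFloat16 {
	return BFloat16(math.Float32bits(float32(f)) >> 16)
}

func main() {
	// 1.5 is exactly representable in bfloat16, so truncation is exact.
	fmt.Printf("%#x\n", uint16(fromFloat64(1.5))) // 0x3fc0
}
```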