bfloat16

package
v0.6.4 Latest
Warning

This package is not in the latest version of its module.

Published: Apr 13, 2025 License: Apache-2.0 Imports: 2 Imported by: 4

Documentation

Overview

Package bfloat16 is a trivial implementation of the bfloat16 type, based on https://github.com/x448/float16 and the pending issue in https://github.com/x448/float16/issues/22.

In principle, this is a temporary solution, but it will only be deprecated once a definitive solution is available.

Index

Constants

const SmallestNonzero = BFloat16(0x0001) // 9.1835e-41 (effectively 0x1p-126 * 0x1p-7)

SmallestNonzero is the smallest nonzero denormal value for bfloat16 (9.1835e-41). It is the bfloat16 equivalent of math.SmallestNonzeroFloat32 and math.SmallestNonzeroFloat64. For context, math.SmallestNonzeroFloat32 uses the formula 1 / 2**(127 - 1 + 23) to produce the smallest denormal value for float32 (1.401298464324817070923729583289916131280e-45). The equivalent formula for bfloat16, which keeps the float32 exponent bias of 127 but has only 7 fraction bits, is 1 / 2**(127 - 1 + 7). We use BFloat16(0x0001) so that it compiles as a const.
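The 9.1835e-41 figure can be cross-checked with the standard library alone. The sketch below assumes the usual bfloat16 layout (the upper 16 bits of a float32, so 8 exponent bits and 7 fraction bits); the helpers bitsToFloat32 and smallestFromFormula are hypothetical names used for illustration, not part of this package.

```go
package main

import (
	"fmt"
	"math"
)

// smallestNonzeroBits is the bit pattern used for SmallestNonzero.
const smallestNonzeroBits uint16 = 0x0001

// bitsToFloat32 widens a bfloat16 bit pattern to float32 by placing it in
// the upper 16 bits of a float32 word (hypothetical helper).
func bitsToFloat32(b uint16) float32 {
	return math.Float32frombits(uint32(b) << 16)
}

// smallestFromFormula evaluates 1 / 2**(127 - 1 + 7), the bfloat16 analogue
// of the float32 formula 1 / 2**(127 - 1 + 23).
func smallestFromFormula() float32 {
	return float32(math.Ldexp(1, -(127 - 1 + 7)))
}

func main() {
	fromBits := bitsToFloat32(smallestNonzeroBits)
	// Both routes should yield the same float32 value, 2**-133.
	fmt.Println(fromBits == smallestFromFormula())
	fmt.Printf("%g\n", fromBits)
}
```

Because 2**-133 is still representable as a float32 denormal, the comparison is exact rather than approximate.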

Variables

This section is empty.

Functions

This section is empty.

Types

type BFloat16

type BFloat16 uint16

BFloat16 (brain floating point)[1][2] floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a shortened (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing.
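As an illustration of the "shortened binary32" layout described above, the following sketch splits a 16-bit pattern into its sign (1 bit), biased exponent (8 bits), and fraction (7 bits). The function fields is a hypothetical helper written for this example and does not exist in the package.

```go
package main

import "fmt"

// fields splits a bfloat16 bit pattern into its three components:
// 1 sign bit, 8 biased-exponent bits, and 7 fraction bits.
// Hypothetical helper for illustration only.
func fields(b uint16) (sign, exponent, fraction uint16) {
	sign = b >> 15
	exponent = (b >> 7) & 0xFF
	fraction = b & 0x7F
	return sign, exponent, fraction
}

func main() {
	// 0x3F80 is the upper half of the float32 encoding of 1.0 (0x3F800000).
	s, e, f := fields(0x3F80)
	fmt.Printf("sign=%d exponent=%d fraction=%d\n", s, e, f)

	// 0xC000 is the upper half of the float32 encoding of -2.0 (0xC0000000).
	s, e, f = fields(0xC000)
	fmt.Printf("sign=%d exponent=%d fraction=%d\n", s, e, f)
}
```

The exponent field uses the same bias of 127 as float32, which is why bfloat16 covers roughly the same dynamic range with far fewer significant digits.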

bfloat16 and patents:

func FromBits

func FromBits(uint16 uint16) BFloat16

FromBits converts a uint16 to a BFloat16.

func FromFloat32

func FromFloat32(x float32) BFloat16

FromFloat32 converts a float32 to a BFloat16.
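This page does not state which rounding mode FromFloat32 applies. As a hedged sketch of the two common ways to narrow a float32 to 16 bits, the code below shows plain truncation and round-to-nearest-even; truncate and roundNearestEven are hypothetical names, and neither is claimed to be what this package does internally.

```go
package main

import (
	"fmt"
	"math"
)

// truncate keeps the upper 16 bits of the float32 and drops the rest
// (rounds toward zero). Hypothetical helper.
func truncate(x float32) uint16 {
	return uint16(math.Float32bits(x) >> 16)
}

// roundNearestEven rounds the dropped 16 bits to the nearest bfloat16,
// breaking ties toward an even low bit. Hypothetical helper.
func roundNearestEven(x float32) uint16 {
	bits := math.Float32bits(x)
	if x != x {
		// NaN: keep it a NaN by forcing a fraction bit instead of rounding.
		return uint16(bits>>16) | 0x0040
	}
	// Add 0x7FFF, plus 1 when the kept low bit is odd, then drop 16 bits.
	bias := uint32(0x7FFF) + (bits>>16)&1
	return uint16((bits + bias) >> 16)
}

func main() {
	x := float32(1.005859375) // 1 + 2**-8 + 2**-9, not exact in bfloat16
	fmt.Printf("truncate:         0x%04X\n", truncate(x))
	fmt.Printf("roundNearestEven: 0x%04X\n", roundNearestEven(x))
}
```

Truncation is the simplest option and matches a "trivial" implementation in spirit, while round-to-nearest-even halves the worst-case conversion error at the cost of a few extra integer operations.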

func FromFloat64

func FromFloat64(x float64) BFloat16

FromFloat64 converts a float64 to a BFloat16.

func Inf

func Inf(sign int) BFloat16

Inf returns a BFloat16 with an infinity value of the specified sign. A sign >= 0 returns positive infinity. A sign < 0 returns negative infinity.
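The sign convention mirrors math.Inf in the standard library. The sketch below shows how the two infinities look at the bit level, assuming the standard bfloat16 encoding (all exponent bits set, fraction zero); infBits and isInf are hypothetical helpers for illustration, not this package's implementation.

```go
package main

import (
	"fmt"
	"math"
)

const (
	posInfBits uint16 = 0x7F80 // sign 0, exponent all ones, fraction 0
	negInfBits uint16 = 0xFF80 // sign 1, exponent all ones, fraction 0
)

// infBits returns the bfloat16 bit pattern for infinity, following the same
// convention as math.Inf: sign >= 0 is +Inf, sign < 0 is -Inf.
func infBits(sign int) uint16 {
	if sign >= 0 {
		return posInfBits
	}
	return negInfBits
}

// isInf widens the pattern to float32 and defers to math.IsInf.
func isInf(b uint16, sign int) bool {
	return math.IsInf(float64(math.Float32frombits(uint32(b)<<16)), sign)
}

func main() {
	fmt.Println(isInf(infBits(1), 1), isInf(infBits(-1), -1))
}
```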

func (BFloat16) Bits

func (f BFloat16) Bits() uint16

Bits converts a BFloat16 to a uint16.

func (BFloat16) Float32

func (f BFloat16) Float32() float32

func (BFloat16) String

func (f BFloat16) String() string

String implements fmt.Stringer, and prints a float representation of the BFloat16.
