tokenmonster

package

v0.0.0-...-8a5ca26 Latest Latest Go to latest Published: Sep 1, 2024 License: MIT Imports: 20 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/Ravi-MSRI/tokenmonster

Links

Open Source Insights

README ¶

Click here for the complete documentation on pkg.go.dev.

Basic Usage

import "github.com/alasdairforsythe/tokenmonster/go"

func example() {

	vocab, err := tokenmonster.Load(vocabfilename)
	if err != nil {
		panic(err)
	}

	tokens, missing, err := vocab.Tokenize(text)
	if err != nil {
		panic(err)
	}
	
	decoder := vocab.NewDecoder()
	decoded_text := decoder.Decode(tokens)

}

missing is the number of bytes for which there were no tokens.

text must be a slice of bytes. If you are using UTF-16 encoding, that slice of bytes should be already UTF-16 encoded.

decoded_text will be also a slice of bytes in the charset encoding. If you are using UTF-8 encoding you can convert it to a string with string().

When using vocab.Tokenize(text) please note that if the vocabulary uses any normalizations other than NFD, the normalizations may be applied to the underlying text data. Therefore please pass a copy if you don't want the underlying data to be modified. This applies only to the Go package (the Python library always uses a copy.)

.

Documentation ¶

Constants ¶

View Source

const (
	DOES_NOT_EXIST = 16777215
)

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Decoder ¶

type Decoder struct {
	// contains filtered or unexported fields
}

A decoder object for sequential decoding. Use the NewDecoder function of the Vocab struct.

func (*Decoder) Decode ¶

func (d *Decoder) Decode(tokens []uint32) []byte

Decodes tokens IDs back into bytes.

func (*Decoder) DecodeSerialized ¶

func (d *Decoder) DecodeSerialized(b []byte, encodingLength uint8, buffer []byte) []byte

Decodes tokens from a serialized bytes slice. `encodingLength` must be one of: 0, 2, 3, 4. If you enter `encodingLength` 0 then it will determine the encoding length from the vocabulary size. `buffer` is optional, you can send it `nil` and it will allocate a new slice.

func (*Decoder) Deserialize ¶

func (d *Decoder) Deserialize(data []byte, encodingLength uint8) []uint32

Deserializes tokens encoded in a bytes stream into a slice of uint32 token IDs. `encodingLength` must be one of: 0, 2, 3, 4. If you enter `encodingLength` 0 then it will determine the encoding length from the vocabulary size.

func (*Decoder) Flush ¶

func (d *Decoder) Flush() []byte

Flushes the remainder from the Decoder instance These will any trailing incomplete UTF-8 sequences or capcode encoding marks

type Info ¶

type Info struct {
	Id           uint32
	Token        []byte
	TokenDecoded []byte
	Type         uint8 // 0 = regular, 1 = character, 2 = special, 3 = unk
	Score        float32
}

Info struct allows access to detailed information about each token from TokensDetailed(). Token is the token still encoded with capcode. TokenDecoded is the decoded form of the token, however the token can be modified by a previous token in a sequence so this cannot be used for decoding. Type is 0 for regular tokens, 1 for character tokens, 2 for special tokens, 3 for UNK token. The Score is the percentage of the training dataset that this token covered and is used for sorting the tokens by their importance.

type Vocab ¶

type Vocab struct {
	// contains filtered or unexported fields
}

The main struct for the vocabulary

func Load ¶

func Load(filename string) (*Vocab, error)

Load the vocabulary from a local file.

func NewVocab ¶

func NewVocab(tokens [][]byte, specialTokens [][]byte, charset uint8, normalization string, usingCapcode uint8, include256bytes bool, include128bytes bool, includeUTF8bytes bool, includeASCIIbytes bool, includeExtendedBytes bool, excludeOtherBytes bool) (*Vocab, error)

NewVocab makes a fresh vocabulary from a custom list of tokens. If you generated your vocabulary with TokenMonster tools, you will not be using this function but instead using `Load`.

func NewVocabFromYAML ¶

func NewVocabFromYAML(yml []byte) (*Vocab, error)

NewVocabFromYAML makes a fresh vocabulary from a YAML file.

func (*Vocab) AddSpecialToken ¶

func (vocab *Vocab) AddSpecialToken(token []byte)

Adds a single special token to the vocabulary. A special token is special because only this token is allowed to tokenize text containing this. If any regular tokens contain your special token within them, they will be deleted. Modifying a vocabulary does not change existing token IDs. All normalization and capcode is applied automatically.

func (*Vocab) AddSpecialTokens ¶

func (vocab *Vocab) AddSpecialTokens(specialTokens [][]byte, size int)

Add multiple special tokens and optionally resize. Enter `size` 0 to not resize. Modifying a vocabulary does not change existing token IDs.

func (*Vocab) AddToken ¶

func (vocab *Vocab) AddToken(token []byte)

Adds a single token to the vocabulary. Modifying a vocabulary does not change existing token IDs. All normalization and capcode is applied automatically.

func (*Vocab) AddTokens ¶

func (vocab *Vocab) AddTokens(addTokens [][]byte, specialTokens [][]byte, size int)

Adds multiple regular and optionally special tokens. You can use `size` to resize the vocabulary to keep it at a specific size. Enter `size` 0 to not resize. Modifying a vocabulary does not change existing token IDs.

func (*Vocab) Capcode ¶

func (vocab *Vocab) Capcode() uint8

The capcode level. 0 = disabled, 1 = deleteToken only, 2 = fully enabled.

func (*Vocab) Charset ¶

func (vocab *Vocab) Charset() uint8

The charset code for the vocabulary. 0 = None, 1 = UTF-8, 2 = UTF-16.

func (*Vocab) Count ¶

func (vocab *Vocab) Count(data []byte) (int, int, error)

Tokenizes but returns the number of tokens instead of the tokens.

func (*Vocab) Decode ¶

func (vocab *Vocab) Decode(tokens []uint32) []byte

Decodes tokens backs into bytes. If you are decoding a stream of tokens individually or in batches, instead of all at once, you should use the Decode method for the Decoder struct instead.

func (*Vocab) DecodeSerialized ¶

func (vocab *Vocab) DecodeSerialized(b []byte, encodingLength uint8, buffer []byte) []byte

Decodes tokens from a serialized bytes slice. `encodingLength` must be one of: 0, 2, 3, 4. If you enter `encodingLength` 0 then it will determine the encoding length from the vocabulary size. `buffer` is optional, you can send it `nil` and it will allocate a new slice. If you are decoding a stream of tokens individually or in batches, instead of all at once, you should use the Decode method for the Decoder struct instead.

func (*Vocab) DeleteToken ¶

func (vocab *Vocab) DeleteToken(token []byte)

Deletes a single token from the vocabulary. Tokens to delete can be capcoded encoded or not, it will look for both. Modifying a vocabulary does not change existing token IDs.

func (*Vocab) DeleteTokenID ¶

func (vocab *Vocab) DeleteTokenID(id uint32)

Deletes a single token from the vocabulary by specifying the ID. Modifying a vocabulary does not change existing token IDs.

func (*Vocab) DeleteTokens ¶

func (vocab *Vocab) DeleteTokens(deleteTokens [][]byte, size int)

Delete multiple tokens and optionally resize. Tokens to delete can be capcoded encoded or not, it will look for both. Enter `size` 0 to not resize. Modifying a vocabulary does not change existing token IDs.

func (*Vocab) Denormalize ¶

func (vocab *Vocab) Denormalize(b []byte) []byte

Decodes capcode from the bytes.

func (*Vocab) Deserialize ¶

func (vocab *Vocab) Deserialize(data []byte, encodingLength uint8) (tokens []uint32)

func (*Vocab) DisableUnkToken ¶

func (vocab *Vocab) DisableUnkToken()

Disables the UNK token. Without an UNK token, a character that has no token to represent it will be ignored.

func (*Vocab) EnableUnkToken ¶

func (vocab *Vocab) EnableUnkToken() bool

Enables the UNK token. Returns true if successful, returns false if an UNK token is not applicable to this vocabulary (all bytes have tokens). If enabled, UNK token will be inserted for every character for which there is no token. You can resize after this if you want to keep the vocabulary sized as it was before, otherwise it will be 1 larger.

func (*Vocab) ExportYAML ¶

func (vocab *Vocab) ExportYAML(writer io.Writer, orderByScore bool)

Exports the vocabulary to a human-readable YAML file. It writes to an io.Writer. You can import from YAML with NewVocabFromYAML().

func (*Vocab) HasUnk ¶

func (vocab *Vocab) HasUnk() bool

Returns true if the vocabulary is using the UNK token. If used, the UNK token ID is used whenever a character being tokenized doesn't exist in the vocabulary.

func (*Vocab) HighestTokenID ¶

func (vocab *Vocab) HighestTokenID() int

Returns the value of the highest token ID.

func (*Vocab) IdToToken ¶

func (vocab *Vocab) IdToToken(id uint32) []byte

Returns the encoded token for the token ID, or nil if it does not exist.

func (*Vocab) Len ¶

func (vocab *Vocab) Len() int

Returns number of tokens in the vocabulary, inluding UNK token if it is used.

func (*Vocab) MaxTokenLength ¶

func (vocab *Vocab) MaxTokenLength() int

The length of the longest (encoded) token in the vocabulary. This can be lower than that chosen during training if none of the longer tokens were chosen.

func (*Vocab) Mode ¶

func (vocab *Vocab) Mode() uint8

The original filter for training the vocabulary. 0 = unfiltered, 1 = clean, 2 = balanced, 3 = consistent, 4 = strict, 5 = not trained with trainvocab.

func (*Vocab) ModifyVocabulary ¶

func (vocab *Vocab) ModifyVocabulary(addTokens [][]byte, specialTokens [][]byte, deleteTokens [][]byte, size int, resetTokenIds bool)

Add regular & special tokens, delete tokens and resize, all in one. Modifying a vocabulary does not change existing token IDs. Pass resetTokenIds = true to ensure there are no gaps in the token IDs.

func (*Vocab) ModifyVocabularyFromYAML ¶

func (vocab *Vocab) ModifyVocabularyFromYAML(yml []byte, size int, resetTokenIds bool)

Add regular & special tokens, delete tokens and resize, all in one. Modifying a vocabulary does not change existing token IDs. Pass resetTokenIds = true to ensure there are no gaps in the token IDs.

func (*Vocab) NewDecoder ¶

func (vocab *Vocab) NewDecoder() *Decoder

Creates a new Decoder instance. This is for decoding tokens in a sequence when they are to be decoded individually or in batches. If you are decoding all in one go, you can use the Vocab's Decode method.

func (*Vocab) Normalization ¶

func (vocab *Vocab) Normalization() string

The type of normalization applied automatically when tokenizing. Returns a string.

func (*Vocab) NormalizationCode ¶

func (vocab *Vocab) NormalizationCode() uint8

The type of normalization applied automatically when tokenizing. Returns a uint8.

func (*Vocab) Normalize ¶

func (vocab *Vocab) Normalize(data []byte) ([]byte, error)

Applies all normalizations to the bytes, including capcode and NFD.

func (*Vocab) NumDeletedTokens ¶

func (vocab *Vocab) NumDeletedTokens() int

The number of tokens deleted from the vocabulary. These can be restored by resizing the vocabulary to be be larger.

func (*Vocab) NumSingleByteTokens ¶

func (vocab *Vocab) NumSingleByteTokens() int

The number of single byte tokens in the vocabulary.

func (*Vocab) NumSpecialTokens ¶

func (vocab *Vocab) NumSpecialTokens() int

Returns the number of special tokens in the vocabulary.

func (*Vocab) PrivateGenerateVocab ¶

func (vocab *Vocab) PrivateGenerateVocab(yamlData []byte, tokens [][]byte, scores []float32, addTokens [][]byte, deleteTokens [][]byte, specialTokens [][]byte, specialTokensEncoded [][]byte, charset uint8, normalizeString string, usingCapcode uint8, level uint8, reserve uint8, resize int, resetTokenIds bool) error

Don't use this function, it's exported because it's used by the exportvocab tool.

func (*Vocab) ResetTokenIds ¶

func (vocab *Vocab) ResetTokenIds(token []byte)

Resets all the IDs of the tokens to be assigned alphabetically, starting from 0, with no gaps.

func (*Vocab) Resize ¶

func (vocab *Vocab) Resize(size int)

Resize the vocabulary by deleting the worst scoring tokens. You can also resize the vocabulary to be larger if any tokens have previously been deleted. Modifying a vocabulary does not change existing token IDs.

func (Vocab) Save ¶

func (vocab Vocab) Save(outputFilename string) error

Save the vocabulary to local file.

func (Vocab) SaveWithMapping ¶

func (vocab Vocab) SaveWithMapping(outputFilename string, mapping []uint32) error

func (*Vocab) SingleByteTokens ¶

func (vocab *Vocab) SingleByteTokens() []byte

A slice that contains all the single byte tokens in the vocabulary. Note that this is returned as only a slice of bytes, not a slice of slice of bytes.

func (*Vocab) SingleBytesTrainingCode ¶

func (vocab *Vocab) SingleBytesTrainingCode() uint8

Returns the uint8 code corresponding to the training parameters for single byte tokens.

func (*Vocab) SpecialTokens ¶

func (vocab *Vocab) SpecialTokens() []Info

Returns the token IDs and the corresponding tokens of only the. Set `decode` to false to receive the decoded form of the tokens.

func (*Vocab) TokenToId ¶

func (vocab *Vocab) TokenToId(b []byte) (uint32, bool)

Returns the ID of the token from bytes. This only works for capcode encoded tokens. Apply `Normalize` to the bytes first to use this with decoded tokens.

func (*Vocab) Tokenize ¶

func (vocab *Vocab) Tokenize(data []byte) ([]uint32, int, error)

Tokenizes text from bytes slice to token IDs. The 2nd returned value (int) is the number of characters for which there were no tokens and were replaced with Unk token.

func (*Vocab) TokenizeToSerialized ¶

func (vocab *Vocab) TokenizeToSerialized(data []byte, encodingLength uint8, buffer []byte) ([]byte, uint8, int, error)

Tokenizes directly into serialized bytes with either 16-bit, 24-bit or 32-bit encoded unsigned integers depending on the vocabulary size. Set encodingLength to 0 for it to be chosen automatically, or set `encodingLength` to 2, 3 or 4. The 2rd return value is the encodingLength that was used, and the 3rd is the number of characters for which there were no tokens. `buffer` is an optional reusable buffer, you can send nil.

func (*Vocab) Tokens ¶

func (vocab *Vocab) Tokens() [][]byte

Returns a slice of all tokens in the vocabulary (excluding UNK), in their encoded capcode form.

func (*Vocab) TokensDetailed ¶

func (vocab *Vocab) TokensDetailed() []Info

Returns a slice of Info struct where the index is the Token ID

func (*Vocab) Unk ¶

func (vocab *Vocab) Unk() uint32

Returns the ID of the Unk token. It will return 16777215 if there is no Unk token. You can use HasUnk() to first check if there is an UNK token.

type YamlItem ¶

type YamlItem struct {
	Encoded bool    `yaml:"encoded,omitempty"`
	Token   string  `yaml:",omitempty"`
	Id      *int    `yaml:"id,omitempty"`
	Score   float32 `yaml:"score,omitempty"`
}

type YamlVocab ¶

type YamlVocab struct {
	Charset              string     `yaml:"charset,omitempty"`
	Normalization        string     `yaml:"normalization,omitempty"`
	Capcode              int        `yaml:"capcode,omitempty"`
	TrainingParam        *int       `yaml:"training-param,omitempty"`
	ResetTokenIds        bool       `yaml:"reset-token-ids,omitempty"`
	Include256Bytes      bool       `yaml:"include-256-bytes,omitempty"`
	Include128Bytes      bool       `yaml:"include-128-bytes,omitempty"`
	IncludeUtf8Bytes     bool       `yaml:"include-utf8-bytes,omitempty"`
	IncludeAsciiBytes    bool       `yaml:"include-ascii-bytes,omitempty"`
	IncludeExtendedBytes bool       `yaml:"include-extended-bytes,omitempty"`
	ExcludeOtherBytes    bool       `yaml:"exclude-other-bytes,omitempty"`
	Unk                  bool       `yaml:"unk,omitempty"`
	UnkId                *int       `yaml:"unk-id,omitempty"`
	Regular              []YamlItem `yaml:"tokens,omitempty"`
	Special              []YamlItem `yaml:"special,omitempty"`
	Delete               []YamlItem `yaml:"delete,omitempty"`
}

Source Files ¶

View all Source files

tokenmonster.go

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL