Documentation ¶
Overview ¶
Package markov provides a robust, high-performance, database-backed toolkit for creating, training, and using Markov chain models in Go.
It supports multiple models within a single SQLite database, offers advanced text generation features like temperature and top-K sampling, and includes a streaming API for real-time applications. The library is designed to be both powerful for production use and easy for experimentation.
For a complete usage example, see the README.md file.
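The sketch below shows one possible end-to-end run. It is illustrative, not authoritative: the package import path, the exact NewGenerator signature, and the ModelInfo field names (Name, Order) are assumptions not spelled out on this page, as is the choice of SQLite driver.

    package main

    import (
        "context"
        "database/sql"
        "fmt"
        "log"
        "strings"

        markov "example.com/markov" // assumed import path
        _ "modernc.org/sqlite"      // any registered SQLite driver works
    )

    func main() {
        ctx := context.Background()

        db, err := sql.Open("sqlite", "models.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Initialize tables and the special vocabulary entries (idempotent).
        if err := markov.SetupSchema(db); err != nil {
            log.Fatal(err)
        }

        // NewGenerator's exact signature is assumed here.
        gen, err := markov.NewGenerator(db, markov.NewDefaultTokenizer())
        if err != nil {
            log.Fatal(err)
        }
        defer gen.Close()

        // ModelInfo field names are assumed: an order-2 model named "demo".
        model := markov.ModelInfo{Name: "demo", Order: 2}
        if err := gen.InsertModel(ctx, model); err != nil {
            log.Fatal(err)
        }
        // Re-read so the ModelInfo carries its database-assigned ID.
        model, err = gen.GetModelInfo(ctx, "demo")
        if err != nil {
            log.Fatal(err)
        }

        // Train from any io.Reader.
        if err := gen.Train(ctx, model, strings.NewReader("the quick brown fox jumps over the lazy dog.")); err != nil {
            log.Fatal(err)
        }

        out, err := gen.Generate(ctx, model, markov.WithMaxLength(20))
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(out)
    }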
Index ¶
- Constants
- func SetupSchema(db *sql.DB) error
- type ChainToken
- type DBStats
- type DefaultStreamTokenizer
- type DefaultTokenizer
- type ExportedChain
- type ExportedModel
- type GenerateOption
- type Generator
- func (g *Generator) Close()
- func (g *Generator) ExportModel(ctx context.Context, modelInfo ModelInfo, w io.Writer) error
- func (g *Generator) Generate(ctx context.Context, model ModelInfo, opts ...GenerateOption) (string, error)
- func (g *Generator) GenerateFromStream(ctx context.Context, model ModelInfo, r io.Reader, opts ...GenerateOption) (string, error)
- func (g *Generator) GenerateFromString(ctx context.Context, model ModelInfo, startText string, opts ...GenerateOption) (string, error)
- func (g *Generator) GenerateStream(ctx context.Context, model ModelInfo, opts ...GenerateOption) (<-chan Token, error)
- func (g *Generator) GenerateStreamFromStream(ctx context.Context, model ModelInfo, r io.Reader, opts ...GenerateOption) (<-chan Token, error)
- func (g *Generator) GenerateStreamFromString(ctx context.Context, model ModelInfo, startText string, opts ...GenerateOption) (<-chan Token, error)
- func (g *Generator) GetModelInfo(ctx context.Context, modelName string) (ModelInfo, error)
- func (g *Generator) GetModelInfos(ctx context.Context) (map[string]ModelInfo, error)
- func (g *Generator) GetNextTokens(ctx context.Context, model ModelInfo, prefix string) ([]ChainToken, int, error)
- func (g *Generator) GetStats(ctx context.Context) (*DBStats, error)
- func (g *Generator) ImportModel(ctx context.Context, r io.Reader) error
- func (g *Generator) InsertModel(ctx context.Context, model ModelInfo) error
- func (g *Generator) InsertToken(ctx context.Context, model ModelInfo, prefix string, token int) error
- func (g *Generator) PruneModel(ctx context.Context, model ModelInfo, minFreq int) error
- func (g *Generator) RemoveModel(ctx context.Context, model ModelInfo) error
- func (g *Generator) SetLogger(logger *slog.Logger)
- func (g *Generator) SetTokenizer(tokenizer Tokenizer)
- func (g *Generator) Train(ctx context.Context, model ModelInfo, data io.Reader) error
- func (g *Generator) VocabInt(ctx context.Context, id int) (string, error)
- func (g *Generator) VocabStr(ctx context.Context, token string) (int, error)
- func (g *Generator) VocabularyPrune(ctx context.Context, minFrequency int) error
- type ModelInfo
- type ModelStats
- type Option
- type StreamTokenizer
- type Token
- type Tokenizer
Constants ¶
const (
    // SOCTokenID is the reserved ID for the Start-Of-Chain token.
    SOCTokenID = 0
    // EOCTokenID is the reserved ID for the End-Of-Chain token.
    EOCTokenID = 1
    // SOCTokenText is the reserved text for the Start-Of-Chain token.
    SOCTokenText = "<SOC>"
    // EOCTokenText is the reserved text for the End-Of-Chain token.
    EOCTokenText = "<EOC>"
)
Variables ¶
This section is empty.
Functions ¶
func SetupSchema ¶
SetupSchema initializes the necessary tables and special vocabulary entries in the provided database. This function should be called once on a new database before any other operations are performed. It is idempotent and safe to call on an already-initialized database.
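A minimal sketch, assuming the package is imported as markov and an SQLite driver such as modernc.org/sqlite is registered under the driver name "sqlite":

    db, err := sql.Open("sqlite", "markov.db")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Safe on both new and already-initialized databases.
    if err := markov.SetupSchema(db); err != nil {
        log.Fatal(err)
    }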
Types ¶
type ChainToken ¶
ChainToken represents a potential next token in a Markov chain, including its unique ID and its frequency of occurrence after a given prefix.
type DBStats ¶
type DBStats struct {
Models []ModelInfo // A list of models in the database
Stats map[int]ModelStats // A mapping of model ids to their stats
VocabSize int // The number of unique tokens in all models' vocabularies
PrefixSize int // The number of unique prefixes in all models' chains
}
DBStats holds aggregated statistics for the entire database, including a list of all models and their individual stats.
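A short sketch that prints the fields above (package assumed to be imported as markov):

    // printDBStats prints a brief summary of every model in the database.
    func printDBStats(ctx context.Context, gen *markov.Generator) error {
        stats, err := gen.GetStats(ctx)
        if err != nil {
            return err
        }
        fmt.Printf("models=%d vocab=%d prefixes=%d\n", len(stats.Models), stats.VocabSize, stats.PrefixSize)
        for id, ms := range stats.Stats {
            fmt.Printf("  model %d: chains=%d transitions=%d starts=%d\n",
                id, ms.TotalChains, ms.TotalFrequency, ms.StartingTokens)
        }
        return nil
    }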
type DefaultStreamTokenizer ¶
type DefaultStreamTokenizer struct {
// contains filtered or unexported fields
}
DefaultStreamTokenizer is the default implementation of the StreamTokenizer interface. It uses a bufio.Scanner and regular expressions to read and tokenize a stream.
func (*DefaultStreamTokenizer) Next ¶
func (s *DefaultStreamTokenizer) Next() (*Token, error)
Next returns the next token from the stream. It returns a Token and a nil error on success. When the stream is exhausted, it returns a nil Token and io.EOF. Any other error indicates a problem reading from the underlying stream.
type DefaultTokenizer ¶
type DefaultTokenizer struct {
// contains filtered or unexported fields
}
DefaultTokenizer is a default implementation of the Tokenizer interface. It uses regular expressions to split text into words and punctuation, and identifies sentence-ending punctuation as End-Of-Chain (EOC) tokens. Its behavior can be customized with functional options.
func NewDefaultTokenizer ¶
func NewDefaultTokenizer(opts ...Option) *DefaultTokenizer
NewDefaultTokenizer creates a new tokenizer with default settings, which can be overridden by providing one or more Option functions.
func (*DefaultTokenizer) EOC ¶
func (t *DefaultTokenizer) EOC(last string) string
EOC returns the configured end-of-chain replacement string.
func (*DefaultTokenizer) NewStream ¶
func (t *DefaultTokenizer) NewStream(r io.Reader) StreamTokenizer
NewStream returns a stateful StreamTokenizer that reads and tokenizes the provided io.Reader.
func (*DefaultTokenizer) Separator ¶
func (t *DefaultTokenizer) Separator(_, next string) string
Separator returns the configured separator string.
type ExportedChain ¶
type ExportedChain struct {
PrefixID int `json:"prefix_id"`
NextTokenID int `json:"next_token_id"`
Frequency int `json:"frequency"`
}
ExportedChain is the serializable representation of a single link in a Markov chain, used within an ExportedModel.
type ExportedModel ¶
type ExportedModel struct {
Name string `json:"name"`
Order int `json:"order"`
Vocabulary map[string]int `json:"vocabulary"` // token_text -> token_id
Prefixes map[string]int `json:"prefixes"` // prefix_text -> prefix_id
Chains []ExportedChain `json:"chains"`
}
ExportedModel is the serializable representation of a trained model, used for JSON-based import and export.
type GenerateOption ¶
type GenerateOption func(*generateOptions)
GenerateOption is a function that configures generation parameters. It's used as a variadic argument in generation functions like Generate and GenerateStream.
func WithEarlyTermination ¶
func WithEarlyTermination(canEnd bool) GenerateOption
WithEarlyTermination specifies whether the generation process can stop before reaching maxLength if an End-Of-Chain (EOC) token is generated.
func WithMaxLength ¶
func WithMaxLength(n int) GenerateOption
WithMaxLength sets the maximum number of tokens to generate. The generation may stop earlier if an EOC token is chosen and WithEarlyTermination is enabled.
func WithTemperature ¶
func WithTemperature(t float64) GenerateOption
WithTemperature adjusts the randomness of the token selection. A value of 1.0 is standard weighted random selection. Values > 1.0 increase randomness (making less frequent tokens more likely). Values < 1.0 decrease randomness (making more frequent tokens even more likely). A value of 0 or less results in deterministic selection (always choosing the most frequent token).
func WithTopK ¶
func WithTopK(k int) GenerateOption
WithTopK restricts the token selection pool to the top `k` most frequent tokens at each step. A value of 0 disables Top-K sampling.
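A hedged example combining these options with Generate; model is a ModelInfo obtained earlier from GetModelInfo:

    // generateSample draws one string with custom sampling parameters.
    func generateSample(ctx context.Context, gen *markov.Generator, model markov.ModelInfo) (string, error) {
        return gen.Generate(ctx, model,
            markov.WithMaxLength(50),          // stop after at most 50 tokens
            markov.WithTemperature(0.8),       // below 1.0: favor more frequent transitions
            markov.WithTopK(10),               // sample only from the 10 most frequent candidates
            markov.WithEarlyTermination(true), // allow stopping early at an EOC token
        )
    }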
type Generator ¶
type Generator struct {
// contains filtered or unexported fields
}
Generator is the main entry point for interacting with the Markov chain library. It holds the database connection, a tokenizer, and prepared SQL statements for efficient database interaction.
func NewGenerator ¶
NewGenerator creates and returns a new Generator. It takes a database connection and a Tokenizer implementation. It pre-compiles all necessary SQL statements, returning an error if any preparation fails.
func (*Generator) Close ¶
func (g *Generator) Close()
Close releases all prepared SQL statements held by the Generator. It should be called when the Generator is no longer needed to free up database resources.
func (*Generator) ExportModel ¶
ExportModel serializes a given model into a JSON format and writes it to the provided io.Writer. This is useful for backups or for transferring models.
func (*Generator) Generate ¶
func (g *Generator) Generate(ctx context.Context, model ModelInfo, opts ...GenerateOption) (string, error)
Generate creates a new Markov chain, builds it into a single string, and returns it. It starts from a default initial state of Start-Of-Chain (SOC) tokens. Generation can be customized with GenerateOption functions.
func (*Generator) GenerateFromStream ¶
func (g *Generator) GenerateFromStream(ctx context.Context, model ModelInfo, r io.Reader, opts ...GenerateOption) (string, error)
GenerateFromStream uses the content of an io.Reader as a seed for generation. The provided text is tokenized and used as the initial state of the chain, from which generation continues. An error is returned if a seed token is not found in the model's vocabulary.
func (*Generator) GenerateFromString ¶
func (g *Generator) GenerateFromString(ctx context.Context, model ModelInfo, startText string, opts ...GenerateOption) (string, error)
GenerateFromString is a convenience wrapper around GenerateFromStream that uses a string as the seed. If the string is empty, it behaves identically to Generate.
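For illustration, a seeded generation sketch, assuming ctx, gen, and model are set up as in the overview example; every seed token must already exist in the model's vocabulary or an error is returned:

    out, err := gen.GenerateFromString(ctx, model, "the quick", markov.WithMaxLength(30))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(out) // continues the chain from the seed tokens "the quick"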
func (*Generator) GenerateStream ¶
func (g *Generator) GenerateStream(ctx context.Context, model ModelInfo, opts ...GenerateOption) (<-chan Token, error)
GenerateStream creates a new Markov chain and returns a read-only channel of Tokens. This allows for processing the generated text token-by-token, which is useful for real-time applications or when generating very long sequences. The channel will be closed once generation is complete or the context is cancelled.
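A sketch of consuming the channel; Token's field names are not listed on this page, so each token is printed with %+v:

    // streamTokens prints tokens as they are generated. The channel is closed
    // when generation finishes or ctx is cancelled.
    func streamTokens(ctx context.Context, gen *markov.Generator, model markov.ModelInfo) error {
        tokens, err := gen.GenerateStream(ctx, model, markov.WithMaxLength(100))
        if err != nil {
            return err
        }
        for tok := range tokens {
            fmt.Printf("%+v\n", tok) // Token field names are not shown in this document
        }
        return nil
    }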
func (*Generator) GenerateStreamFromStream ¶
func (g *Generator) GenerateStreamFromStream(ctx context.Context, model ModelInfo, r io.Reader, opts ...GenerateOption) (<-chan Token, error)
GenerateStreamFromStream uses an io.Reader as a seed to begin a streaming generation. It tokenizes the content from r and uses it as the initial chain state.
func (*Generator) GenerateStreamFromString ¶
func (g *Generator) GenerateStreamFromString(ctx context.Context, model ModelInfo, startText string, opts ...GenerateOption) (<-chan Token, error)
GenerateStreamFromString is a convenience wrapper around GenerateStreamFromStream that uses a string as the seed. If the string is empty, it behaves identically to GenerateStream.
func (*Generator) GetModelInfo ¶
GetModelInfo retrieves the metadata for a single model specified by name. If multiple models are needed, GetModelInfos is more efficient.
func (*Generator) GetModelInfos ¶
GetModelInfos retrieves metadata for all models currently in the database, returning them in a map keyed by model name.
func (*Generator) GetNextTokens ¶
func (g *Generator) GetNextTokens(ctx context.Context, model ModelInfo, prefix string) ([]ChainToken, int, error)
GetNextTokens retrieves all possible subsequent tokens for a given prefix key from a specific model. It returns a slice of ChainTokens, the sum of all their frequencies, and any error that occurred. If the prefix is not found, it returns a nil slice and a total frequency of 0.
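An inspection sketch; note that the exact prefix key format is not specified on this page, so the assumption below (the text of the preceding tokens) may need adjusting:

    // inspectPrefix lists the possible continuations of a prefix key.
    func inspectPrefix(ctx context.Context, gen *markov.Generator, model markov.ModelInfo, prefix string) error {
        next, total, err := gen.GetNextTokens(ctx, model, prefix)
        if err != nil {
            return err
        }
        if next == nil {
            fmt.Println("prefix not found")
            return nil
        }
        fmt.Printf("%d candidates, total frequency %d\n", len(next), total)
        for _, ct := range next {
            fmt.Printf("  %+v\n", ct) // ChainToken field names are not listed here
        }
        return nil
    }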
func (*Generator) GetStats ¶
GetStats returns a snapshot of statistics for the entire database, including global counts and per-model stats.
func (*Generator) ImportModel ¶
ImportModel reads a JSON representation of a model from an io.Reader and merges its data into the database. If the model name already exists, the new chain data is merged with the existing data (frequencies are added). If the model does not exist, it is created. The entire operation is transactional and handles re-mapping of vocabulary and prefix IDs.
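A sketch of a JSON round trip that copies a model from one Generator into another, using only the signatures listed in the index:

    // copyModel exports a model from src and merges it into dst.
    func copyModel(ctx context.Context, src, dst *markov.Generator, name string) error {
        info, err := src.GetModelInfo(ctx, name)
        if err != nil {
            return err
        }
        var buf bytes.Buffer
        if err := src.ExportModel(ctx, info, &buf); err != nil {
            return err
        }
        return dst.ImportModel(ctx, &buf)
    }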
func (*Generator) InsertModel ¶
InsertModel creates a new model entry in the database.
func (*Generator) InsertToken ¶
func (g *Generator) InsertToken(ctx context.Context, model ModelInfo, prefix string, token int) error
InsertToken provides a low-level way to insert or increment a single chain link (`prefix -> token`) for a given model. For most use cases, the high-level Train function is recommended as it is significantly more efficient for bulk data.
func (*Generator) PruneModel ¶
PruneModel removes all chain links from a specific model that have a frequency less than or equal to `minFreq`. This is useful for reducing the size of a model by removing rare, and often noisy, transitions.
func (*Generator) RemoveModel ¶
RemoveModel deletes a model and all of its associated chain data from the database. The operation is performed within a transaction.
func (*Generator) SetLogger ¶
SetLogger sets the logger for the Generator. By default, all logs are discarded. Providing a `log/slog.Logger` will enable logging for training, generation, and other operations.
func (*Generator) SetTokenizer ¶ added in v1.1.0
SetTokenizer sets the Tokenizer used by the Generator for training and generation.
func (*Generator) Train ¶
Train processes a stream of text from an io.Reader, tokenizes it, and uses it to train the specified Markov model. The training process is highly optimized, using in-memory caching and database batching to handle large datasets efficiently. The entire operation is performed within a single database transaction to ensure data integrity.
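A sketch of streaming a text file into an existing model:

    // trainFromFile streams the contents of a text file into the given model.
    func trainFromFile(ctx context.Context, gen *markov.Generator, model markov.ModelInfo, path string) error {
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        return gen.Train(ctx, model, f)
    }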
func (*Generator) VocabInt ¶
VocabInt looks up a token ID in the vocabulary and returns its corresponding text. It returns an error if the ID is not found.
func (*Generator) VocabStr ¶
VocabStr looks up a token string in the vocabulary and returns its corresponding ID. It returns an error if the token is not found.
func (*Generator) VocabularyPrune ¶
VocabularyPrune performs a database-wide cleanup, removing tokens from the global vocabulary that are used less than `minFrequency` times across all models. This is a destructive operation that will also delete all chain links and prefixes that rely on the removed tokens. It should be used with caution to reduce the overall database size. Special tokens (<SOC>, <EOC>) are never pruned.
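A maintenance sketch combining the two pruning calls; both are destructive, so the thresholds here are illustrative only:

    // compact removes rare transitions from one model, then drops globally rare
    // tokens (and their chains) from the shared vocabulary.
    func compact(ctx context.Context, gen *markov.Generator, model markov.ModelInfo) error {
        if err := gen.PruneModel(ctx, model, 1); err != nil { // drop links with frequency <= 1
            return err
        }
        return gen.VocabularyPrune(ctx, 2) // drop tokens used fewer than 2 times across all models
    }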
type ModelInfo ¶
ModelInfo holds the essential metadata for a Markov model, including its unique ID, name, and the order of the chain (the number of preceding tokens used to predict the next one).
type ModelStats ¶
type ModelStats struct {
TotalChains int // The number of unique prefix->next_token links.
TotalFrequency int // The sum of frequencies of all links; the total number of trained transitions.
StartingTokens int // The number of unique tokens that can start a chain.
}
ModelStats holds aggregated statistics for a single Markov model.
type Option ¶
type Option func(*DefaultTokenizer)
Option is a function that configures a DefaultTokenizer.
func WithEOC ¶ added in v1.1.0
WithEOC sets the string to use in final output for an EOC token. Default: "."
func WithEOCExcRegex ¶ added in v1.1.0
WithEOCExcRegex sets the regex string to use when deciding whether to add an EOC token after the last token.
func WithEOCRegex ¶ added in v1.1.0
WithEOCRegex sets the regex string to use when deciding whether a token is an EOC token or not. Default: `^[.!?]$`
func WithSeparator ¶
WithSeparator sets the separator string used to join tokens during generation. Default: " "
func WithSeparatorExcRegex ¶ added in v1.1.0
WithSeparatorExcRegex sets the regex string to use when deciding whether to add a separator before a token.
func WithSeparatorRegex ¶ added in v1.1.0
WithSeparatorRegex sets the regex string to use when splitting input text. Default: `[\w']+|[.,!?;]`
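The option constructors above are listed without their signatures on this page; the sketch below assumes each takes a single string argument:

    // newCustomTokenizer builds a DefaultTokenizer with non-default output rules.
    func newCustomTokenizer() *markov.DefaultTokenizer {
        return markov.NewDefaultTokenizer(
            markov.WithSeparator(" "),       // join tokens with a single space
            markov.WithEOC("!"),             // render EOC tokens as "!" in output
            markov.WithEOCRegex(`^[.!?]+$`), // treat runs of terminal punctuation as EOC
        )
    }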
type StreamTokenizer ¶
type StreamTokenizer interface {
// Next returns the next token from the stream. It returns io.EOF as the
// error when the stream is fully consumed.
Next() (*Token, error)
}
StreamTokenizer is an interface for a stateful tokenizer that processes a stream of data, returning one token at a time.
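A sketch of draining a stream until io.EOF, usable with any Tokenizer implementation:

    // drainTokens reads every token from r via the tokenizer's stream interface.
    func drainTokens(t markov.Tokenizer, r io.Reader) error {
        st := t.NewStream(r)
        for {
            tok, err := st.Next()
            if errors.Is(err, io.EOF) {
                return nil // stream exhausted
            }
            if err != nil {
                return err
            }
            fmt.Printf("%+v\n", tok) // Token field names are not shown in this document
        }
    }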
type Token ¶
Token represents a single tokenized unit of text. It contains the text itself and a boolean flag indicating if it marks the end of a chain (e.g., a sentence).
type Tokenizer ¶
type Tokenizer interface {
// NewStream returns a stateful StreamTokenizer for processing an io.Reader.
NewStream(io.Reader) StreamTokenizer
// Separator returns the string that should be used to join tokens
// when building a final generated string, using the previous and current
// tokens.
Separator(prev, current string) string
// EOC returns the string representation for an End-Of-Chain token
// in the final generated output, using the last token in the sequence.
EOC(last string) string
}
Tokenizer is an interface that defines the contract for splitting input text into tokens. This allows the core generator logic to be independent of the specific tokenization strategy.
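As an illustration of the contract, a hypothetical custom tokenizer can delegate stream tokenization to DefaultTokenizer while overriding how generated output is joined and terminated:

    // customTokenizer reuses DefaultTokenizer for reading input but joins
    // generated tokens with " / " and renders End-Of-Chain tokens as "?!".
    type customTokenizer struct {
        inner *markov.DefaultTokenizer
    }

    func (c customTokenizer) NewStream(r io.Reader) markov.StreamTokenizer { return c.inner.NewStream(r) }
    func (c customTokenizer) Separator(prev, current string) string        { return " / " }
    func (c customTokenizer) EOC(last string) string                       { return "?!" }

It could then be installed on a Generator with gen.SetTokenizer(customTokenizer{inner: markov.NewDefaultTokenizer()}).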