README

Sarracenia Markov Library

A high-performance, persistent Markov chain library for Go, backed by SQLite. Designed for production environments requiring reliable text generation, efficient storage of large datasets, and transactional safety.

Installation

go get github.com/CTAG07/Sarracenia/pkg/markov

Quick Start

package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"strings"

	"github.com/CTAG07/Sarracenia/pkg/markov"
	_ "modernc.org/sqlite" // Or github.com/mattn/go-sqlite3
)

func main() {
	// Open a database connection
	db, err := sql.Open("sqlite", "file:markov.db?_journal_mode=WAL&_busy_timeout=5000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Initialize Schema (Run once)
	if err := markov.SetupSchema(db); err != nil {
		log.Fatal(err)
	}

	// Create Generator
	gen, err := markov.NewGenerator(db, markov.NewDefaultTokenizer())
	if err != nil {
		log.Fatal(err)
	}
	defer gen.Close()

	ctx := context.Background()

	// Define a Model
	model := markov.ModelInfo{Name: "demo", Order: 2}
	_ = gen.InsertModel(ctx, model) // error ignored: the model may already exist on re-runs

	// Train
	corpus := "The quick brown fox jumps over the lazy dog."
	if err := gen.Train(ctx, model, strings.NewReader(corpus)); err != nil {
		log.Fatal(err)
	}

	// Generate
	text, err := gen.Generate(ctx, model, markov.WithMaxLength(10))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(text)
}

Advanced Usage

Streaming Generation

For real-time applications or large outputs, use GenerateStream to receive tokens via a channel as they are generated.

tokenChan, err := generator.GenerateStream(ctx, model, markov.WithMaxLength(50))
if err != nil {
    log.Fatal(err)
}

prev := ""
for token := range tokenChan {
    if token.EOC {
        break
    }
    // tokenizer is the Tokenizer that was passed to NewGenerator;
    // Separator takes the previous and current token texts.
    fmt.Print(tokenizer.Separator(prev, token.Text) + token.Text)
    prev = token.Text
}

Model Management

Models can be exported to JSON for backup or portability.

file, err := os.Create("model_backup.json")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

err = generator.ExportModel(ctx, model, file)
if err != nil {
    log.Fatal(err)
}

Concurrency Note

Training Limitation: Due to the write-intensive nature of Markov chain training and the single-writer locking model of SQLite, only one model can be trained at a time.

While read operations (generation) are concurrent and non-blocking in WAL mode, attempting to run multiple Train() jobs simultaneously on the same database file will likely result in SQLITE_BUSY (database locked) errors.
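
One simple way to stay within this constraint is to serialize training at the application level. A minimal sketch, assuming several goroutines share one Generator; the mutex and helper below are illustrative, not part of the library:

var trainMu sync.Mutex // serializes Train: SQLite permits only one writer at a time

func trainSafely(ctx context.Context, gen *markov.Generator, model markov.ModelInfo, r io.Reader) error {
	trainMu.Lock()
	defer trainMu.Unlock()
	return gen.Train(ctx, model, r)
}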

Benchmarks

Benchmarks performed on an Intel Core i9-13905H (Windows 11, Go 1.24.5) using a corpus from the Go standard library.

Generation Performance

| Benchmark | Time/Op | Mem/Op | Allocs/Op |
| --- | --- | --- | --- |
| Generate/Simple | 6.52 ms | 721 KB | 27,451 |
| GenerateStream/Simple | 6.88 ms | 695 KB | 26,466 |
| Generate/WithTopK | 7.19 ms | 747 KB | 28,499 |
| GenerateStream/WithTopK | 7.38 ms | 704 KB | 26,816 |
| Generate/WithTemp | 6.99 ms | 908 KB | 28,928 |

Training Performance

Note: (Order #) indicates the number of preceding tokens used as context.

| Benchmark | Time/Op | Throughput | Mem/Op | Allocs/Op |
| --- | --- | --- | --- | --- |
| Train (Order 1) | 451 ms | 0.56 MB/s | 62.4 MB | 1.74M |
| Train (Order 2) | 654 ms | 0.43 MB/s | 79.9 MB | 2.18M |
| Train (Order 3) | 1.06 s | 0.39 MB/s | 88.9 MB | 2.39M |
| Train (Order 4) | 1.07 s | 0.37 MB/s | 91.0 MB | 2.44M |
| Train (Order 5) | 1.14 s | 0.36 MB/s | 92.3 MB | 2.46M |
| VocabularyPrune | 2.03 ms | N/A | 366 KB | 6,475 |

Running Benchmarks
cd pkg/markov
go test -bench . -benchmem

Database Compatibility

This library is optimized for SQLite and relies on SQLite-specific SQL for performance. Porting to PostgreSQL or MySQL would require modifying the prepared statements in generator.go and train.go.

Documentation

Overview

Package markov provides a robust, high-performance, database-backed toolkit for creating, training, and using Markov chain models in Go.

It supports multiple models within a single SQLite database, offers advanced text generation features like temperature and top-K sampling, and includes a streaming API for real-time applications. The library is designed to be both powerful for production use and easy for experimentation.

For a complete usage example, see the README.md file.

Index

Constants

const (
	// SOCTokenID is the reserved ID for the Start-Of-Chain token.
	SOCTokenID = 0
	// EOCTokenID is the reserved ID for the End-Of-Chain token.
	EOCTokenID = 1
	// SOCTokenText is the reserved text for the Start-Of-Chain token.
	SOCTokenText = "<SOC>"
	// EOCTokenText is the reserved text for the End-Of-Chain token.
	EOCTokenText = "<EOC>"
)

Variables

This section is empty.

Functions

func SetupSchema

func SetupSchema(db *sql.DB) error

SetupSchema initializes the necessary tables and special vocabulary entries in the provided database. This function should be called once on a new database before any other operations are performed. It is idempotent and safe to call on an already-initialized database.

Types

type ChainToken

type ChainToken struct {
	Id   int
	Freq int
}

ChainToken represents a potential next token in a Markov chain, including its unique ID and its frequency of occurrence after a given prefix.

type DBStats

type DBStats struct {
	Models     []ModelInfo        // A list of models in the database
	Stats      map[int]ModelStats // A mapping of model ids to their stats
	VocabSize  int                // The number of unique tokens in all models' vocabularies
	PrefixSize int                // The number of unique prefixes in all models' chains
}

DBStats holds aggregated statistics for the entire database, including a list of all models and their individual stats.

type DefaultStreamTokenizer

type DefaultStreamTokenizer struct {
	// contains filtered or unexported fields
}

DefaultStreamTokenizer is the default implementation of the StreamTokenizer interface. It uses a bufio.Scanner and regular expressions to read and tokenize a stream.

func (*DefaultStreamTokenizer) Next

func (s *DefaultStreamTokenizer) Next() (*Token, error)

Next returns the next token from the stream. It returns a Token and a nil error on success. When the stream is exhausted, it returns a nil Token and io.EOF. Any other error indicates a problem reading from the underlying stream.

type DefaultTokenizer

type DefaultTokenizer struct {
	// contains filtered or unexported fields
}

DefaultTokenizer is a default implementation of the Tokenizer interface. It uses regular expressions to split text into words and punctuation, and identifies sentence-ending punctuation as End-Of-Chain (EOC) tokens. Its behavior can be customized with functional options.

func NewDefaultTokenizer

func NewDefaultTokenizer(opts ...Option) *DefaultTokenizer

NewDefaultTokenizer creates a new tokenizer with default settings, which can be overridden by providing one or more Option functions.
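
For example, a tokenizer customized via options (the values below are illustrative; the Option functions are documented later in this reference):

tokenizer := markov.NewDefaultTokenizer(
	markov.WithSeparator(" "),      // join tokens with a single space (the default)
	markov.WithEOC("!"),            // render End-Of-Chain tokens as "!" instead of "."
	markov.WithEOCRegex(`^[.!?]$`), // treat ., !, and ? as sentence endings (the default)
)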

func (*DefaultTokenizer) EOC

func (t *DefaultTokenizer) EOC(last string) string

EOC returns the configured end-of-chain replacement string.

func (*DefaultTokenizer) NewStream

func (t *DefaultTokenizer) NewStream(r io.Reader) StreamTokenizer

NewStream returns a stateful stream tokenizer for the given reader.

func (*DefaultTokenizer) Separator

func (t *DefaultTokenizer) Separator(_, next string) string

Separator returns the configured separator string.

type ExportedChain

type ExportedChain struct {
	PrefixID    int `json:"prefix_id"`
	NextTokenID int `json:"next_token_id"`
	Frequency   int `json:"frequency"`
}

ExportedChain is the serializable representation of a single link in a Markov chain, used within an ExportedModel.

type ExportedModel

type ExportedModel struct {
	Name       string          `json:"name"`
	Order      int             `json:"order"`
	Vocabulary map[string]int  `json:"vocabulary"` // token_text -> token_id
	Prefixes   map[string]int  `json:"prefixes"`   // prefix_text -> prefix_id
	Chains     []ExportedChain `json:"chains"`
}

ExportedModel is the serializable representation of a trained model, used for JSON-based import and export.

type GenerateOption

type GenerateOption func(*generateOptions)

GenerateOption is a function that configures generation parameters. It's used as a variadic argument in generation functions like Generate and GenerateStream.

func WithEarlyTermination

func WithEarlyTermination(canEnd bool) GenerateOption

WithEarlyTermination specifies whether the generation process can stop before reaching maxLength if an End-Of-Chain (EOC) token is generated.

func WithMaxLength

func WithMaxLength(n int) GenerateOption

WithMaxLength sets the maximum number of tokens to generate. The generation may stop earlier if an EOC token is chosen and WithEarlyTermination is enabled.

func WithTemperature

func WithTemperature(t float64) GenerateOption

WithTemperature adjusts the randomness of the token selection. A value of 1.0 is standard weighted random selection. Values > 1.0 increase randomness (making less frequent tokens more likely). Values < 1.0 decrease randomness (making more frequent tokens even more likely). A value of 0 or less results in deterministic selection (always choosing the most frequent token).

func WithTopK

func WithTopK(k int) GenerateOption

WithTopK restricts the token selection pool to the top `k` most frequent tokens at each step. A value of 0 disables Top-K sampling.
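
These options compose. A short example of a conservative generation run, reusing gen, ctx, and model from the README's Quick Start (parameter values are illustrative):

text, err := gen.Generate(ctx, model,
	markov.WithMaxLength(50),          // emit at most 50 tokens
	markov.WithTemperature(0.8),       // favor more frequent transitions
	markov.WithTopK(20),               // sample only from the 20 most frequent candidates
	markov.WithEarlyTermination(true), // allow stopping at an EOC token
)
if err != nil {
	log.Fatal(err)
}
fmt.Println(text)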

type Generator

type Generator struct {
	// contains filtered or unexported fields
}

Generator is the main entry point for interacting with the Markov chain library. It holds the database connection, a tokenizer, and prepared SQL statements for efficient database interaction.

func NewGenerator

func NewGenerator(db *sql.DB, tokenizer Tokenizer) (*Generator, error)

NewGenerator creates and returns a new Generator. It takes a database connection and a Tokenizer implementation. It pre-compiles all necessary SQL statements, returning an error if any preparation fails.

func (*Generator) Close

func (g *Generator) Close()

Close releases all prepared SQL statements held by the Generator. It should be called when the Generator is no longer needed to free up database resources.

func (*Generator) ExportModel

func (g *Generator) ExportModel(ctx context.Context, modelInfo ModelInfo, w io.Writer) error

ExportModel serializes a given model into a JSON format and writes it to the provided io.Writer. This is useful for backups or for transferring models.

func (*Generator) Generate

func (g *Generator) Generate(ctx context.Context, model ModelInfo, opts ...GenerateOption) (string, error)

Generate creates a new Markov chain, builds it into a single string, and returns it. It starts from a default initial state of Start-Of-Chain (SOC) tokens. Generation can be customized with GenerateOption functions.

func (*Generator) GenerateFromStream

func (g *Generator) GenerateFromStream(ctx context.Context, model ModelInfo, r io.Reader, opts ...GenerateOption) (string, error)

GenerateFromStream uses the content of an io.Reader as a seed for generation. The provided text is tokenized and used as the initial state of the chain, from which generation continues. An error is returned if a seed token is not found in the model's vocabulary.

func (*Generator) GenerateFromString

func (g *Generator) GenerateFromString(ctx context.Context, model ModelInfo, startText string, opts ...GenerateOption) (string, error)

GenerateFromString is a convenience wrapper around GenerateFromStream that uses a string as the seed. If the string is empty, it behaves identically to Generate.
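
A brief sketch of seeded generation, reusing gen, ctx, and model from the Quick Start (the seed text is illustrative and must tokenize to words in the model's vocabulary):

text, err := gen.GenerateFromString(ctx, model, "The quick", markov.WithMaxLength(20))
if err != nil {
	log.Fatal(err) // e.g. a seed token missing from the vocabulary
}
fmt.Println(text)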

func (*Generator) GenerateStream

func (g *Generator) GenerateStream(ctx context.Context, model ModelInfo, opts ...GenerateOption) (<-chan Token, error)

GenerateStream creates a new Markov chain and returns a read-only channel of Tokens. This allows for processing the generated text token-by-token, which is useful for real-time applications or when generating very long sequences. The channel will be closed once generation is complete or the context is cancelled.

func (*Generator) GenerateStreamFromStream

func (g *Generator) GenerateStreamFromStream(ctx context.Context, model ModelInfo, r io.Reader, opts ...GenerateOption) (<-chan Token, error)

GenerateStreamFromStream uses an io.Reader as a seed to begin a streaming generation. It tokenizes the content from r and uses it as the initial chain state.

func (*Generator) GenerateStreamFromString

func (g *Generator) GenerateStreamFromString(ctx context.Context, model ModelInfo, startText string, opts ...GenerateOption) (<-chan Token, error)

GenerateStreamFromString is a convenience wrapper around GenerateStreamFromStream that uses a string as the seed. If the string is empty, it behaves identically to GenerateStream.

func (*Generator) GetModelInfo

func (g *Generator) GetModelInfo(ctx context.Context, modelName string) (ModelInfo, error)

GetModelInfo retrieves the metadata for a single model specified by name. If multiple models are needed, GetModelInfos is more efficient.

func (*Generator) GetModelInfos

func (g *Generator) GetModelInfos(ctx context.Context) (map[string]ModelInfo, error)

GetModelInfos retrieves metadata for all models currently in the database, returning them in a map keyed by model name.

func (*Generator) GetNextTokens

func (g *Generator) GetNextTokens(ctx context.Context, model ModelInfo, prefix string) ([]ChainToken, int, error)

GetNextTokens retrieves all possible subsequent tokens for a given prefix key from a specific model. It returns a slice of ChainTokens, the sum of all their frequencies, and any error that occurred. If the prefix is not found, it returns a nil slice and a total frequency of 0.
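
For example, to inspect a model's transition distribution for one prefix (gen, ctx, and model as in the Quick Start; the prefix key shown is an assumption and must match how the model stores prefixes):

tokens, total, err := gen.GetNextTokens(ctx, model, "quick brown")
if err != nil {
	log.Fatal(err)
}
for _, t := range tokens {
	text, err := gen.VocabInt(ctx, t.Id) // resolve the token ID to its text
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%-12s %d/%d (%.1f%%)\n", text, t.Freq, total, 100*float64(t.Freq)/float64(total))
}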

func (*Generator) GetStats

func (g *Generator) GetStats(ctx context.Context) (*DBStats, error)

GetStats returns a snapshot of statistics for the entire database, including global counts and per-model stats.
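
A minimal sketch of reporting these statistics (gen and ctx as in the Quick Start):

stats, err := gen.GetStats(ctx)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("models: %d, vocab: %d tokens, prefixes: %d\n",
	len(stats.Models), stats.VocabSize, stats.PrefixSize)
for _, m := range stats.Models {
	s := stats.Stats[m.Id]
	fmt.Printf("  %s (order %d): %d chains, %d transitions\n",
		m.Name, m.Order, s.TotalChains, s.TotalFrequency)
}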

func (*Generator) ImportModel

func (g *Generator) ImportModel(ctx context.Context, r io.Reader) error

ImportModel reads a JSON representation of a model from an io.Reader and merges its data into the database. If the model name already exists, the new chain data is merged with the existing data (frequencies are added). If the model does not exist, it is created. The entire operation is transactional and handles re-mapping of vocabulary and prefix IDs.
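
The counterpart to the README's export example, assuming the same backup file name:

file, err := os.Open("model_backup.json")
if err != nil {
	log.Fatal(err)
}
defer file.Close()

if err := gen.ImportModel(ctx, file); err != nil {
	log.Fatal(err)
}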

func (*Generator) InsertModel

func (g *Generator) InsertModel(ctx context.Context, model ModelInfo) error

InsertModel creates a new model entry in the database.

func (*Generator) InsertToken

func (g *Generator) InsertToken(ctx context.Context, model ModelInfo, prefix string, token int) error

InsertToken provides a low-level way to insert or increment a single chain link (`prefix -> token`) for a given model. For most use cases, the high-level Train function is recommended as it is significantly more efficient for bulk data.

func (*Generator) PruneModel

func (g *Generator) PruneModel(ctx context.Context, model ModelInfo, minFreq int) error

PruneModel removes all chain links from a specific model that have a frequency less than or equal to `minFreq`. This is useful for reducing the size of a model by removing rare, and often noisy, transitions.

func (*Generator) RemoveModel

func (g *Generator) RemoveModel(ctx context.Context, model ModelInfo) error

RemoveModel deletes a model and all of its associated chain data from the database. The operation is performed within a transaction.

func (*Generator) SetLogger

func (g *Generator) SetLogger(logger *slog.Logger)

SetLogger sets the logger for the Generator. By default, all logs are discarded. Providing a `log/slog.Logger` will enable logging for training, generation, and other operations.
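
For example, to log to standard error with log/slog:

gen.SetLogger(slog.New(slog.NewTextHandler(os.Stderr, nil)))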

func (*Generator) SetTokenizer added in v1.1.0

func (g *Generator) SetTokenizer(tokenizer Tokenizer)

SetTokenizer sets the tokenizer used by the Generator for subsequent training and generation.

func (*Generator) Train

func (g *Generator) Train(ctx context.Context, model ModelInfo, data io.Reader) error

Train processes a stream of text from an io.Reader, tokenizes it, and uses it to train the specified Markov model. The training process is highly optimized, using in-memory caching and database batching to handle large datasets efficiently. The entire operation is performed within a single database transaction to ensure data integrity.
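
Because Train accepts any io.Reader, large corpora can be streamed straight from disk. A minimal sketch (the file name is illustrative; gen, ctx, and model as in the Quick Start):

f, err := os.Open("corpus.txt")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

if err := gen.Train(ctx, model, f); err != nil {
	log.Fatal(err)
}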

func (*Generator) VocabInt

func (g *Generator) VocabInt(ctx context.Context, id int) (string, error)

VocabInt looks up a token ID in the vocabulary and returns its corresponding text. It returns an error if the ID is not found.

func (*Generator) VocabStr

func (g *Generator) VocabStr(ctx context.Context, token string) (int, error)

VocabStr looks up a token string in the vocabulary and returns its corresponding ID. It returns an error if the token is not found.

func (*Generator) VocabularyPrune

func (g *Generator) VocabularyPrune(ctx context.Context, minFrequency int) error

VocabularyPrune performs a database-wide cleanup, removing tokens from the global vocabulary that are used less than `minFrequency` times across all models. This is a destructive operation that will also delete all chain links and prefixes that rely on the removed tokens. It should be used with caution to reduce the overall database size. Special tokens (<SOC>, <EOC>) are never pruned.
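
A typical maintenance pass might prune rare transitions per model and then compact the shared vocabulary. A sketch with illustrative thresholds:

// Drop this model's links with frequency <= 1.
if err := gen.PruneModel(ctx, model, 1); err != nil {
	log.Fatal(err)
}
// Then remove tokens used fewer than 2 times across all models.
if err := gen.VocabularyPrune(ctx, 2); err != nil {
	log.Fatal(err)
}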

type ModelInfo

type ModelInfo struct {
	Id    int
	Name  string
	Order int
}

ModelInfo holds the essential metadata for a Markov model, including its unique ID, name, and the order of the chain (the number of preceding tokens used to predict the next one).

type ModelStats

type ModelStats struct {
	TotalChains    int // The number of unique prefix->next_token links.
	TotalFrequency int // The sum of frequencies of all links; the total number of trained transitions.
	StartingTokens int // The number of unique tokens that can start a chain.
}

ModelStats holds aggregated statistics for a single Markov model.

type Option

type Option func(*DefaultTokenizer)

Option is a function that configures a DefaultTokenizer.

func WithEOC added in v1.1.0

func WithEOC(eoc string) Option

WithEOC sets the string to use in the final output for an EOC token. Default: "."

func WithEOCExcRegex added in v1.1.0

func WithEOCExcRegex(eocRegex string) Option

WithEOCExcRegex sets the regex string to use when deciding whether to add an EOC token after the last token.

func WithEOCRegex added in v1.1.0

func WithEOCRegex(eocRegex string) Option

WithEOCRegex sets the regex string to use when deciding whether a token is an EOC token or not. Default: `^[.!?]$`

func WithSeparator

func WithSeparator(sep string) Option

WithSeparator sets the separator string used to join tokens during generation. Default: " "

func WithSeparatorExcRegex added in v1.1.0

func WithSeparatorExcRegex(splitExcRegex string) Option

WithSeparatorExcRegex sets the regex string to use when deciding whether to add a separator before a token.

func WithSeparatorRegex added in v1.1.0

func WithSeparatorRegex(splitRegex string) Option

WithSeparatorRegex sets the regex string to use when splitting input text. Default: `[\w']+|[.,!?;]`

type StreamTokenizer

type StreamTokenizer interface {
	// Next returns the next token from the stream. It returns io.EOF as the
	// error when the stream is fully consumed.
	Next() (*Token, error)
}

StreamTokenizer is an interface for a stateful tokenizer that processes a stream of data, returning one token at a time.

type Token

type Token struct {
	Text string
	EOC  bool
}

Token represents a single tokenized unit of text. It contains the text itself and a boolean flag indicating if it marks the end of a chain (e.g., a sentence).

type Tokenizer

type Tokenizer interface {
	// NewStream returns a stateful StreamTokenizer for processing an io.Reader.
	NewStream(io.Reader) StreamTokenizer
	// Separator returns the string that should be used to join tokens
	// when building a final generated string, using the previous and current
	// tokens.
	Separator(prev, current string) string
	// EOC returns the string representation for an End-Of-Chain token
	// in the final generated output, using the last token in the sequence.
	EOC(last string) string
}

Tokenizer is an interface that defines the contract for splitting input text into tokens. This allows the core generator logic to be independent of the specific tokenization strategy.
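
A minimal custom implementation that splits purely on whitespace; this is a sketch of the interface contract (it additionally requires the bufio, io, and strings imports), not production code:

type wordTokenizer struct{}

func (wordTokenizer) NewStream(r io.Reader) markov.StreamTokenizer {
	sc := bufio.NewScanner(r)
	sc.Split(bufio.ScanWords) // yield whitespace-delimited words
	return &wordStream{sc: sc}
}

func (wordTokenizer) Separator(prev, current string) string { return " " }

func (wordTokenizer) EOC(last string) string { return "." }

type wordStream struct{ sc *bufio.Scanner }

// Next returns the next word, flagging words that end with a period as
// End-Of-Chain tokens. It returns io.EOF once the stream is exhausted.
func (s *wordStream) Next() (*markov.Token, error) {
	if !s.sc.Scan() {
		if err := s.sc.Err(); err != nil {
			return nil, err
		}
		return nil, io.EOF
	}
	word := s.sc.Text()
	eoc := strings.HasSuffix(word, ".")
	return &markov.Token{Text: strings.TrimSuffix(word, "."), EOC: eoc}, nil
}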
