README

Sarracenia Markov Library

A high-performance, persistent Markov chain library for Go, backed by SQLite. Designed for production environments requiring reliable text generation, efficient storage of large datasets, and transactional safety.

Installation

go get github.com/CTAG07/Sarracenia/pkg/markov

Quick Start

package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"strings"

	"github.com/CTAG07/Sarracenia/pkg/markov"
	_ "modernc.org/sqlite" // Or github.com/mattn/go-sqlite3
)

func main() {
	// Open a database connection
	db, err := sql.Open("sqlite", "file:markov.db?_journal_mode=WAL&_busy_timeout=5000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Initialize Schema (Run once)
	if err := markov.SetupSchema(db); err != nil {
		log.Fatal(err)
	}

	// Create Generator
	gen, err := markov.NewGenerator(db, markov.NewDefaultTokenizer())
	if err != nil {
		log.Fatal(err)
	}
	defer gen.Close()

	ctx := context.Background()

	// Define a Model
	model := markov.ModelInfo{Name: "demo", Order: 2}
	_ = gen.InsertModel(ctx, model) // error ignored: the model may already exist on re-runs

	// Train
	corpus := "The quick brown fox jumps over the lazy dog."
	if err := gen.Train(ctx, model, strings.NewReader(corpus)); err != nil {
		log.Fatal(err)
	}

	// Generate
	text, err := gen.Generate(ctx, model, markov.WithMaxLength(10))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(text)
}

Advanced Usage

Streaming Generation

For real-time applications or large outputs, use GenerateStream to receive tokens via a channel as they are generated.

tokenChan, err := generator.GenerateStream(ctx, model, markov.WithMaxLength(50))
if err != nil {
    log.Fatal(err)
}

prev := ""
for token := range tokenChan {
    if token.EOC {
        break
    }
    // tokenizer is the Tokenizer that was passed to NewGenerator;
    // Separator takes the previous and current token texts.
    fmt.Print(tokenizer.Separator(prev, token.Text) + token.Text)
    prev = token.Text
}

Model Management

Models can be exported to JSON for backup or portability.

file, err := os.Create("model_backup.json")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

err = generator.ExportModel(ctx, model, file)
if err != nil {
    log.Fatal(err)
}

Concurrency Note

Training Limitation: Due to the write-intensive nature of Markov chain training and the single-writer locking model of SQLite, only one model can be trained at a time.

While read operations (generation) are concurrent and non-blocking in WAL mode, attempting to run multiple Train() jobs simultaneously on the same database file will likely result in SQLITE_BUSY (database locked) errors.
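
One simple way to stay within this constraint is to serialize training at the application level. A minimal sketch, assuming several goroutines share one Generator; the mutex and helper below are illustrative, not part of the library:

var trainMu sync.Mutex // serializes Train: SQLite permits only one writer at a time

func trainSafely(ctx context.Context, gen *markov.Generator, model markov.ModelInfo, r io.Reader) error {
	trainMu.Lock()
	defer trainMu.Unlock()
	return gen.Train(ctx, model, r)
}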

Benchmarks

Benchmarks performed on an Intel Core i9-13905H (Windows 11, Go 1.24.5) using a corpus from the Go standard library.

Generation Performance

| Benchmark | Time/Op | Mem/Op | Allocs/Op |
| --- | --- | --- | --- |
| Generate/Simple | 6.52 ms | 721 KB | 27,451 |
| GenerateStream/Simple | 6.88 ms | 695 KB | 26,466 |
| Generate/WithTopK | 7.19 ms | 747 KB | 28,499 |
| GenerateStream/WithTopK | 7.38 ms | 704 KB | 26,816 |
| Generate/WithTemp | 6.99 ms | 908 KB | 28,928 |

Training Performance

Note: (Order #) indicates the number of preceding tokens used as context.

| Benchmark | Time/Op | Throughput | Mem/Op | Allocs/Op |
| --- | --- | --- | --- | --- |
| Train (Order 1) | 451 ms | 0.56 MB/s | 62.4 MB | 1.74M |
| Train (Order 2) | 654 ms | 0.43 MB/s | 79.9 MB | 2.18M |
| Train (Order 3) | 1.06 s | 0.39 MB/s | 88.9 MB | 2.39M |
| Train (Order 4) | 1.07 s | 0.37 MB/s | 91.0 MB | 2.44M |
| Train (Order 5) | 1.14 s | 0.36 MB/s | 92.3 MB | 2.46M |
| VocabularyPrune | 2.03 ms | N/A | 366 KB | 6,475 |

Running Benchmarks
cd pkg/markov
go test -bench . -benchmem

Database Compatibility

This library is optimized for SQLite and relies on SQLite-specific SQL for performance. Porting to PostgreSQL or MySQL would require modifying the prepared statements in generator.go and train.go.

Documentation

Overview

Package markov provides a robust, high-performance, database-backed toolkit for creating, training, and using Markov chain models in Go.

It supports multiple models within a single SQLite database, offers advanced text generation features like temperature and top-K sampling, and includes a streaming API for real-time applications. The library is designed to be both powerful for production use and easy for experimentation.

For a complete usage example, see the README.md file.

Index

Constants

const (
	// SOCTokenID is the reserved ID for the Start-Of-Chain token.
	SOCTokenID = 0
	// EOCTokenID is the reserved ID for the End-Of-Chain token.
	EOCTokenID = 1
	// SOCTokenText is the reserved text for the Start-Of-Chain token.
	SOCTokenText = "<SOC>"
	// EOCTokenText is the reserved text for the End-Of-Chain token.
	EOCTokenText = "<EOC>"
)

Variables

This section is empty.

Functions

func SetupSchema

func SetupSchema(db *sql.DB) error

SetupSchema initializes the necessary tables and special vocabulary entries in the provided database. This function should be called once on a new database before any other operations are performed. It is idempotent and safe to call on an already-initialized database.

Types

type ChainToken

type ChainToken struct {
	Id   int
	Freq int
}

ChainToken represents a potential next token in a Markov chain, including its unique ID and its frequency of occurrence after a given prefix.

type DBStats

type DBStats struct {
	Models     []ModelInfo        // A list of models in the database
	Stats      map[int]ModelStats // A mapping of model ids to their stats
	VocabSize  int                // The number of unique tokens in all models' vocabularies
	PrefixSize int                // The number of unique prefixes in all models' chains
}

DBStats holds aggregated statistics for the entire database, including a list of all models and their individual stats.

type DefaultStreamTokenizer

type DefaultStreamTokenizer struct {
	// contains filtered or unexported fields
}

DefaultStreamTokenizer is the default implementation of the StreamTokenizer interface. It uses a bufio.Scanner and regular expressions to read and tokenize a stream.

func (*DefaultStreamTokenizer) Next

func (s *DefaultStreamTokenizer) Next() (*Token, error)

Next returns the next token from the stream. It returns a Token and a nil error on success. When the stream is exhausted, it returns a nil Token and io.EOF. Any other error indicates a problem reading from the underlying stream.

type DefaultTokenizer

type DefaultTokenizer struct {
	// contains filtered or unexported fields
}

DefaultTokenizer is a default implementation of the Tokenizer interface. It uses regular expressions to split text into words and punctuation, and identifies sentence-ending punctuation as End-Of-Chain (EOC) tokens. Its behavior can be customized with functional options.

func NewDefaultTokenizer

func NewDefaultTokenizer(opts ...Option) *DefaultTokenizer

NewDefaultTokenizer creates a new tokenizer with default settings, which can be overridden by providing one or more Option functions.
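
For example, a tokenizer customized via options (the values below are illustrative; the Option functions are documented later in this reference):

tokenizer := markov.NewDefaultTokenizer(
	markov.WithSeparator(" "),      // join tokens with a single space (the default)
	markov.WithEOC("!"),            // render End-Of-Chain tokens as "!" instead of "."
	markov.WithEOCRegex(`^[.!?]$`), // treat ., !, and ? as sentence endings (the default)
)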

func (*DefaultTokenizer) EOC

func (t *DefaultTokenizer) EOC(last string) string

EOC returns the configured end-of-chain replacement string.

func (*DefaultTokenizer) NewStream

func (t *DefaultTokenizer) NewStream(r io.Reader) StreamTokenizer

NewStream returns a stateful stream tokenizer for the given reader.

func (*DefaultTokenizer) Separator

func (t *DefaultTokenizer) Separator(_, next string) string

Separator returns the configured separator string.

type ExportedChain

type ExportedChain struct {
	PrefixID    int `json:"prefix_id"`
	NextTokenID int `json:"next_token_id"`
	Frequency   int `json:"frequency"`
}

ExportedChain is the serializable representation of a single link in a Markov chain, used within an ExportedModel.

type ExportedModel

type ExportedModel struct {
	Name       string          `json:"name"`
	Order      int             `json:"order"`
	Vocabulary map[string]int  `json:"vocabulary"` // token_text -> token_id
	Prefixes   map[string]int  `json:"prefixes"`   // prefix_text -> prefix_id
	Chains     []ExportedChain `json:"chains"`
}

ExportedModel is the serializable representation of a trained model, used for JSON-based import and export.

type GenerateOption

type GenerateOption func(*generateOptions)

GenerateOption is a function that configures generation parameters. It's used as a variadic argument in generation functions like Generate and GenerateStream.

func WithEarlyTermination

func WithEarlyTermination(canEnd bool) GenerateOption

WithEarlyTermination specifies whether the generation process can stop before reaching maxLength if an End-Of-Chain (EOC) token is generated.

func WithMaxLength

func WithMaxLength(n int) GenerateOption

WithMaxLength sets the maximum number of tokens to generate. The generation may stop earlier if an EOC token is chosen and WithEarlyTermination is enabled.

func WithTemperature

func WithTemperature(t float64) GenerateOption

WithTemperature adjusts the randomness of the token selection. A value of 1.0 is standard weighted random selection. Values > 1.0 increase randomness (making less frequent tokens more likely). Values < 1.0 decrease randomness (making more frequent tokens even more likely). A value of 0 or less results in deterministic selection (always choosing the most frequent token).

func WithTopK

func WithTopK(k int) GenerateOption

WithTopK restricts the token selection pool to the top `k` most frequent tokens at each step. A value of 0 disables Top-K sampling.
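
These options compose. A short example of a conservative generation run, reusing gen, ctx, and model from the README's Quick Start (parameter values are illustrative):

text, err := gen.Generate(ctx, model,
	markov.WithMaxLength(50),          // emit at most 50 tokens
	markov.WithTemperature(0.8),       // favor more frequent transitions
	markov.WithTopK(20),               // sample only from the 20 most frequent candidates
	markov.WithEarlyTermination(true), // allow stopping at an EOC token
)
if err != nil {
	log.Fatal(err)
}
fmt.Println(text)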

type Generator

type Generator struct {
	// contains filtered or unexported fields
}

Generator is the main entry point for interacting with the Markov chain library. It holds the database connection, a tokenizer, and prepared SQL statements for efficient database interaction.

func NewGenerator

func NewGenerator(db *sql.DB, tokenizer Tokenizer) (*Generator, error)

NewGenerator creates and returns a new Generator. It takes a database connection and a Tokenizer implementation. It pre-compiles all necessary SQL statements, returning an error if any preparation fails.

func (*Generator) Close

func (g *Generator) Close()

Close releases all prepared SQL statements held by the Generator. It should be called when the Generator is no longer needed to free up database resources.

func (*Generator) ExportModel

func (g *Generator) ExportModel(ctx context.Context, modelInfo ModelInfo, w io.Writer) error

ExportModel serializes a given model into a JSON format and writes it to the provided io.Writer. This is useful for backups or for transferring models.

func (*Generator) Generate

func (g *Generator) Generate(ctx context.Context, model ModelInfo, opts ...GenerateOption) (string, error)

Generate creates a new Markov chain, builds it into a single string, and returns it. It starts from a default initial state of Start-Of-Chain (SOC) tokens. Generation can be customized with GenerateOption functions.

func (*Generator) GenerateFromStream

func (g *Generator) GenerateFromStream(ctx context.Context, model ModelInfo, r io.Reader, opts ...GenerateOption) (string, error)

GenerateFromStream uses the content of an io.Reader as a seed for generation. The provided text is tokenized and used as the initial state of the chain, from which generation continues. An error is returned if a seed token is not found in the model's vocabulary.

func (*Generator) GenerateFromString

func (g *Generator) GenerateFromString(ctx context.Context, model ModelInfo, startText string, opts ...GenerateOption) (string, error)

GenerateFromString is a convenience wrapper around GenerateFromStream that uses a string as the seed. If the string is empty, it behaves identically to Generate.
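
A brief sketch of seeded generation, reusing gen, ctx, and model from the Quick Start (the seed text is illustrative and must tokenize to words in the model's vocabulary):

text, err := gen.GenerateFromString(ctx, model, "The quick", markov.WithMaxLength(20))
if err != nil {
	log.Fatal(err) // e.g. a seed token missing from the vocabulary
}
fmt.Println(text)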

func (*Generator) GenerateStream

func (g *Generator) GenerateStream(ctx context.Context, model ModelInfo, opts ...GenerateOption) (<-chan Token, error)

GenerateStream creates a new Markov chain and returns a read-only channel of Tokens. This allows for processing the generated text token-by-token, which is useful for real-time applications or when generating very long sequences. The channel will be closed once generation is complete or the context is cancelled.

func (*Generator) GenerateStreamFromStream

func (g *Generator) GenerateStreamFromStream(ctx context.Context, model ModelInfo, r io.Reader, opts ...GenerateOption) (<-chan Token, error)

GenerateStreamFromStream uses an io.Reader as a seed to begin a streaming generation. It tokenizes the content from r and uses it as the initial chain state.

func (*Generator) GenerateStreamFromString

func (g *Generator) GenerateStreamFromString(ctx context.Context, model ModelInfo, startText string, opts ...GenerateOption) (<-chan Token, error)

GenerateStreamFromString is a convenience wrapper around GenerateStreamFromStream that uses a string as the seed. If the string is empty, it behaves identically to GenerateStream.

func (*Generator) GetModelInfo

func (g *Generator) GetModelInfo(ctx context.Context, modelName string) (ModelInfo, error)

GetModelInfo retrieves the metadata for a single model specified by name. If multiple models are needed, GetModelInfos is more efficient.

func (*Generator) GetModelInfos

func (g *Generator) GetModelInfos(ctx context.Context) (map[string]ModelInfo, error)

GetModelInfos retrieves metadata for all models currently in the database, returning them in a map keyed by model name.

func (*Generator) GetNextTokens

func (g *Generator) GetNextTokens(ctx context.Context, model ModelInfo, prefix string) ([]ChainToken, int, error)

GetNextTokens retrieves all possible subsequent tokens for a given prefix key from a specific model. It returns a slice of ChainTokens, the sum of all their frequencies, and any error that occurred. If the prefix is not found, it returns a nil slice and a total frequency of 0.
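
For example, to inspect a model's transition distribution for one prefix (gen, ctx, and model as in the Quick Start; the prefix key shown is an assumption and must match how the model stores prefixes):

tokens, total, err := gen.GetNextTokens(ctx, model, "quick brown")
if err != nil {
	log.Fatal(err)
}
for _, t := range tokens {
	text, err := gen.VocabInt(ctx, t.Id) // resolve the token ID to its text
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%-12s %d/%d (%.1f%%)\n", text, t.Freq, total, 100*float64(t.Freq)/float64(total))
}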

func (*Generator) GetStats

func (g *Generator) GetStats(ctx context.Context) (*DBStats, error)

GetStats returns a snapshot of statistics for the entire database, including global counts and per-model stats.
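
A minimal sketch of reporting these statistics (gen and ctx as in the Quick Start):

stats, err := gen.GetStats(ctx)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("models: %d, vocab: %d tokens, prefixes: %d\n",
	len(stats.Models), stats.VocabSize, stats.PrefixSize)
for _, m := range stats.Models {
	s := stats.Stats[m.Id]
	fmt.Printf("  %s (order %d): %d chains, %d transitions\n",
		m.Name, m.Order, s.TotalChains, s.TotalFrequency)
}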

func (*Generator) ImportModel

func (g *Generator) ImportModel(ctx context.Context, r io.Reader) error

ImportModel reads a JSON representation of a model from an io.Reader and merges its data into the database. If the model name already exists, the new chain data is merged with the existing data (frequencies are added). If the model does not exist, it is created. The entire operation is transactional and handles re-mapping of vocabulary and prefix IDs.
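
The counterpart to the README's export example, assuming the same backup file name:

file, err := os.Open("model_backup.json")
if err != nil {
	log.Fatal(err)
}
defer file.Close()

if err := gen.ImportModel(ctx, file); err != nil {
	log.Fatal(err)
}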

func (*Generator) InsertModel

func (g *Generator) InsertModel(ctx context.Context, model ModelInfo) error

InsertModel creates a new model entry in the database.

func (*Generator) InsertToken

func (g *Generator) InsertToken(ctx context.Context, model ModelInfo, prefix string, token int) error

InsertToken provides a low-level way to insert or increment a single chain link (`prefix -> token`) for a given model. For most use cases, the high-level Train function is recommended as it is significantly more efficient for bulk data.

func (*Generator) PruneModel

func (g *Generator) PruneModel(ctx context.Context, model ModelInfo, minFreq int) error

PruneModel removes all chain links from a specific model that have a frequency less than or equal to `minFreq`. This is useful for reducing the size of a model by removing rare, and often noisy, transitions.

func (*Generator) RemoveModel

func (g *Generator) RemoveModel(ctx context.Context, model ModelInfo) error

RemoveModel deletes a model and all of its associated chain data from the database. The operation is performed within a transaction.

func (*Generator) SetLogger

func (g *Generator) SetLogger(logger *slog.Logger)

SetLogger sets the logger for the Generator. By default, all logs are discarded. Providing a `log/slog.Logger` will enable logging for training, generation, and other operations.
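
For example, to log to standard error with log/slog:

gen.SetLogger(slog.New(slog.NewTextHandler(os.Stderr, nil)))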

func (*Generator) SetTokenizer added in v1.1.0

func (g *Generator) SetTokenizer(tokenizer Tokenizer)

SetTokenizer sets the tokenizer used by the Generator for subsequent training and generation.

func (*Generator) Train

func (g *Generator) Train(ctx context.Context, model ModelInfo, data io.Reader) error

Train processes a stream of text from an io.Reader, tokenizes it, and uses it to train the specified Markov model. The training process is highly optimized, using in-memory caching and database batching to handle large datasets efficiently. The entire operation is performed within a single database transaction to ensure data integrity.
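
Because Train accepts any io.Reader, large corpora can be streamed straight from disk. A minimal sketch (the file name is illustrative; gen, ctx, and model as in the Quick Start):

f, err := os.Open("corpus.txt")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

if err := gen.Train(ctx, model, f); err != nil {
	log.Fatal(err)
}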

func (*Generator) VocabInt

func (g *Generator) VocabInt(ctx context.Context, id int) (string, error)

VocabInt looks up a token ID in the vocabulary and returns its corresponding text. It returns an error if the ID is not found.

func (*Generator) VocabStr

func (g *Generator) VocabStr(ctx context.Context, token string) (int, error)

VocabStr looks up a token string in the vocabulary and returns its corresponding ID. It returns an error if the token is not found.

func (*Generator) VocabularyPrune

func (g *Generator) VocabularyPrune(ctx context.Context, minFrequency int) error

VocabularyPrune performs a database-wide cleanup, removing tokens from the global vocabulary that are used less than `minFrequency` times across all models. This is a destructive operation that will also delete all chain links and prefixes that rely on the removed tokens. It should be used with caution to reduce the overall database size. Special tokens (<SOC>, <EOC>) are never pruned.
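
A typical maintenance pass might prune rare transitions per model and then compact the shared vocabulary. A sketch with illustrative thresholds:

// Drop this model's links with frequency <= 1.
if err := gen.PruneModel(ctx, model, 1); err != nil {
	log.Fatal(err)
}
// Then remove tokens used fewer than 2 times across all models.
if err := gen.VocabularyPrune(ctx, 2); err != nil {
	log.Fatal(err)
}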

type ModelInfo

type ModelInfo struct {
	Id    int
	Name  string
	Order int
}

ModelInfo holds the essential metadata for a Markov model, including its unique ID, name, and the order of the chain (the number of preceding tokens used to predict the next one).

type ModelStats

type ModelStats struct {
	TotalChains    int // The number of unique prefix->next_token links.
	TotalFrequency int // The sum of frequencies of all links; the total number of trained transitions.
	StartingTokens int // The number of unique tokens that can start a chain.
}

ModelStats holds aggregated statistics for a single Markov model.

type Option

type Option func(*DefaultTokenizer)

Option is a function that configures a DefaultTokenizer.

func WithEOC added in v1.1.0

func WithEOC(eoc string) Option

WithEOC sets the string to use in the final output for an EOC token. Default: "."

func WithEOCExcRegex added in v1.1.0

func WithEOCExcRegex(eocRegex string) Option

WithEOCExcRegex sets the regex string to use when deciding whether to add an EOC token after the last token.

func WithEOCRegex added in v1.1.0

func WithEOCRegex(eocRegex string) Option

WithEOCRegex sets the regex string to use when deciding whether a token is an EOC token or not. Default: `^[.!?]$`

func WithSeparator

func WithSeparator(sep string) Option

WithSeparator sets the separator string used to join tokens during generation. Default: " "

func WithSeparatorExcRegex added in v1.1.0

func WithSeparatorExcRegex(splitExcRegex string) Option

WithSeparatorExcRegex sets the regex string to use when deciding whether to add a separator before a token.

func WithSeparatorRegex added in v1.1.0

func WithSeparatorRegex(splitRegex string) Option

WithSeparatorRegex sets the regex string to use when splitting input text. Default: `[\w']+|[.,!?;]`

type StreamTokenizer

type StreamTokenizer interface {
	// Next returns the next token from the stream. It returns io.EOF as the
	// error when the stream is fully consumed.
	Next() (*Token, error)
}

StreamTokenizer is an interface for a stateful tokenizer that processes a stream of data, returning one token at a time.

type Token

type Token struct {
	Text string
	EOC  bool
}

Token represents a single tokenized unit of text. It contains the text itself and a boolean flag indicating if it marks the end of a chain (e.g., a sentence).

type Tokenizer

type Tokenizer interface {
	// NewStream returns a stateful StreamTokenizer for processing an io.Reader.
	NewStream(io.Reader) StreamTokenizer
	// Separator returns the string that should be used to join tokens
	// when building a final generated string, using the previous and current
	// tokens.
	Separator(prev, current string) string
	// EOC returns the string representation for an End-Of-Chain token
	// in the final generated output, using the last token in the sequence.
	EOC(last string) string
}

Tokenizer is an interface that defines the contract for splitting input text into tokens. This allows the core generator logic to be independent of the specific tokenization strategy.
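
A minimal custom implementation that splits purely on whitespace; this is a sketch of the interface contract (it additionally requires the bufio, io, and strings imports), not production code:

type wordTokenizer struct{}

func (wordTokenizer) NewStream(r io.Reader) markov.StreamTokenizer {
	sc := bufio.NewScanner(r)
	sc.Split(bufio.ScanWords) // yield whitespace-delimited words
	return &wordStream{sc: sc}
}

func (wordTokenizer) Separator(prev, current string) string { return " " }

func (wordTokenizer) EOC(last string) string { return "." }

type wordStream struct{ sc *bufio.Scanner }

// Next returns the next word, flagging words that end with a period as
// End-Of-Chain tokens. It returns io.EOF once the stream is exhausted.
func (s *wordStream) Next() (*markov.Token, error) {
	if !s.sc.Scan() {
		if err := s.sc.Err(); err != nil {
			return nil, err
		}
		return nil, io.EOF
	}
	word := s.sc.Text()
	eoc := strings.HasSuffix(word, ".")
	return &markov.Token{Text: strings.TrimSuffix(word, "."), EOC: eoc}, nil
}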
