bag

Version: v1.0.1
Published: Aug 12, 2024 License: MIT Imports: 6 Imported by: 0

README

Bag is a bag of words (BoW) implementation written in Go that uses a Naive Bayes classifier for text classification. It works both as a library that can be imported into Go code and as a command line tool, which makes it usable from any programming language. It also supports a training set file format, so that a classifier can be defined declaratively and loaded in different environments.

What is Bag of Words (BoW)?

The bag of words (BoW) model is a fundamental text representation technique in natural language processing (NLP). In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and word order but keeping multiplicity. The key idea is to create a vocabulary of all the unique words in the text corpus and then represent each text by a vector of word frequencies or binary indicators. This vector indicates the presence or absence, or frequency, of each word from the vocabulary within the text. The BoW model is widely used for text classification tasks, including sentiment analysis, due to its simplicity and effectiveness in capturing word occurrences.

Examples

New
func ExampleNew() {
	var cfg Config
	var err error
	// Initialize with default values; New returns an error that must be checked
	if exampleBag, err = New(cfg); err != nil {
		log.Fatal(err)
	}
}
NewFromTrainingSet
func ExampleNewFromTrainingSet() {
	var t TrainingSet
	t.Samples = SamplesByLabel{
		"positive": {
			"I love this product, it is amazing!",
			"I am very happy with this.",
			"Very good",
		},

		"negative": {
			"This is the worst thing ever.",
			"I hate this so much.",
			"Not good",
		},
	}

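	// Initialize with default values; NewFromTrainingSet returns an error that must be checked
	var err error
	if exampleBag, err = NewFromTrainingSet(t); err != nil {
		log.Fatal(err)
	}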
}
Bag.Train
func ExampleBag_Train() {
	exampleBag.Train("I love this product, it is amazing!", "positive")
	exampleBag.Train("This is the worst thing ever.", "negative")
	exampleBag.Train("I am very happy with this.", "positive")
	exampleBag.Train("I hate this so much.", "negative")
	exampleBag.Train("Not good", "negative")
	exampleBag.Train("Very good", "positive")
}
Bag.GetResults
func ExampleBag_GetResults() {
	exampleResults = exampleBag.GetResults("I am very happy with this product.")
	fmt.Println("Collection of results", exampleResults)
}
Results.GetHighestProbability
func ExampleResults_GetHighestProbability() {
	match := exampleResults.GetHighestProbability()
	fmt.Println("Highest probability", match)
}
TrainingSet File
config:
  ngram-size: 1
samples:
  yes:
    - "yes"
    - "Yeah"
    - "Yep"

  no:
    - "No"
    - "Nope"
    - "Nah"

# Note: This training set is kept short to limit the size of the README.
# See the examples directory for more complete examples.

Road to v1.0.0

  • Working implementation as Go library
  • Training sets
  • Support Character NGrams
  • Text normalization added to inbound text processing
  • CLI utility

Long term goals

  • Generated model as MMAP file

Contributors ✨

Thanks go to these wonderful people (emoji key):

  • Josh Montoya: 💻 📖
  • Matt Stay: 🎨
  • Chewxy: ⚠️
  • Jack Muir: ⚠️

This project follows the all-contributors specification. Contributions of any kind welcome!

Documentation

Constants

const (
	// DefaultNGramSize is set to 3 (trigram)
	DefaultNGramSize = 3
	// DefaultNGramType is set to word
	DefaultNGramType = "word"
	// DefaultSmoothingParameter is set to 1 (common Laplace smoothing value)
	DefaultSmoothingParameter = 1
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Bag

type Bag struct {
	// contains filtered or unexported fields
}

Bag represents a bag of words (BoW) model

func New

func New(c Config) (out *Bag, err error)

New will initialize and return a new Bag with a provided configuration

Example
var (
	cfg Config
	err error
)

// Initialize with default values
if exampleBag, err = New(cfg); err != nil {
	log.Fatal(err)
}

func NewFromTrainingSet

func NewFromTrainingSet(t TrainingSet) (b *Bag, err error)

NewFromTrainingSet will initialize and return a new pre-trained Bag from a provided training set

Example
var (
	t   TrainingSet
	err error
)

t.Samples = SamplesByLabel{
	"positive": {
		"I love this product, it is amazing!",
		"I am very happy with this.",
		"Very good",
	},

	"negative": {
		"This is the worst thing ever.",
		"I hate this so much.",
		"Not good",
	},
}

// Initialize with default values
if exampleBag, err = NewFromTrainingSet(t); err != nil {
	log.Fatal(err)
}

func NewFromTrainingSetFile added in v0.4.1

func NewFromTrainingSetFile(filepath string) (b *Bag, err error)

NewFromTrainingSetFile will initialize and return a new pre-trained Bag from a provided training set filepath

func (*Bag) GetResults

func (b *Bag) GetResults(in string) (r Results)

GetResults will return the classification results for a given input string
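
As a rough mental model of how a Naive Bayes classifier can turn training counts into per-label scores, here is a minimal, self-contained sketch. It is not the package's implementation: the `model` type and its `train` and `scores` methods are hypothetical, it uses unigrams with a fixed smoothing value of 1, and it returns log-probability scores, whereas the documented `Results` type is described in terms of probabilities.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// model holds per-label term counts. Hypothetical stand-in, not the package's Bag.
type model struct {
	termCounts map[string]map[string]int // label -> term -> count
	docCounts  map[string]int            // label -> number of training samples
	vocab      map[string]bool           // all distinct terms seen in training
}

func newModel() *model {
	return &model{
		termCounts: map[string]map[string]int{},
		docCounts:  map[string]int{},
		vocab:      map[string]bool{},
	}
}

// train counts the lowercased unigrams of one labeled sample.
func (m *model) train(in, label string) {
	if m.termCounts[label] == nil {
		m.termCounts[label] = map[string]int{}
	}
	m.docCounts[label]++
	for _, term := range strings.Fields(strings.ToLower(in)) {
		m.termCounts[label][term]++
		m.vocab[term] = true
	}
}

// scores returns a log-probability score per label: the log prior plus the
// sum of Laplace-smoothed (alpha = 1) log likelihoods of each input term.
func (m *model) scores(in string) map[string]float64 {
	totalDocs := 0
	for _, n := range m.docCounts {
		totalDocs += n
	}
	out := map[string]float64{}
	for label, counts := range m.termCounts {
		totalTerms := 0
		for _, n := range counts {
			totalTerms += n
		}
		score := math.Log(float64(m.docCounts[label]) / float64(totalDocs))
		for _, term := range strings.Fields(strings.ToLower(in)) {
			score += math.Log(float64(counts[term]+1) / float64(totalTerms+len(m.vocab)))
		}
		out[label] = score
	}
	return out
}

func main() {
	m := newModel()
	m.train("very good", "positive")
	m.train("i love this", "positive")
	m.train("not good", "negative")
	m.train("i hate this", "negative")

	s := m.scores("love this")
	fmt.Println(s["positive"] > s["negative"]) // true
}
```

Working in log space avoids floating-point underflow when many small likelihoods are multiplied together.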

Example
exampleResults = exampleBag.GetResults("I am very happy with this product.")
fmt.Println("Collection of results", exampleResults)

func (*Bag) Train

func (b *Bag) Train(in, label string)

Train will process a given input string and assign it the provided label for training

Example
exampleBag.Train("I love this product, it is amazing!", "positive")
exampleBag.Train("This is the worst thing ever.", "negative")
exampleBag.Train("I am very happy with this.", "positive")
exampleBag.Train("I hate this so much.", "negative")
exampleBag.Train("Not good", "negative")
exampleBag.Train("Very good", "positive")

type Config

type Config struct {
	// NGramSize represents the NGram size (unigram, bigram, trigram, etc - default is trigram)
	NGramSize int `yaml:"ngram-size"`
	// NGramType represents the NGram type (word or character - default is word)
	NGramType string `yaml:"ngram-type"`
	// SmoothingParameter represents the smoothing value used for the Laplace Smoothing (default is 1)
	SmoothingParameter float64 `yaml:"smoothing-parameter"`
}

func (*Config) Validate

func (c *Config) Validate() (err error)

type Results

type Results map[string]float64

Results (by label) represents the probability of a processed input matching each of the possible labels (classifications)

func (Results) GetHighestProbability

func (r Results) GetHighestProbability() (match string)

Example
match := exampleResults.GetHighestProbability()
fmt.Println("Highest probability", match)

type Samples

type Samples []string

Samples represents a set of input samples to be used for model training

type SamplesByLabel

type SamplesByLabel map[string]Samples

SamplesByLabel represents sets of samples keyed by label

type TrainingSet

type TrainingSet struct {
	Config `yaml:"config"`

	Samples SamplesByLabel `yaml:"samples"`
}

TrainingSet is used to train a bag of words (BoW) model

type Vocabulary

type Vocabulary map[string]int

Vocabulary maps each n-gram (as a string) to the number of times it has been encountered
