Documentation ¶
Index ¶
- func AgreeingOCRs(t T, i, n int) (float64, bool)
- func ApplyOCRToCorrection(ocr, sug string) string
- func CandidateAgreeingOCR(t T, i, n int) (float64, bool)
- func CandidateHistPatternConf(t T, i, n int) (float64, bool)
- func CandidateHistPatternConfLog(t T, i, n int) (float64, bool)
- func CandidateLen(t T, i, n int) (float64, bool)
- func CandidateLevenshteinDist(t T, i, n int) (float64, bool)
- func CandidateMatchesOCR(t T, i, n int) (float64, bool)
- func CandidateMaxTrigramFreq(t T, i, n int) (float64, bool)
- func CandidateMinTrigramFreq(t T, i, n int) (float64, bool)
- func CandidateOCRPatternConf(t T, i, n int) (float64, bool)
- func CandidateOCRPatternConfLog(t T, i, n int) (float64, bool)
- func CandidateProfilerWeight(t T, i, n int) (float64, bool)
- func CandidateTrigramFreq(t T, i, n int) (float64, bool)
- func CandidateTrigramFreqLog(t T, i, n int) (float64, bool)
- func CandidateUnigramFreq(t T, i, n int) (float64, bool)
- func DocumentLexicality(t T, i, n int) (float64, bool)
- func EachToken(ctx context.Context, in <-chan T, f func(T) error) error
- func EachTokenGroup(ctx context.Context, in <-chan T, f func(string, ...T) error) error
- func EachTokenLM(ctx context.Context, in <-chan T, f func(*LanguageModel, ...T) error) error
- func EachTrigram(str string, f func(string))
- func Log(f string, args ...interface{})
- func LogEnabled() bool
- func OCRLevenshteinDist(t T, i, n int) (float64, bool)
- func OCRMaxCharConf(t T, i, n int) (float64, bool)
- func OCRMaxTrigramFreq(t T, i, n int) (float64, bool)
- func OCRMinCharConf(t T, i, n int) (float64, bool)
- func OCRMinTrigramFreq(t T, i, n int) (float64, bool)
- func OCRTokenLen(t T, i, n int) (float64, bool)
- func OCRTrigramFreq(t T, i, n int) (float64, bool)
- func OCRUnigramFreq(t T, i, n int) (float64, bool)
- func Pipe(ctx context.Context, fns ...StreamFunc) error
- func RankingCandidateConfDiffToNext(t T, i, n int) (float64, bool)
- func RankingConf(t T, i, n int) (float64, bool)
- func RankingConfDiffToNext(t T, i, n int) (float64, bool)
- func ReadProfile(name string) (gofiler.Profile, error)
- func RunProfiler(ctx context.Context, exe, config string, tokens ...T) (gofiler.Profile, error)
- func SendTokens(ctx context.Context, out chan<- T, tokens ...T) error
- func SetLog(enable bool)
- func WriteProfile(name string, profile gofiler.Profile) error
- type Char
- type Chars
- type Correction
- type FeatureFunc
- type FeatureSet
- type FreqList
- type LanguageModel
- func (lm *LanguageModel) AddUnigram(token string)
- func (lm *LanguageModel) EachTrigram(str string, f func(float64))
- func (lm *LanguageModel) LoadGzippedNGram(path string) error
- func (lm *LanguageModel) ReadProfile(ctx context.Context, exe, config string, cache bool, tokens ...T) error
- func (lm *LanguageModel) Trigram(str string) float64
- func (lm *LanguageModel) TrigramLog(str string) float64
- func (lm *LanguageModel) Unigram(str string) float64
- type Model
- type ModelData
- type Ranking
- type StreamFunc
- func Combine(ctx context.Context, fns ...StreamFunc) StreamFunc
- func ConnectCandidates() StreamFunc
- func ConnectCorrections(lr *ml.LR, fs FeatureSet, n int) StreamFunc
- func ConnectLM(ngrams FreqList) StreamFunc
- func ConnectProfile(exe, config string, cache bool) StreamFunc
- func ConnectRankings(lr *ml.LR, fs FeatureSet, n int) StreamFunc
- func ConnectUnigrams() StreamFunc
- func FilterBad(min int) StreamFunc
- func FilterLexiconEntries() StreamFunc
- func FilterShort(min int) StreamFunc
- func Normalize() StreamFunc
- func Tee(fns ...func(T) error) StreamFunc
- type T
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func AgreeingOCRs ¶
AgreeingOCRs returns the number of OCRs that agree with the master OCR token.
func ApplyOCRToCorrection ¶
ApplyOCRToCorrection applies the casing of the master OCR string to the correction's candidate suggestion and prepends and appends any punctuation of the master OCR to the suggestion.
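The following is a minimal sketch of the idea, not the package's implementation. The names applyCase and applyOCR are hypothetical, punctuation is taken to be whatever unicode.IsPunct reports, and the rule that suggestion runes beyond the OCR's length follow the casing of the last OCR rune is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// applyCase copies the casing of model onto s rune by rune. Runes of s
// beyond the length of model follow the casing of model's last rune
// (an assumption of this sketch). model must not be empty.
func applyCase(model, s string) string {
	m := []rune(model)
	out := []rune(s)
	for i, r := range out {
		j := i
		if j >= len(m) {
			j = len(m) - 1
		}
		if unicode.IsUpper(m[j]) {
			out[i] = unicode.ToUpper(r)
		} else {
			out[i] = unicode.ToLower(r)
		}
	}
	return string(out)
}

// applyOCR transfers casing and surrounding punctuation from ocr to sug.
func applyOCR(ocr, sug string) string {
	notPunct := func(r rune) bool { return !unicode.IsPunct(r) }
	first := strings.IndexFunc(ocr, notPunct)
	if first < 0 {
		return sug // ocr is empty or punctuation only: nothing to transfer
	}
	last := strings.LastIndexFunc(ocr, notPunct)
	_, size := utf8.DecodeRuneInString(ocr[last:])
	prefix, core, suffix := ocr[:first], ocr[first:last+size], ocr[last+size:]
	return prefix + applyCase(core, sug) + suffix
}

func main() {
	fmt.Println(applyOCR("(Vnd,", "und")) // (Und,
	fmt.Println(applyOCR("VND", "und"))   // UND
}
```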
func CandidateAgreeingOCR ¶
CandidateAgreeingOCR returns the number of OCR tokens that agree with the specific profiler candidate of the token.
func CandidateHistPatternConf ¶
CandidateHistPatternConf returns the product of the confidences of the primary OCR characters for the assumed historical rewrite pattern of the connected candidate.
func CandidateHistPatternConfLog ¶ added in v0.0.17
CandidateHistPatternConfLog returns the sum of the logarithms of the confidences of the primary OCR characters for the assumed historical rewrite pattern of the connected candidate.
func CandidateLen ¶
CandidateLen returns the length of the connected profiler candidate.
func CandidateLevenshteinDist ¶
CandidateLevenshteinDist returns the Levenshtein distance between the OCR token and the token's connected profiler candidate. For the master OCR the corresponding Distance from the profiler candidate is used, whereas for support OCRs the Levenshtein distance is calculated.
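For the support OCR case, the distance has to be computed directly. A minimal sketch of the standard two-row dynamic programming algorithm follows; levenshtein and min3 are hypothetical names, and the sketch compares runes with unit costs, which may differ from what the package does.

```go
package main

import "fmt"

// min3 returns the smallest of its three arguments.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the number of single-rune insertions, deletions
// and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(levenshtein("vnd", "und"))        // 1
	fmt.Println(levenshtein("kitten", "sitting")) // 3
}
```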
func CandidateMatchesOCR ¶
CandidateMatchesOCR returns true if the corresponding OCR matches the connected candidate and false otherwise.
func CandidateMaxTrigramFreq ¶
CandidateMaxTrigramFreq returns the maximal trigram frequency for the connected candidate.
func CandidateMinTrigramFreq ¶
CandidateMinTrigramFreq returns the minimal trigram frequency for the connected candidate.
func CandidateOCRPatternConf ¶
CandidateOCRPatternConf returns the product of the confidences of the primary OCR characters for the assumed OCR error pattern of the connected candidate. TODO: rename to CandiateErrPatternConf
func CandidateOCRPatternConfLog ¶ added in v0.0.16
CandidateOCRPatternConfLog returns the sum of the logarithm of the confidences of the primary OCR characters for the assumed OCR error pattern of the connected candidate.
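The product and the log-sum variants carry the same information: the exponential of the log-sum equals the product. A product of many values between 0 and 1 can become very small, which is a common reason to work in log space. The sketch below uses the hypothetical helpers confProduct and confLogSum and the natural logarithm; how the package treats a confidence of 0 (whose logarithm is negative infinity) is not covered here.

```go
package main

import (
	"fmt"
	"math"
)

// confProduct multiplies the given confidences.
func confProduct(confs []float64) float64 {
	p := 1.0
	for _, c := range confs {
		p *= c
	}
	return p
}

// confLogSum adds up the natural logarithms of the given confidences.
func confLogSum(confs []float64) float64 {
	s := 0.0
	for _, c := range confs {
		s += math.Log(c)
	}
	return s
}

func main() {
	confs := []float64{0.9, 0.5, 0.8}
	// Both values agree up to floating point rounding.
	fmt.Println(confProduct(confs), math.Exp(confLogSum(confs)))
}
```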
func CandidateProfilerWeight ¶
CandidateProfilerWeight returns the profiler confidence value for the token's candidate.
func CandidateTrigramFreq ¶
CandidateTrigramFreq returns the product of the relative frequencies of the candidate's trigrams.
func CandidateTrigramFreqLog ¶ added in v0.0.17
CandidateTrigramFreqLog returns the sum of the logarithmic relative frequencies of the candidate's trigrams.
func CandidateUnigramFreq ¶
CandidateUnigramFreq returns the relative frequency of the token's candidate.
func DocumentLexicality ¶
DocumentLexicality returns the (global) lexicality of the given token's document. Using this feature only makes sense if the training contains more than one training document.
func EachToken ¶
EachToken iterates over the tokens in the input channel and calls the callback function for each token.
func EachTokenGroup ¶ added in v0.0.6
EachTokenGroup iterates over the tokens grouping them together based on their groups. The given callback function is called for each group of tokens.
func EachTokenLM ¶ added in v0.0.29
EachTokenLM iterates over the tokens grouping them together based on their language models. The given callback function is called for each group of tokens. This function assumes that the tokens are connected with a language model.
func EachTrigram ¶ added in v0.0.18
EachTrigram calls the given callback function for each trigram in the given string.
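A minimal sketch of such an iteration, assuming a plain sliding window over the runes of the string. The lower-case names are hypothetical, and whether the package pads the string or how it treats strings shorter than three characters is not stated above; in this sketch such strings produce no callback calls.

```go
package main

import "fmt"

// eachTrigram calls f for every window of three consecutive runes in str.
// Strings with fewer than three runes yield no calls (an assumption of
// this sketch).
func eachTrigram(str string, f func(string)) {
	runes := []rune(str)
	for i := 0; i+3 <= len(runes); i++ {
		f(string(runes[i : i+3]))
	}
}

// trigrams collects the trigrams of str into a slice.
func trigrams(str string) []string {
	var out []string
	eachTrigram(str, func(t string) { out = append(out, t) })
	return out
}

func main() {
	fmt.Println(trigrams("über")) // [übe ber]
}
```

Working on runes rather than bytes keeps multi-byte characters such as "ü" intact.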
func Log ¶ added in v0.0.21
func Log(f string, args ...interface{})
Log logs the given message if logging is enabled. This function uses log.Printf for logging, so it is safe to be used concurrently.
func LogEnabled ¶ added in v0.0.29
func LogEnabled() bool
LogEnabled returns true if logging is currently enabled.
func OCRLevenshteinDist ¶ added in v0.0.17
OCRLevenshteinDist returns the Levenshtein distance between the secondary OCRs and the primary OCR.
func OCRMaxCharConf ¶
OCRMaxCharConf returns the maximal character confidence of the master OCR token.
func OCRMaxTrigramFreq ¶
OCRMaxTrigramFreq returns the maximal relative trigram frequency of the tokens.
func OCRMinCharConf ¶
OCRMinCharConf returns the minimal character confidence of the master OCR token.
func OCRMinTrigramFreq ¶
OCRMinTrigramFreq returns the minimal relative trigram frequency of the tokens.
func OCRTokenLen ¶
OCRTokenLen returns the length of the OCR token. It operates on any configuration.
func OCRTrigramFreq ¶
OCRTrigramFreq returns the product of the relative frequencies of the OCR token's trigrams.
func OCRUnigramFreq ¶
OCRUnigramFreq returns the relative frequency of the OCR token in the unigram language model.
func Pipe ¶
func Pipe(ctx context.Context, fns ...StreamFunc) error
Pipe pipes multiple stream funcs together, making sure to run all of them concurrently. The first function in the list (the reader) is called with a nil input channel. The last function is always called with a nil output channel. To clarify: the first function must never read from its input channel and the last function must never write to its output channel.
Stream functions should transform the input tokens to output tokens. They must never close any channels. They should use the SendTokens, ReadToken and EachToken utility functions to ensure proper handling of context cancelation.
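The contract described above can be sketched with the standard library alone. Everything in this sketch is hypothetical: token and streamFunc stand in for T and StreamFunc (whose definition is not shown on this page), and pipe, source, each and run are illustrative helpers, not the package's code. The pipe owns and closes the channels, so the stages never close anything.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// token and streamFunc are hypothetical stand-ins for T and StreamFunc.
type token string
type streamFunc func(ctx context.Context, in <-chan token, out chan<- token) error

// pipe wires each function's output to the next function's input and
// runs all functions concurrently. The first error cancels the context.
func pipe(ctx context.Context, fns ...streamFunc) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	var (
		wg    sync.WaitGroup
		once  sync.Once
		first error
		in    <-chan token // nil for the first function
	)
	for i, fn := range fns {
		var out chan token // stays nil for the last function
		if i < len(fns)-1 {
			out = make(chan token)
		}
		wg.Add(1)
		go func(fn streamFunc, in <-chan token, out chan token) {
			defer wg.Done()
			if out != nil {
				defer close(out) // the pipe closes channels, not the stages
			}
			if err := fn(ctx, in, out); err != nil {
				once.Do(func() { first = err; cancel() })
			}
		}(fn, in, out)
		in = out
	}
	wg.Wait()
	return first
}

// source returns a reader stage; it never touches its (nil) input channel.
func source(words ...string) streamFunc {
	return func(ctx context.Context, _ <-chan token, out chan<- token) error {
		for _, w := range words {
			select {
			case out <- token(w):
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		return nil
	}
}

// each calls f for every token read from in until in is closed or ctx is done.
func each(ctx context.Context, in <-chan token, f func(token) error) error {
	for {
		select {
		case t, ok := <-in:
			if !ok {
				return nil
			}
			if err := f(t); err != nil {
				return err
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// run pipes words through a marking stage into a collecting sink.
func run(words ...string) ([]string, error) {
	var got []string
	mark := func(ctx context.Context, in <-chan token, out chan<- token) error {
		return each(ctx, in, func(t token) error {
			select {
			case out <- t + "!":
				return nil
			case <-ctx.Done():
				return ctx.Err()
			}
		})
	}
	sink := func(ctx context.Context, in <-chan token, _ chan<- token) error {
		return each(ctx, in, func(t token) error {
			got = append(got, string(t))
			return nil
		})
	}
	err := pipe(context.Background(), source(words...), mark, sink)
	return got, err
}

func main() {
	got, err := run("vnd", "und")
	fmt.Println(got, err) // [vnd! und!] <nil>
}
```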
func RankingCandidateConfDiffToNext ¶
RankingCandidateConfDiffToNext returns the top ranked candidate's weight minus the weight of the next (or 0).
func RankingConf ¶
RankingConf returns the confidence of the best ranked correction candidate for the given token.
func RankingConfDiffToNext ¶
RankingConfDiffToNext returns the difference of the best ranked correction candidate's confidence to the next. If only one correction candidate is available, the next ranking's confidence is assumed to be 0.
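As a small illustration of that rule, the sketch below works on a plain slice of confidences that is assumed to be sorted in descending order. The function name and the choice to report "does not apply" for an empty slice are assumptions of this sketch.

```go
package main

import "fmt"

// confDiffToNext expects confidences sorted in descending order and
// returns the gap between the best and the second-best one. A missing
// second confidence counts as 0; with no confidences at all the value
// does not apply.
func confDiffToNext(confs []float64) (float64, bool) {
	switch len(confs) {
	case 0:
		return 0, false
	case 1:
		return confs[0], true
	default:
		return confs[0] - confs[1], true
	}
}

func main() {
	fmt.Println(confDiffToNext([]float64{0.75, 0.25})) // 0.5 true
	fmt.Println(confDiffToNext([]float64{0.75}))       // 0.75 true
}
```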
func ReadProfile ¶ added in v0.0.17
ReadProfile reads the profile from a gzipped JSON-formatted file.
func RunProfiler ¶ added in v0.0.17
RunProfiler runs the profiler over the given tokens (using the token entries at index 0) with the given executable and config file. The profiler's output is logged to stderr.
func SendTokens ¶
SendTokens writes tokens into the given output channel. This function should always be used to write tokens into output channels.
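The reason for funnelling all writes through one helper is context cancelation: a bare channel send blocks forever if the receiver has already stopped. A minimal sketch of the pattern, with plain strings standing in for T and hypothetical function names:

```go
package main

import (
	"context"
	"fmt"
)

// sendTokens writes tokens to out, giving up as soon as ctx is done.
func sendTokens(ctx context.Context, out chan<- string, tokens ...string) error {
	for _, t := range tokens {
		select {
		case out <- t:
		case <-ctx.Done():
			return fmt.Errorf("send token: %w", ctx.Err())
		}
	}
	return nil
}

// sendAfterCancel sends on a channel nobody reads from, using an already
// cancelled context. A plain `out <- t` would block forever here.
func sendAfterCancel() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return sendTokens(ctx, make(chan string), "vnd")
}

func main() {
	fmt.Println(sendAfterCancel()) // send token: context canceled
}
```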
Types ¶
type Chars ¶
type Chars []Char
Chars represents the master OCR chars with the respective confidences.
type Correction ¶
Correction represents a correction decision for tokens.
type FeatureFunc ¶
FeatureFunc defines the function a feature needs to implement. A feature func gets a token and a configuration (the current OCR-index i and the total number of parallel OCRs n). The function then should return the feature value for the given token and whether this feature applies for the given configuration (i and n).
type FeatureSet ¶
type FeatureSet []FeatureFunc
FeatureSet is just a list of feature funcs.
func NewFeatureSet ¶
func NewFeatureSet(names ...string) (FeatureSet, error)
NewFeatureSet creates a new feature set from the list of feature function names.
func (FeatureSet) Calculate ¶
func (fs FeatureSet) Calculate(xs []float64, t T, n int) []float64
Calculate calculates the feature vector for the given feature functions for the given token and the given number of OCRs and appends it to the given vector. Any given feature function that does not apply to the given configuration (and returns false as its second return value for the configuration) is omitted and not appended to the resulting feature vector.
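A sketch of how such a calculation could look. The types token and featureFunc are hypothetical stand-ins for T and FeatureFunc, the two example features are made up for illustration, and the assumption that every feature is tried for each OCR index i from 0 to n-1 is not confirmed by the text above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// token and featureFunc are hypothetical stand-ins for T and FeatureFunc.
type token struct {
	Tokens []string // master OCR first, then support OCRs
}

type featureFunc func(t token, i, n int) (float64, bool)

// tokenLen applies to every OCR index i.
func tokenLen(t token, i, n int) (float64, bool) {
	return float64(utf8.RuneCountInString(t.Tokens[i])), true
}

// agreeing counts support OCRs equal to the master OCR. It applies only
// once per token (i == 0) and only if there are support OCRs (n > 1).
func agreeing(t token, i, n int) (float64, bool) {
	if i != 0 || n < 2 {
		return 0, false
	}
	var sum float64
	for _, s := range t.Tokens[1:n] {
		if s == t.Tokens[0] {
			sum++
		}
	}
	return sum, true
}

// calculate appends the values of all applicable features to xs.
// Features that report false for a configuration are skipped.
func calculate(fs []featureFunc, xs []float64, t token, n int) []float64 {
	for _, f := range fs {
		for i := 0; i < n; i++ {
			if x, ok := f(t, i, n); ok {
				xs = append(xs, x)
			}
		}
	}
	return xs
}

func main() {
	t := token{Tokens: []string{"vnd", "und", "vnd"}}
	fmt.Println(calculate([]featureFunc{tokenLen, agreeing}, nil, t, 3)) // [3 3 3 1]
}
```

Because non-applicable features are dropped, the length of the resulting vector depends on n, so vectors computed for different numbers of OCRs are not directly comparable.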
func (FeatureSet) Names ¶ added in v0.0.21
func (fs FeatureSet) Names(names []string, nocr int, dm bool) []string
Names returns the names of the features including the features for different values of OCRs. This function panics if the length of the feature set differs from the length of the given feature names.
type LanguageModel ¶
type LanguageModel struct {
	Ngrams     FreqList
	Unigrams   FreqList
	Profile    gofiler.Profile
	Lexicality float64
}
LanguageModel holds the language model for tokens.
func (*LanguageModel) AddUnigram ¶
func (lm *LanguageModel) AddUnigram(token string)
AddUnigram adds the token to the language model's unigram map.
func (*LanguageModel) EachTrigram ¶
func (lm *LanguageModel) EachTrigram(str string, f func(float64))
EachTrigram looks up the trigrams of the given token and calls the given callback function with the looked-up value of each trigram.
func (*LanguageModel) LoadGzippedNGram ¶
func (lm *LanguageModel) LoadGzippedNGram(path string) error
LoadGzippedNGram loads the (gzipped) ngram model file. The expected format for each line is `%d,%s`.
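A minimal sketch of reading such a file with the standard library. The function names are hypothetical, the result is a plain map rather than a FreqList (whose definition is not shown on this page), and splitting at the first comma instead of using a scanf-style parse is a choice of this sketch so that n-grams containing spaces survive.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// readNGrams reads gzip-compressed lines of the form "<count>,<ngram>".
func readNGrams(r io.Reader) (map[string]int, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	counts := make(map[string]int)
	s := bufio.NewScanner(gz)
	for s.Scan() {
		// Split at the first comma only, so n-grams may contain
		// commas or spaces.
		num, gram, ok := strings.Cut(s.Text(), ",")
		if !ok {
			return nil, fmt.Errorf("invalid line: %q", s.Text())
		}
		n, err := strconv.Atoi(num)
		if err != nil {
			return nil, fmt.Errorf("invalid count in line %q: %w", s.Text(), err)
		}
		counts[gram] += n
	}
	return counts, s.Err()
}

// example compresses a few lines in memory and parses them again.
func example() (map[string]int, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	fmt.Fprint(gz, "12,und\n3,a b\n")
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return readNGrams(&buf)
}

func main() {
	counts, err := example()
	fmt.Println(counts, err)
}
```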
func (*LanguageModel) ReadProfile ¶ added in v0.0.29
func (lm *LanguageModel) ReadProfile(ctx context.Context, exe, config string, cache bool, tokens ...T) error
ReadProfile loads the profile for the master OCR tokens.
func (*LanguageModel) Trigram ¶
func (lm *LanguageModel) Trigram(str string) float64
Trigram looks up the trigrams of the given token and returns the product of the relative frequencies of the token's trigrams.
func (*LanguageModel) TrigramLog ¶ added in v0.0.17
func (lm *LanguageModel) TrigramLog(str string) float64
TrigramLog looks up the trigrams of the given token and returns the sum of the logarithmic relative frequency of the token's trigrams.
func (*LanguageModel) Unigram ¶
func (lm *LanguageModel) Unigram(str string) float64
Unigram looks up the given token in the unigram list and returns its relative frequency (or 0 if the unigram is not present).
type Model ¶
type Model struct {
	Models             map[string]map[int]ModelData
	GlobalHistPatterns map[string]float64
	GlobalOCRPatterns  map[string]float64
	Ngrams             FreqList
}
Model holds the different models for the different number of OCRs.
func ReadModel ¶
ReadModel reads a model from a gob compressed input file. If the given file does not exist, an empty model is returned. If the model does not contain a valid ngram frequency list, the list is loaded from the given path.
func (Model) Get ¶
Get loads the model and the corresponding feature set for the given configuration.
type StreamFunc ¶
StreamFunc is a type def for stream functions. A stream function transforms tokens from the input channel to the output channel. Stream functions should be used with the Pipe function to chain multiple functions together.
func Combine ¶ added in v0.0.14
func Combine(ctx context.Context, fns ...StreamFunc) StreamFunc
Combine lets you combine stream functions. All functions are run concurrently in their own error group.
func ConnectCandidates ¶
func ConnectCandidates() StreamFunc
ConnectCandidates returns a stream function that connects the tokens in the stream with their respective candidates. Tokens with no candidates or tokens with only a modern interpretation are filtered from the stream.
func ConnectCorrections ¶
func ConnectCorrections(lr *ml.LR, fs FeatureSet, n int) StreamFunc
ConnectCorrections connects the tokens with the decider's correction decisions.
func ConnectLM ¶
func ConnectLM(ngrams FreqList) StreamFunc
ConnectLM connects the language model to the tokens.
func ConnectProfile ¶ added in v0.0.29
func ConnectProfile(exe, config string, cache bool) StreamFunc
ConnectProfile connects the profile with the tokens. This function must be called after ConnectLM.
func ConnectRankings ¶
func ConnectRankings(lr *ml.LR, fs FeatureSet, n int) StreamFunc
ConnectRankings connects the tokens of the input stream with their respective rankings.
func ConnectUnigrams ¶ added in v0.0.29
func ConnectUnigrams() StreamFunc
ConnectUnigrams adds the unigrams to the tokens' language model.
func FilterBad ¶
func FilterBad(min int) StreamFunc
FilterBad returns a stream function that filters tokens with not enough OCR and/or GT tokens.
func FilterLexiconEntries ¶
func FilterLexiconEntries() StreamFunc
FilterLexiconEntries returns a stream function that filters all tokens that are lexicon entries from the stream.
func FilterShort ¶
func FilterShort(min int) StreamFunc
FilterShort returns a stream function that filters short master OCR tokens from the input stream. Short tokens are tokens with fewer than min Unicode characters.
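Counting Unicode characters differs from taking len of a Go string, which counts bytes. The sketch below shows the difference with a hypothetical isShort predicate; it is an illustration of the length check only, not the stream function itself.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isShort reports whether tok has fewer than min Unicode characters.
func isShort(tok string, min int) bool {
	return utf8.RuneCountInString(tok) < min
}

func main() {
	fmt.Println(len("für"), utf8.RuneCountInString("für")) // 4 3
	fmt.Println(isShort("für", 4), isShort("für", 3))       // true false
}
```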
func Normalize ¶
func Normalize() StreamFunc
Normalize returns a stream function that trims all leading and trailing punctuation from the tokens, converts them to lowercase and replaces any whitespace (in the case of merges due to alignment) with a '_'.
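The per-token transformation can be sketched as follows. The function name is hypothetical, and punctuation and whitespace are taken to be whatever unicode.IsPunct and unicode.IsSpace report, which may not match the package exactly (for example, symbols such as '$' are not punctuation under that definition).

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalize trims surrounding punctuation, lowercases the token and
// replaces every whitespace rune with an underscore.
func normalize(tok string) string {
	tok = strings.TrimFunc(tok, unicode.IsPunct)
	tok = strings.ToLower(tok)
	return strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return '_'
		}
		return r
	}, tok)
}

func main() {
	fmt.Println(normalize("(Hello World!)")) // hello_world
	fmt.Println(normalize("Vnd,"))           // vnd
}
```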
func Tee ¶ added in v0.0.14
func Tee(fns ...func(T) error) StreamFunc
Tee calls all the given callback functions for each token. After all functions have been called, if the output channel is not nil, the token is sent to the output channel.
type T ¶ added in v0.0.14
type T struct {
	LM      *LanguageModel // language model for this token
	Payload interface{}    // token payload; *gofiler.Candidate, []Ranking or Correction
	Cor     string         // Correction for the token
	File    string         // the file of the token
	Group   string         // file group of the token
	ID      string         // id of the token in this file
	Chars   Chars          // master OCR chars with their confidences
	Tokens  []string       // master and support OCRs and gt
}
T represents aligned OCR-tokens.
func ReadToken ¶
ReadToken reads one token from the given channel. This function should always be used to read single tokens from input channels.
func (T) IsLexiconEntry ¶ added in v0.0.14
IsLexiconEntry returns true if this token is a normal lexicon entry for its connected language model.