Documentation
Overview
Package split provides an extensible interface for splitting strings in meaningful ways: words, sentences, paragraphs, and more.
The MIT License (MIT)
Copyright (c) 2015 Kevin S. Dias
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Index
- func NewIterTokenizer(opts ...TokenizerOptFunc) (*iterTokenizer, error)
- type PragmaticSegmenter
- type PunktSentenceTokenizer
- type RegexpTokenizer
- type Splitter
- type TokenTester
- type TokenizerOptFunc
- func UsingContractions(x []string) TokenizerOptFunc
- func UsingEmoticons(x map[string]int) TokenizerOptFunc
- func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
- func UsingPrefixes(x []string) TokenizerOptFunc
- func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
- func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
- func UsingSplitCases(x []string) TokenizerOptFunc
- func UsingSuffixes(x []string) TokenizerOptFunc
- type TreebankWordTokenizer
Examples
Constants
This section is empty.
Variables
This section is empty.
Functions
func NewIterTokenizer
func NewIterTokenizer(opts ...TokenizerOptFunc) (*iterTokenizer, error)
NewIterTokenizer is a constructor for the default iterTokenizer, optionally configured with the TokenizerOptFunc options listed below.
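A minimal sketch of constructing the default tokenizer. Note that iterTokenizer's method set is not documented above; the call to Split assumes it satisfies the package's Splitter interface, which is an assumption rather than a documented fact:

// Build the default tokenizer with no option overrides.
t, err := NewIterTokenizer()
if err != nil {
	panic(err)
}
// Assumption: *iterTokenizer implements Splitter.
fmt.Println(t.Split("They'll save and invest more."))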
Types
type PragmaticSegmenter
type PragmaticSegmenter struct {
// contains filtered or unexported fields
}
PragmaticSegmenter is a multilingual, rule-based sentence boundary detector.
This is a port of the Ruby library by Kevin S. Dias (https://github.com/diasks2/pragmatic_segmenter).
func NewPragmaticSegmenter
func NewPragmaticSegmenter(lang string) (*PragmaticSegmenter, error)
NewPragmaticSegmenter creates a new PragmaticSegmenter according to the specified language. If the given language is not supported, an error will be returned.
Languages are specified by their two-character ISO 639-1 code. The supported languages are "en" (English), "es" (Spanish), "fr" (French) ... (WIP)
func (*PragmaticSegmenter) Tokenize
func (p *PragmaticSegmenter) Tokenize(text string) []string
Tokenize splits text into sentences.
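A brief usage sketch combining the constructor and Tokenize; the input and the comment on expected behavior are illustrative:

seg, err := NewPragmaticSegmenter("en")
if err != nil {
	panic(err) // returned when the language is unsupported
}
// Expect two sentences for this input.
for _, s := range seg.Tokenize("Hello world. My name is Jonas.") {
	fmt.Println(s)
}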
type PunktSentenceTokenizer
type PunktSentenceTokenizer struct {
// contains filtered or unexported fields
}
PunktSentenceTokenizer is an extension of the Go implementation of the Punkt sentence tokenizer (https://github.com/neurosnap/sentences), with a few minor improvements (see https://github.com/neurosnap/sentences/pull/18).
func NewPunktSentenceTokenizer
func NewPunktSentenceTokenizer() (*PunktSentenceTokenizer, error)
NewPunktSentenceTokenizer creates a new PunktSentenceTokenizer and loads its English model.
func (PunktSentenceTokenizer) Split
func (p PunktSentenceTokenizer) Split(text string) []string
Split splits text into sentences.
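A usage sketch for the Punkt tokenizer, using only the constructor and Split method shown above:

t, err := NewPunktSentenceTokenizer()
if err != nil {
	panic(err) // e.g., the English model failed to load
}
for _, s := range t.Split("This is sentence one. This is sentence two.") {
	fmt.Println(s)
}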
type RegexpTokenizer
type RegexpTokenizer struct {
// contains filtered or unexported fields
}
RegexpTokenizer splits a string into substrings using a regular expression.
func NewBlanklineTokenizer
func NewBlanklineTokenizer() (*RegexpTokenizer, error)
NewBlanklineTokenizer is a RegexpTokenizer constructor.
This tokenizer splits on any sequence of blank lines.
Example
t, err := NewBlanklineTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more.\n\nThanks!"))
Output: [They'll save and invest more. Thanks!]
func NewRegexpTokenizer
func NewRegexpTokenizer(pattern string, gaps, discard bool) (*RegexpTokenizer, error)
NewRegexpTokenizer is a RegexpTokenizer constructor that takes three arguments: the pattern to base the tokenizer on, a boolean indicating whether the pattern matches the separators (gaps) between tokens rather than the tokens themselves, and a boolean indicating whether to discard empty tokens.
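As an illustration, a sketch of a whitespace tokenizer built with NewRegexpTokenizer. Here gaps=true is taken to mean the pattern matches the separators between tokens, and discard=true drops empty tokens; both readings follow the description above but are assumptions about the exact semantics:

// The pattern describes the gaps (runs of whitespace), not the tokens.
t, err := NewRegexpTokenizer(`\s+`, true, true)
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more."))
// Expected, under the assumptions above: [They'll save and invest more.]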
func NewWordBoundaryTokenizer
func NewWordBoundaryTokenizer() (*RegexpTokenizer, error)
NewWordBoundaryTokenizer is a RegexpTokenizer constructor.
This tokenizer splits text into a sequence of word-like tokens.
Example
t, err := NewWordBoundaryTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more."))
Output: [They'll save and invest more]
func NewWordPunctTokenizer
func NewWordPunctTokenizer() (*RegexpTokenizer, error)
NewWordPunctTokenizer is a RegexpTokenizer constructor.
This tokenizer splits text into a sequence of alphabetic and non-alphabetic characters.
Example
t, err := NewWordPunctTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more."))
Output: [They ' ll save and invest more .]
func (RegexpTokenizer) Split
func (r RegexpTokenizer) Split(text string) []string
Split splits text into a slice of tokens according to its regexp pattern.
type Splitter
type Splitter interface {
	Split(text string) []string
}
Splitter is the interface implemented by anything that can split a string into a slice of substrings.
type TokenTester
type TokenTester func(string) bool
TokenTester reports whether a token satisfies some condition (e.g., is unsplittable).
type TokenizerOptFunc
type TokenizerOptFunc func(*iterTokenizer)
TokenizerOptFunc is a functional option for configuring the tokenizer built by NewIterTokenizer.
func UsingContractions
func UsingContractions(x []string) TokenizerOptFunc
UsingContractions uses the provided contractions.
func UsingEmoticons
func UsingEmoticons(x map[string]int) TokenizerOptFunc
UsingEmoticons uses the provided map of emoticons.
func UsingIsUnsplittable
func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
UsingIsUnsplittable sets the function that tests whether a token is unsplittable.
func UsingPrefixes
func UsingPrefixes(x []string) TokenizerOptFunc
UsingPrefixes uses the provided prefixes.
func UsingSanitizer
func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
UsingSanitizer uses the provided sanitizer.
func UsingSpecialRE
func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
UsingSpecialRE uses the provided special regex for unsplittable tokens.
func UsingSplitCases
func UsingSplitCases(x []string) TokenizerOptFunc
UsingSplitCases uses the provided splitCases.
func UsingSuffixes
func UsingSuffixes(x []string) TokenizerOptFunc
UsingSuffixes uses the provided suffixes.
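The options above compose with NewIterTokenizer. A hedged sketch; the option values are illustrative, and the TokenTester literal assumes the func(string) bool signature given earlier:

t, err := NewIterTokenizer(
	// Treat these additional strings as contractions to split.
	UsingContractions([]string{"y'all"}),
	// Keep URL-like tokens intact.
	UsingIsUnsplittable(func(tok string) bool {
		return strings.HasPrefix(tok, "http://") || strings.HasPrefix(tok, "https://")
	}),
)
if err != nil {
	panic(err)
}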
type TreebankWordTokenizer
type TreebankWordTokenizer struct{}
TreebankWordTokenizer splits a sentence into words.
This implementation is a port of the Sed script written by Robert McIntyre, which is available at https://gist.github.com/jdkato/fc8b8c4266dba22d45ac85042ae53b1e.
func NewTreebankWordTokenizer
func NewTreebankWordTokenizer() (*TreebankWordTokenizer, error)
NewTreebankWordTokenizer is a TreebankWordTokenizer constructor.
Example
t, err := NewTreebankWordTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more."))
Output: [They 'll save and invest more .]
func (TreebankWordTokenizer) Split
func (t TreebankWordTokenizer) Split(text string) []string
Split splits a sentence into a slice of words.
This tokenizer performs the following steps: (1) split on contractions (e.g., "don't" -> [do n't]), (2) split on non-terminating punctuation, (3) split on single quotes when followed by whitespace, and (4) split on periods that appear at the end of lines.
NOTE: As mentioned above, this function expects a sentence (not raw text) as input.
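To make the steps concrete, a sketch of rule (1), contraction splitting; the commented output is what the rule implies rather than verified program output:

t, err := NewTreebankWordTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("Don't blame us."))
// Rule (1) splits "Don't" into [Do n't], so expect: [Do n't blame us .]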