Documentation
Overview
Package split provides an extensible interface for splitting strings in meaningful ways: words, sentences, paragraphs, and more.
The MIT License (MIT)
Copyright (c) 2015 Kevin S. Dias
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Index
- func NewIterTokenizer(opts ...TokenizerOptFunc) (*iterTokenizer, error)
- type PragmaticSegmenter
- type PunktSentenceTokenizer
- type RegexpTokenizer
- type Splitter
- type TokenTester
- type TokenizerOptFunc
- func UsingContractions(x []string) TokenizerOptFunc
- func UsingEmoticons(x map[string]int) TokenizerOptFunc
- func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
- func UsingPrefixes(x []string) TokenizerOptFunc
- func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
- func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
- func UsingSplitCases(x []string) TokenizerOptFunc
- func UsingSuffixes(x []string) TokenizerOptFunc
- type TreebankWordTokenizer
Examples
Constants
This section is empty.
Variables
This section is empty.
Functions
func NewIterTokenizer
func NewIterTokenizer(opts ...TokenizerOptFunc) (*iterTokenizer, error)
NewIterTokenizer is a constructor for the default iterTokenizer, optionally configured with the TokenizerOptFunc options listed below.
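A minimal sketch of constructing the default tokenizer. Note that iterTokenizer's method set is not documented above; the call to Split assumes it satisfies the package's Splitter interface, which is an assumption rather than a documented fact:

// Build the default tokenizer with no option overrides.
t, err := NewIterTokenizer()
if err != nil {
	panic(err)
}
// Assumption: *iterTokenizer implements Splitter.
fmt.Println(t.Split("They'll save and invest more."))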
Types
type PragmaticSegmenter
type PragmaticSegmenter struct {
// contains filtered or unexported fields
}
PragmaticSegmenter is a multilingual, rule-based sentence boundary detector.
This is a port of the Ruby library by Kevin S. Dias (https://github.com/diasks2/pragmatic_segmenter).
func NewPragmaticSegmenter
func NewPragmaticSegmenter(lang string) (*PragmaticSegmenter, error)
NewPragmaticSegmenter creates a new PragmaticSegmenter according to the specified language. If the given language is not supported, an error will be returned.
Languages are specified by their two-character ISO 639-1 code. The supported languages are "en" (English), "es" (Spanish), "fr" (French) ... (WIP)
func (*PragmaticSegmenter) Tokenize
func (p *PragmaticSegmenter) Tokenize(text string) []string
Tokenize splits text into sentences.
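A brief usage sketch combining the constructor and Tokenize; the input and the comment on expected behavior are illustrative:

seg, err := NewPragmaticSegmenter("en")
if err != nil {
	panic(err) // returned when the language is unsupported
}
// Expect two sentences for this input.
for _, s := range seg.Tokenize("Hello world. My name is Jonas.") {
	fmt.Println(s)
}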
type PunktSentenceTokenizer
type PunktSentenceTokenizer struct {
// contains filtered or unexported fields
}
PunktSentenceTokenizer is an extension of the Go implementation of the Punkt sentence tokenizer (https://github.com/neurosnap/sentences), with a few minor improvements (see https://github.com/neurosnap/sentences/pull/18).
func NewPunktSentenceTokenizer
func NewPunktSentenceTokenizer() (*PunktSentenceTokenizer, error)
NewPunktSentenceTokenizer creates a new PunktSentenceTokenizer and loads its English model.
func (PunktSentenceTokenizer) Split
func (p PunktSentenceTokenizer) Split(text string) []string
Split splits text into sentences.
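A usage sketch for the Punkt tokenizer, using only the constructor and Split method shown above:

t, err := NewPunktSentenceTokenizer()
if err != nil {
	panic(err) // e.g., the English model failed to load
}
for _, s := range t.Split("This is sentence one. This is sentence two.") {
	fmt.Println(s)
}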
type RegexpTokenizer
type RegexpTokenizer struct {
// contains filtered or unexported fields
}
RegexpTokenizer splits a string into substrings using a regular expression.
func NewBlanklineTokenizer
func NewBlanklineTokenizer() (*RegexpTokenizer, error)
NewBlanklineTokenizer is a RegexpTokenizer constructor.
This tokenizer splits on any sequence of blank lines.
Example
t, err := NewBlanklineTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more.\n\nThanks!"))
Output: [They'll save and invest more. Thanks!]
func NewRegexpTokenizer
func NewRegexpTokenizer(pattern string, gaps, discard bool) (*RegexpTokenizer, error)
NewRegexpTokenizer is a RegexpTokenizer constructor that takes three arguments: the pattern to base the tokenizer on, a boolean indicating whether the pattern matches the separators (gaps) between tokens rather than the tokens themselves, and a boolean indicating whether to discard empty tokens.
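As an illustration, a sketch of a whitespace tokenizer built with NewRegexpTokenizer. Here gaps=true is taken to mean the pattern matches the separators between tokens, and discard=true drops empty tokens; both readings follow the description above but are assumptions about the exact semantics:

// The pattern describes the gaps (runs of whitespace), not the tokens.
t, err := NewRegexpTokenizer(`\s+`, true, true)
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more."))
// Expected, under the assumptions above: [They'll save and invest more.]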
func NewWordBoundaryTokenizer
func NewWordBoundaryTokenizer() (*RegexpTokenizer, error)
NewWordBoundaryTokenizer is a RegexpTokenizer constructor.
This tokenizer splits text into a sequence of word-like tokens.
Example
t, err := NewWordBoundaryTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more."))
Output: [They'll save and invest more]
func NewWordPunctTokenizer
func NewWordPunctTokenizer() (*RegexpTokenizer, error)
NewWordPunctTokenizer is a RegexpTokenizer constructor.
This tokenizer splits text into a sequence of alphabetic and non-alphabetic characters.
Example
t, err := NewWordPunctTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more."))
Output: [They ' ll save and invest more .]
func (RegexpTokenizer) Split
func (r RegexpTokenizer) Split(text string) []string
Split splits text into a slice of tokens according to its regexp pattern.
type Splitter
type Splitter interface {
	Split(text string) []string
}
Splitter is the interface implemented by anything that can split a string into a slice of substrings.
type TokenTester
type TokenTester func(string) bool
TokenTester reports whether a token satisfies some condition (e.g., is unsplittable).
type TokenizerOptFunc
type TokenizerOptFunc func(*iterTokenizer)
TokenizerOptFunc is a functional option for configuring the tokenizer built by NewIterTokenizer.
func UsingContractions
func UsingContractions(x []string) TokenizerOptFunc
UsingContractions uses the provided contractions.
func UsingEmoticons
func UsingEmoticons(x map[string]int) TokenizerOptFunc
UsingEmoticons uses the provided map of emoticons.
func UsingIsUnsplittable
func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
UsingIsUnsplittable sets the function that tests whether a token is unsplittable.
func UsingPrefixes
func UsingPrefixes(x []string) TokenizerOptFunc
UsingPrefixes uses the provided prefixes.
func UsingSanitizer
func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
UsingSanitizer uses the provided sanitizer.
func UsingSpecialRE
func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
UsingSpecialRE uses the provided special regex for unsplittable tokens.
func UsingSplitCases
func UsingSplitCases(x []string) TokenizerOptFunc
UsingSplitCases uses the provided splitCases.
func UsingSuffixes
func UsingSuffixes(x []string) TokenizerOptFunc
UsingSuffixes uses the provided suffixes.
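The options above compose with NewIterTokenizer. A hedged sketch; the option values are illustrative, and the TokenTester literal assumes the func(string) bool signature given earlier:

t, err := NewIterTokenizer(
	// Treat these additional strings as contractions to split.
	UsingContractions([]string{"y'all"}),
	// Keep URL-like tokens intact.
	UsingIsUnsplittable(func(tok string) bool {
		return strings.HasPrefix(tok, "http://") || strings.HasPrefix(tok, "https://")
	}),
)
if err != nil {
	panic(err)
}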
type TreebankWordTokenizer
type TreebankWordTokenizer struct{}
TreebankWordTokenizer splits a sentence into words.
This implementation is a port of the Sed script written by Robert McIntyre, which is available at https://gist.github.com/jdkato/fc8b8c4266dba22d45ac85042ae53b1e.
func NewTreebankWordTokenizer
func NewTreebankWordTokenizer() (*TreebankWordTokenizer, error)
NewTreebankWordTokenizer is a TreebankWordTokenizer constructor.
Example
t, err := NewTreebankWordTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more."))
Output: [They 'll save and invest more .]
func (TreebankWordTokenizer) Split
func (t TreebankWordTokenizer) Split(text string) []string
Split splits a sentence into a slice of words.
This tokenizer performs the following steps: (1) split on contractions (e.g., "don't" -> [do n't]), (2) split on non-terminating punctuation, (3) split on single quotes when followed by whitespace, and (4) split on periods that appear at the end of lines.
NOTE: As mentioned above, this function expects a sentence (not raw text) as input.
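To make the steps concrete, a sketch of rule (1), contraction splitting; the commented output is what the rule implies rather than verified program output:

t, err := NewTreebankWordTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("Don't blame us."))
// Rule (1) splits "Don't" into [Do n't], so expect: [Do n't blame us .]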