sego

package module

v0.0.0-...-feafb84 Published: Jul 13, 2022 License: Apache-2.0 Imports: 13 Imported by: 0

Repository

github.com/sajari/sego

README ¶

sego

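Chinese word segmentation in Go

The dictionary is implemented as a double-array trie, and the segmenter uses a frequency-based shortest-path algorithm with dynamic programming.

It supports two segmentation modes, normal and search-engine, as well as user dictionaries and part-of-speech tagging, and it can run as a JSON RPC service.

Segmentation speed is 9 MB/s single-threaded and 42 MB/s with concurrent goroutines (8-core MacBook Pro).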
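The frequency-based shortest-path idea can be illustrated with a small sketch. This is not sego's implementation: it uses a plain map where sego uses a double-array trie, and the `segment` helper, the sample words, and their frequencies are all invented for illustration. Each candidate word is treated as an edge whose cost is log(total) - log(frequency), and dynamic programming picks the cheapest path through the text.

```go
package main

import (
	"fmt"
	"math"
)

// segment splits text into words by minimizing the total path cost, where a
// dictionary word costs log(total) - log(freq). Single runes that are not in
// the dictionary remain usable, at a cost slightly above that of a word seen once.
func segment(text string, freq map[string]int) []string {
	runes := []rune(text)
	n := len(runes)

	total, maxLen := 0, 1
	for w, f := range freq {
		total += f
		if l := len([]rune(w)); l > maxLen {
			maxLen = l
		}
	}
	unknown := math.Log(float64(total) + 1)

	// best[i] is the minimal cost of segmenting runes[:i];
	// prev[i] is the start index of the last word on that cheapest path.
	best := make([]float64, n+1)
	prev := make([]int, n+1)
	for i := 1; i <= n; i++ {
		best[i] = math.Inf(1)
		for j := i - 1; j >= 0 && i-j <= maxLen; j-- {
			cost := math.Inf(1)
			if f, ok := freq[string(runes[j:i])]; ok && f > 0 {
				cost = math.Log(float64(total)) - math.Log(float64(f))
			} else if i-j == 1 {
				cost = unknown
			}
			if best[j]+cost < best[i] {
				best[i] = best[j] + cost
				prev[i] = j
			}
		}
	}

	// Walk prev backwards to recover the words, then reverse them.
	var words []string
	for i := n; i > 0; i = prev[i] {
		words = append(words, string(runes[prev[i]:i]))
	}
	for l, r := 0, len(words)-1; l < r; l, r = l+1, r-1 {
		words[l], words[r] = words[r], words[l]
	}
	return words
}

func main() {
	// Hypothetical dictionary with made-up frequencies.
	freq := map[string]int{"中国": 50, "有": 100, "十三亿": 5, "人口": 30, "中": 20, "国": 20}
	fmt.Println(segment("中国有十三亿人口", freq))
}
```

With these made-up frequencies the cheapest path keeps "中国" whole, because one edge for the two-character word costs less than two edges for "中" and "国".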

Install/Update

go get -u github.com/huichen/sego

Building dictionary file

go get -u github.com/go-bindata/go-bindata
go generate

Usage

package main

import (
	"fmt"
	"github.com/huichen/sego"
)

func main() {
	// Load the dictionary.
	var segmenter sego.Segmenter
	segmenter.LoadDictionary("github.com/huichen/sego/data/dictionary.txt")

	// Segment the text.
	text := []byte("中华人民共和国中央人民政府")
	segments := segmenter.Segment(text)

	// Process the segmentation result.
	// Both normal and search modes are supported; see the comment on SegmentsToString in the source.
	fmt.Println(sego.SegmentsToString(segments, false))
}

Documentation ¶

Overview ¶

Chinese word segmentation in Go

Index ¶

func Join(a []Text) string
func SegmentsToSlice(segs []Segment, searchMode bool) (output []string)
func SegmentsToString(segs []Segment, searchMode bool) (output string)
type Dictionary
- func NewDictionary() *Dictionary
type Segment
type Segmenter
- func DefaultSegmenter() *Segmenter
type Text
type Token

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Join ¶

func Join(a []Text) string

func SegmentsToSlice ¶

func SegmentsToSlice(segs []Segment, searchMode bool) (output []string)

func SegmentsToString ¶

func SegmentsToString(segs []Segment, searchMode bool) (output string)

Types ¶

type Dictionary ¶

type Dictionary struct {
	// contains filtered or unexported fields
}

Dictionary implements a string prefix tree. A token may sit at a leaf node or at a non-leaf node.
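The remark about leaf and non-leaf nodes can be made concrete with a minimal map-based prefix tree. This is only a sketch: `node`, `insert`, and `contains` are hypothetical names, and the real Dictionary keeps its fields unexported, so nothing here reflects its internal layout.

```go
package main

import "fmt"

// node is one prefix-tree node keyed by rune. isToken marks that the path
// from the root to this node spells a complete dictionary entry.
type node struct {
	children map[rune]*node
	isToken  bool
}

func newNode() *node { return &node{children: make(map[rune]*node)} }

// insert adds word to the tree rooted at root.
func insert(root *node, word string) {
	cur := root
	for _, r := range word {
		next, ok := cur.children[r]
		if !ok {
			next = newNode()
			cur.children[r] = next
		}
		cur = next
	}
	cur.isToken = true
}

// contains reports whether word was inserted as a complete entry.
func contains(root *node, word string) bool {
	cur := root
	for _, r := range word {
		next, ok := cur.children[r]
		if !ok {
			return false
		}
		cur = next
	}
	return cur.isToken
}

func main() {
	root := newNode()
	insert(root, "中国")
	insert(root, "中国人")

	fmt.Println(contains(root, "中国"))  // token stored at a non-leaf node
	fmt.Println(contains(root, "中国人")) // token stored at a leaf node
	fmt.Println(contains(root, "中"))   // a prefix only, not a token
}
```

Because "中国" is a prefix of "中国人", its node has a child and is still marked as a token, which is the non-leaf case the description mentions.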

func NewDictionary ¶

func NewDictionary() *Dictionary

func (*Dictionary) MaxTokenLength ¶

func (dict *Dictionary) MaxTokenLength() int

Returns the length of the longest token in the dictionary.

func (*Dictionary) NumTokens ¶

func (dict *Dictionary) NumTokens() int

Returns the number of tokens in the dictionary.

func (*Dictionary) TotalFrequency ¶

func (dict *Dictionary) TotalFrequency() int64

Returns the sum of the frequencies of all tokens in the dictionary.

type Segment ¶

func (*Segment) End ¶

func (s *Segment) End() int

Returns the byte position in the text where the segment ends (exclusive).

func (*Segment) Start ¶

func (s *Segment) Start() int

Returns the byte position in the text where the segment starts.
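Since Start and End are byte positions rather than character positions, a half-open byte range can be used to slice the original input directly. The sketch below uses a stand-in `span` type and hand-written offsets, not values produced by sego.

```go
package main

import "fmt"

// span mimics the half-open byte range [start, end) described for Segment.
// It is a stand-in for this example, not sego's Segment type.
type span struct{ start, end int }

// slice returns the part of text covered by s.
func slice(text []byte, s span) string {
	return string(text[s.start:s.end])
}

func main() {
	text := []byte("中国有十三亿人口")
	// Each of these CJK characters occupies 3 bytes in UTF-8,
	// so the offsets advance in multiples of 3.
	spans := []span{{0, 6}, {6, 9}, {9, 18}, {18, 24}}
	for _, s := range spans {
		fmt.Printf("[%d,%d) %s\n", s.start, s.end, slice(text, s))
	}
}
```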

func (*Segment) Token ¶

func (s *Segment) Token() *Token

Returns the token information for the segment.

type Segmenter ¶

func DefaultSegmenter ¶

func DefaultSegmenter() *Segmenter

DefaultSegmenter creates a new Segmenter with the default dictionary loaded

func (*Segmenter) Dictionary ¶

func (seg *Segmenter) Dictionary() *Dictionary

Dictionary returns the dictionary

func (*Segmenter) InternalSegment ¶

func (seg *Segmenter) InternalSegment(bytes []byte, searchMode bool) []Segment

func (*Segmenter) LoadDefaultDictionary ¶

func (seg *Segmenter) LoadDefaultDictionary()

LoadDefaultDictionary loads the default dictionary stored in data

func (*Segmenter) LoadDictionary ¶

func (seg *Segmenter) LoadDictionary(files ...string) error

LoadDictionary loads a dictionary from a file

Multiple dictionary files can be loaded, with filenames separated by ",", for example "User Dictionary.txt, Common Dictionary.txt". When a token appears in both the user dictionary and the common dictionary, the entry from the user dictionary takes precedence.

The dictionary format is one token per line: token text, frequency, part of speech.

func (*Segmenter) LoadDictionaryFromReader ¶

func (seg *Segmenter) LoadDictionaryFromReader(r io.Reader)

LoadDictionaryFromReader loads a dictionary from an io.Reader

The dictionary format is one token per line: token text, frequency, part of speech.

func (*Segmenter) Segment ¶

func (seg *Segmenter) Segment(bytes []byte) []Segment

Segments the text.

Input:

bytes	UTF-8 text as a byte slice

Output:

[]Segment	the resulting segments

type Text ¶

type Text []byte

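A string type that can represent:

a single character unit, such as "中" or "国"; in English, one character unit is one word
a token, such as "中国" or "人口"
a passage of text, such as "中国有十三亿人口"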

type Token ¶

type Token struct {
	// contains filtered or unexported fields
}

A single token.

func (*Token) Frequency ¶

func (token *Token) Frequency() int

Returns the token's frequency in the corpus.

func (*Token) Pos ¶

func (token *Token) Pos() string

Returns the token's part-of-speech tag.

func (*Token) Segments ¶

func (token *Token) Segments() []*Segment

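Returns the finer-grained segmentation of this token's text. For example, the token "中华人民共和国中央人民政府" has two sub-tokens, "中华人民共和国" and "中央人民政府". Sub-tokens can in turn have their own sub-tokens, forming a tree; traversing the tree yields every fine-grained segmentation of the token. This is mainly used by search engines for full-text search over a passage of text.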
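Traversing such a tree might look like the depth-first walk below. The `tok` type and `collect` function are hypothetical stand-ins, since the real Token fields are unexported, and the third-level split of "中华人民共和国" is invented purely to show nesting.

```go
package main

import "fmt"

// tok is a hypothetical stand-in for a token with nested sub-tokens;
// it is not sego's Token type.
type tok struct {
	text string
	subs []*tok
}

// collect appends t's text and then, depth first, the text of every
// nested sub-token.
func collect(t *tok, out []string) []string {
	out = append(out, t.text)
	for _, s := range t.subs {
		out = collect(s, out)
	}
	return out
}

func main() {
	root := &tok{
		text: "中华人民共和国中央人民政府",
		subs: []*tok{
			// The three-way split below is made up for illustration.
			{text: "中华人民共和国", subs: []*tok{{text: "中华"}, {text: "人民"}, {text: "共和国"}}},
			{text: "中央人民政府"},
		},
	}
	for _, s := range collect(root, nil) {
		fmt.Println(s)
	}
}
```

A search index could store every string returned by such a walk, so that a query for a shorter sub-token still matches text that was segmented as the longer token.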

func (*Token) Text ¶

func (token *Token) Text() string

Returns the token's text.

func (*Token) TextEquals ¶

func (token *Token) TextEquals(string string) bool

Directories ¶

Path	Synopsis
data
server