mirror of
https://github.com/fxsjy/jieba.git
synced 2025-07-10 00:01:33 +08:00
commit
0d99ebce54
@ -1,3 +1,10 @@
2014-02-07: version 0.32
1. New segmentation option: new-word discovery can be turned off (`HMM=False`); see https://github.com/fxsjy/jieba/blob/master/test/test_no_hmm.py#L8
2. Fixed bugs in the posseg submodule; see https://github.com/fxsjy/jieba/issues/111 and https://github.com/fxsjy/jieba/issues/132
3. ChineseAnalyzer now has better English support (thanks @jannson), such as word stemming; see https://github.com/fxsjy/jieba/pull/106
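A minimal sketch of how the new option from item 1 could be exercised. It relies only on the `HMM` keyword that `cut` gains in this commit; the helper name `compare_hmm` is hypothetical and not part of jieba.

```python
def compare_hmm(cut, sentence):
    """Segment `sentence` twice: with and without new-word discovery.

    `cut` is expected to behave like `jieba.cut` in this commit, i.e. accept
    an `HMM` keyword argument and return an iterable of words.
    """
    with_hmm = list(cut(sentence, HMM=True))
    without_hmm = list(cut(sentence, HMM=False))
    return with_hmm, without_hmm


# Usage, assuming jieba >= 0.32 is installed:
#   import jieba
#   with_hmm, without_hmm = compare_hmm(jieba.cut, "他来到了网易杭研大厦")
```

Passing the segmenter in as an argument keeps the helper independent of any particular jieba installation.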
|
||||
|
||||
|
||||
|
||||
2013-07-01: version 0.31
|
||||
1. Reformatted the code indentation to follow PEP 8
2. Added support for the Jython interpreter, thanks to @piaolingxue
|
||||
|
122
README.md
@ -14,9 +14,9 @@ jieba
|
||||
Feature
|
||||
========
|
||||
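* Supports three segmentation modes:
* Accurate mode: tries to cut the sentence into the most accurate segmentation; suitable for text analysis.
* Full mode: scans out every word that can be formed from the sentence; very fast, but cannot resolve ambiguity.
* Search engine mode: based on accurate mode, splits long words again to improve recall; suitable for search engine tokenization.
* Accurate mode: tries to cut the sentence into the most accurate segmentation; suitable for text analysis.
* Full mode: scans out every word that can be formed from the sentence; very fast, but cannot resolve ambiguity.
* Search engine mode: based on accurate mode, splits long words again to improve recall; suitable for search engine tokenization.

* Supports Traditional Chinese segmentation
* Supports custom dictionaries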
|
||||
@ -32,6 +32,7 @@ http://jiebademo.ap01.aws.af.cm/
|
||||
Website code: https://github.com/fxsjy/jiebademo


Installation on Python 2.x
===================
* Fully automatic installation: `easy_install jieba` or `pip install jieba`
|
||||
@ -54,6 +55,21 @@ Python 3.x 下的安装
|
||||
Author: piaolingxue
URL: https://github.com/huaban/jieba-analysis

Jieba C++ version
================
Author: Aszxqw
URL: https://github.com/aszxqw/cppjieba

Jieba Node.js version
================
Author: Aszxqw
URL: https://github.com/aszxqw/nodejieba

Jieba Erlang version
================
Author: falood
https://github.com/falood/exjieba
|
||||
|
||||
Algorithm
|
||||
========
|
||||
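* Efficient word-graph scanning based on a Trie structure, generating a directed acyclic graph (DAG) of all the possible words that the Chinese characters in the sentence can form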
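The word-graph idea can be sketched with a toy dictionary. This is an illustration under assumptions, not jieba's implementation: the names `build_prefix_set` and `get_dag` are hypothetical, and a plain set of prefixes stands in for the Trie.

```python
def build_prefix_set(words):
    """Collect every prefix of every dictionary word (a flattened trie)."""
    prefixes = set()
    for word in words:
        for end in range(1, len(word) + 1):
            prefixes.add(word[:end])
    return prefixes


def get_dag(sentence, words):
    """Map each start index to the end indices of dictionary words starting there."""
    vocabulary = set(words)
    prefixes = build_prefix_set(vocabulary)
    dag = {}
    for start in range(len(sentence)):
        ends = []
        end = start
        # Extend the candidate while it is still a prefix of some dictionary word.
        while end < len(sentence) and sentence[start:end + 1] in prefixes:
            if sentence[start:end + 1] in vocabulary:
                ends.append(end)
            end += 1
        # A character that starts no known word still forms a one-character node.
        dag[start] = ends or [start]
    return dag
```

Each key is a character position and each value lists where a word beginning at that position may end, so every path through the mapping is one candidate segmentation.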
|
||||
@ -76,11 +92,11 @@ Algorithm
|
||||
print("Full Mode:", "/ ".join(seg_list)) # full mode

seg_list = jieba.cut("我来到北京清华大学",cut_all=False)
print("Default Mode:", "/ ".join(seg_list)) # default mode
print("Default Mode:", "/ ".join(seg_list)) # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦") # accurate mode is the default
print ", ".join(seg_list)
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") # search engine mode
|
||||
@ -88,13 +104,13 @@ Algorithm
|
||||
|
||||
Output:
|
||||
|
||||
[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Accurate Mode]: 我/ 来到/ 北京/ 清华大学
[Accurate Mode]: 我/ 来到/ 北京/ 清华大学

[New Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦 (here "杭研" is not in the dictionary, but the Viterbi algorithm still identifies it)
[New Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦 (here "杭研" is not in the dictionary, but the Viterbi algorithm still identifies it)

[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
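The search engine output above can be reproduced in spirit with a small sketch. It assumes the words have already been segmented in accurate mode and that `vocabulary` is a toy stand-in for the dictionary. The function name `cut_for_search_sketch` is hypothetical, and the 3-gram step is modeled on the `gram3` handling visible in `tokenize` rather than confirmed for `cut_for_search`.

```python
def cut_for_search_sketch(words, vocabulary):
    """Yield search-engine-style tokens for already-segmented `words`.

    For each word longer than two characters, emit its in-vocabulary
    2-grams, then (for words longer than three characters) its
    in-vocabulary 3-grams, and finally the word itself.
    """
    for word in words:
        if len(word) > 2:
            for i in range(len(word) - 1):
                gram2 = word[i:i + 2]
                if gram2 in vocabulary:
                    yield gram2
        if len(word) > 3:
            for i in range(len(word) - 2):
                gram3 = word[i:i + 3]
                if gram3 in vocabulary:
                    yield gram3
        yield word
```

Emitting the shorter dictionary words before the long word is what raises recall: a query for 科学院 can then match a document that only contained 中国科学院.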
|
||||
|
||||
Function 2): Add a custom dictionary
|
||||
================
|
||||
@ -104,16 +120,16 @@ Output:
|
||||
* The dictionary format is the same as `dict.txt`: one word per line; each line has three space-separated parts: the word, its frequency, and finally its part-of-speech tag (optional)
* Example:

* Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
* Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt

* Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py

* Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py

* Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /

* After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
* Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /

* After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /

* "Using a user-defined dictionary to improve ambiguity correction" --- https://github.com/fxsjy/jieba/issues/14
|
||||
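The dictionary format described above (word, frequency, optional tag, separated by spaces) can be parsed roughly as follows. This is a sketch modeled on the checks visible in `load_userdict` in this commit (blank-line skip, `isdigit` test, BOM removal on the first line). The function name `parse_userdict` and the returned tuple layout are assumptions.

```python
def parse_userdict(text):
    """Parse 'word freq [tag]' lines into (word, freq, tag) tuples."""
    entries = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip()
        if line == "":
            continue
        parts = line.split(" ")
        if len(parts) < 2 or not parts[1].isdigit():
            continue  # skip malformed lines instead of raising
        word, freq = parts[0], int(parts[1])
        if line_no == 1:
            word = word.replace("\ufeff", "")  # drop a UTF-8 BOM if present
        tag = parts[2] if len(parts) == 3 else None
        entries.append((word, freq, tag))
    return entries
```

Skipping lines whose frequency is not a number keeps one bad entry from aborting the whole load, which matches the `freq.isdigit()` guard this commit adds.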
|
||||
Function 3): Keyword Extraction
|
||||
@ -124,33 +140,33 @@ Output:
|
||||
|
||||
Code sample (keyword extraction)

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py

Function 4): Part-of-Speech Tagging
================
* Tags the part of speech of each word after segmentation, using a tag set compatible with ictclas
* Usage example
|
||||
|
||||
>>> import jieba.posseg as pseg
|
||||
>>> words = pseg.cut("我爱北京天安门")
|
||||
>>> for w in words:
|
||||
... print w.word, w.flag
|
||||
...
|
||||
我 r
|
||||
爱 v
|
||||
北京 ns
|
||||
天安门 ns
|
||||
|
||||
>>> import jieba.posseg as pseg
|
||||
>>> words = pseg.cut("我爱北京天安门")
|
||||
>>> for w in words:
|
||||
... print(w.word, w.flag)
|
||||
...
|
||||
我 r
|
||||
爱 v
|
||||
北京 ns
|
||||
天安门 ns
|
||||
|
||||
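Function 5): Parallel Segmentation
==================
* Principle: split the target text by line, distribute the lines across multiple Python processes for parallel segmentation, then merge the results, which gives a considerable speedup
* Based on Python's built-in multiprocessing module; Windows is not supported yet
* Usage:
* `jieba.enable_parallel(4)` # enable parallel segmentation; the argument is the number of parallel processes
* `jieba.disable_parallel()` # disable parallel segmentation
* `jieba.enable_parallel(4)` # enable parallel segmentation; the argument is the number of parallel processes
* `jieba.disable_parallel()` # disable parallel segmentation

* Example:
https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py

* Benchmark: on a 4-core 3.4GHz Linux machine, accurate-mode segmentation of the complete works of Jin Yong ran at 1MB/s, 3.3 times as fast as the single-process version.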
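The split, map and merge principle described above can be sketched without committing to a particular pool. The splitting pattern `([\r\n]+)` is the one used by `pcut` in this commit. The function name `parallel_cut` and the `cut_line` and `map_fn` parameters are hypothetical.

```python
import re

LINE_BREAKS = re.compile(r"([\r\n]+)")


def parallel_cut(text, cut_line, map_fn=map):
    """Split `text` on line breaks, segment each part, and merge in order.

    `cut_line` turns one part into a list of words. `map_fn` may be the
    built-in `map` or the `map` method of a `multiprocessing.Pool`.
    """
    # The capturing group keeps the line breaks as their own parts,
    # so concatenating all words reproduces the original text.
    parts = LINE_BREAKS.split(text)
    for words in map_fn(cut_line, parts):
        for word in words:
            yield word
```

With a real process pool, `cut_line` would need to be a picklable top-level function, and `pool.map` would be passed as `map_fn`. Because `map` preserves input order, the merged words come out in document order.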
|
||||
|
||||
@ -190,8 +206,8 @@ word 有限 start: 6 end:8
|
||||
word 公司 start: 8 end:10
|
||||
word 有限公司 start: 6 end:10
|
||||
```
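The offsets in the output above follow from a running character position. A minimal sketch of the default mode, assuming the words are already segmented; the name `tokenize_default` is hypothetical, and search mode (which also emits shorter sub-words) is not covered.

```python
def tokenize_default(words):
    """Yield (word, start, end) for consecutive words, with end exclusive."""
    start = 0
    for word in words:
        width = len(word)
        yield (word, start, start + width)
        start += width
```

Offsets count characters of the `str` input, not bytes, which is why `tokenize` rejects non-`str` input in this commit.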
|
||||
|
||||
|
||||
|
||||
|
||||
Function 7): ChineseAnalyzer for the Whoosh search engine
============================================
* Import: `from jieba.analyse import ChineseAnalyzer`
|
||||
@ -215,7 +231,7 @@ https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
|
||||
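jieba uses lazy loading: "import jieba" does not load the dictionary immediately; the dictionary is loaded and the trie is built only when first needed. If you prefer, you can also initialize jieba manually.

import jieba
jieba.initialize() #manual initialization (optional)
jieba.initialize() # manual initialization (optional)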
|
||||
|
||||
|
||||
In versions before 0.28 the path of the main dictionary could not be specified; with lazy loading, you can now change the main dictionary path:
|
||||
@ -280,30 +296,30 @@ Function 1): cut
|
||||
Code example: segmentation
|
||||
==========
|
||||
|
||||
#encoding=utf-8
import jieba
#encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode:", "/ ".join(seg_list)) # full mode
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode:", "/ ".join(seg_list)) # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode:", "/ ".join(seg_list)) # default mode
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode:", "/ ".join(seg_list)) # default mode

seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))
seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") # search engine mode
print(", ".join(seg_list))
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") # search engine mode
print(", ".join(seg_list))
|
||||
|
||||
Output:
|
||||
|
||||
[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
|
||||
[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
|
||||
|
||||
[Accurate Mode]: 我/ 来到/ 北京/ 清华大学
|
||||
[Accurate Mode]: 我/ 来到/ 北京/ 清华大学
|
||||
|
||||
[Unknown Word Recognition] 他, 来到, 了, 网易, 杭研, 大厦 (In this case, "杭研" is not in the dictionary, but is identified by the Viterbi algorithm)
[Unknown Word Recognition] 他, 来到, 了, 网易, 杭研, 大厦 (In this case, "杭研" is not in the dictionary, but is identified by the Viterbi algorithm)
|
||||
|
||||
[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在
|
||||
[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在
|
||||
, 日本, 京都, 大学, 日本京都大学, 深造
|
||||
|
||||
|
||||
@ -315,13 +331,13 @@ Function 2): Add a custom dictionary
|
||||
* The dictionary format is the same as that of `analyse/idf.txt`: one word per line; each line is divided into two parts, the first is the word itself, the other is the word frequency, separated by a space
|
||||
* Example:
|
||||
|
||||
云计算 5
|
||||
李小福 2
|
||||
创新办 3
|
||||
云计算 5
|
||||
李小福 2
|
||||
创新办 3
|
||||
|
||||
Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /

After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
|
||||
|
||||
Function 3): Keyword Extraction
|
||||
================
|
||||
@ -331,7 +347,7 @@ Function 3): Keyword Extraction
|
||||
|
||||
Code sample (keyword extraction)
|
||||
|
||||
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
|
||||
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
|
||||
|
||||
Using Other Dictionaries
|
||||
========
|
||||
|
@ -1,10 +1,97 @@
|
||||
1号店 3 n
|
||||
1號店 3 n
|
||||
4S店 3 n
|
||||
4s店 3 n
|
||||
AA制 3 n
|
||||
AB型 3 n
|
||||
AT&T 3 nz
|
||||
A型 3 n
|
||||
A座 3 n
|
||||
A股 3 n
|
||||
A輪 3 n
|
||||
A轮 3 n
|
||||
BB机 3 n
|
||||
BB機 3 n
|
||||
BP机 3 n
|
||||
BP機 3 n
|
||||
B型 3 n
|
||||
B座 3 n
|
||||
B股 3 n
|
||||
B超 3 n
|
||||
B輪 3 n
|
||||
B轮 3 n
|
||||
C# 3 nz
|
||||
C++ 3 nz
|
||||
CALL机 3 n
|
||||
CALL機 3 n
|
||||
CD机 3 n
|
||||
CD機 3 n
|
||||
CD盒 3 n
|
||||
C座 3 n
|
||||
C盘 3 n
|
||||
C盤 3 n
|
||||
C語言 3 n
|
||||
C语言 3 n
|
||||
D座 3 n
|
||||
D版 3 n
|
||||
D盘 3 n
|
||||
D盤 3 n
|
||||
E化 3 n
|
||||
E座 3 n
|
||||
E盘 3 n
|
||||
E盤 3 n
|
||||
E通 3 n
|
||||
F座 3 n
|
||||
F盘 3 n
|
||||
F盤 3 n
|
||||
G盘 3 n
|
||||
G盤 3 n
|
||||
H盘 3 n
|
||||
H盤 3 n
|
||||
H股 3 n
|
||||
IC卡 3 n
|
||||
IP卡 3 n
|
||||
IP地址 3 n
|
||||
IP电话 3 n
|
||||
IP電話 3 n
|
||||
I盘 3 n
|
||||
I盤 3 n
|
||||
K党 3 n
|
||||
K歌之王 3 n
|
||||
K黨 3 n
|
||||
N年 3 n
|
||||
O型 3 n
|
||||
PC机 3 n
|
||||
PC機 3 n
|
||||
PH值 3 n
|
||||
QQ号 3 n
|
||||
QQ號 3 n
|
||||
Q版 3 n
|
||||
RSS訂閱 3 n
|
||||
RSS订阅 3 n
|
||||
SIM卡 3 n
|
||||
T台 3 n
|
||||
T型台 3 n
|
||||
T型臺 3 n
|
||||
T恤 4 n
|
||||
T恤衫 3 n
|
||||
T盘 3 n
|
||||
T盤 3 n
|
||||
T臺 3 n
|
||||
U盘 3 n
|
||||
U盤 3 n
|
||||
VISA卡 3 n
|
||||
X光 3 n
|
||||
X光線 3 n
|
||||
X光线 3 n
|
||||
X射線 3 n
|
||||
X射线 3 n
|
||||
Z盘 3 n
|
||||
Z盤 3 n
|
||||
c# 3 nz
|
||||
c++ 3 nz
|
||||
γ射線 3 n
|
||||
γ射线 3 n
|
||||
䰾 7 zg
|
||||
䲁 17 zg
|
||||
䴉 22 zg
|
||||
@ -147622,6 +147709,7 @@ c++ 3 nz
|
||||
夥犯 3 n
|
||||
夥計 496 n
|
||||
大 144099 a
|
||||
大S 3 nr
|
||||
大一岁 3 m
|
||||
大一歲 3 m
|
||||
大一統 76 d
|
||||
@ -177464,6 +177552,7 @@ c++ 3 nz
|
||||
導電體 3 n
|
||||
導體 290 n
|
||||
小 57969 a
|
||||
小S 3 nr
|
||||
小三 3 nr
|
||||
小三儿 3 nr
|
||||
小三兒 3 nr
|
||||
@ -202633,7 +202722,6 @@ c++ 3 nz
|
||||
张利胜 64 nr
|
||||
张剑寒 4 nr
|
||||
张副将 2 nr
|
||||
张力 160 nr
|
||||
张力峰 3 nr
|
||||
张力维 3 nr
|
||||
张力计 3 nr
|
||||
@ -202652,7 +202740,6 @@ c++ 3 nz
|
||||
张匡邺 2 nr
|
||||
张十五 3 nr
|
||||
张千英 28 nr
|
||||
张华 140 nr
|
||||
张华便 2 nr
|
||||
张华婧 5 nr
|
||||
张华康 2 nr
|
||||
@ -204231,7 +204318,6 @@ c++ 3 nz
|
||||
張副將 2 nr
|
||||
張劉陳 2 nr
|
||||
張劍寒 4 nr
|
||||
張力 160 nr
|
||||
張力峯 3 nr
|
||||
張力維 3 nr
|
||||
張力計 3 nr
|
||||
@ -205191,7 +205277,6 @@ c++ 3 nz
|
||||
張茂淵 6 nr
|
||||
張莉霞 5 nr
|
||||
張莊村 4 nr
|
||||
張華 140 nr
|
||||
張華便 2 nr
|
||||
張華婧 5 nr
|
||||
張華康 2 nr
|
||||
@ -312439,6 +312524,8 @@ c++ 3 nz
|
||||
江华县 2 ns
|
||||
江华瑶族自治县 7 ns
|
||||
江南 4986 ns
|
||||
江南Style 3 n
|
||||
江南style 3 n
|
||||
江南一带 3 nz
|
||||
江南一帶 3 nz
|
||||
江南七怪 3 nz
|
||||
@ -535280,6 +535367,7 @@ c++ 3 nz
|
||||
阽 2 g
|
||||
阽危之域 3 ns
|
||||
阿 6905 j
|
||||
阿Q 3 n
|
||||
阿丁枫 4 nr
|
||||
阿丁楓 4 nr
|
||||
阿七 8 ns
|
||||
|
@ -1,4 +1,4 @@
|
||||
__version__ = '0.31'
|
||||
__version__ = '0.32'
|
||||
__license__ = 'MIT'
|
||||
|
||||
import re
|
||||
@ -6,7 +6,6 @@ import os
|
||||
import sys
|
||||
from . import finalseg
|
||||
import time
|
||||
|
||||
import tempfile
|
||||
import marshal
|
||||
from math import log
|
||||
@ -24,12 +23,12 @@ total =0.0
|
||||
user_word_tag_tab={}
|
||||
initialized = False
|
||||
|
||||
|
||||
log_console = logging.StreamHandler(sys.stderr)
|
||||
logger = logging.getLogger(__name__)
|
||||
logger.setLevel(logging.DEBUG)
|
||||
logger.addHandler(log_console)
|
||||
|
||||
|
||||
def setLogLevel(log_level):
|
||||
global logger
|
||||
logger.setLevel(log_level)
|
||||
@ -106,10 +105,9 @@ def initialize(*args):
|
||||
replace_file = os.rename
|
||||
replace_file(cache_file+tmp_suffix,cache_file)
|
||||
except:
|
||||
import traceback
|
||||
logger.error("dump cache file failed.")
|
||||
logger.exception("")
|
||||
#print(traceback.format_exc(),file=sys.stderr)
|
||||
|
||||
initialized = True
|
||||
|
||||
logger.debug("loading model cost %s seconds." % (time.time() - t1))
|
||||
@ -117,10 +115,10 @@ def initialize(*args):
|
||||
|
||||
|
||||
def require_initialized(fn):
|
||||
global initialized,DICTIONARY
|
||||
|
||||
@wraps(fn)
|
||||
def wrapped(*args, **kwargs):
|
||||
global initialized
|
||||
if initialized:
|
||||
return fn(*args, **kwargs)
|
||||
else:
|
||||
@ -179,6 +177,29 @@ def get_DAG(sentence):
|
||||
DAG[i] =[i]
|
||||
return DAG
|
||||
|
||||
def __cut_DAG_NO_HMM(sentence):
|
||||
re_eng = re.compile(r'[a-zA-Z0-9]',re.U)
|
||||
DAG = get_DAG(sentence)
|
||||
route ={}
|
||||
calc(sentence,DAG,0,route=route)
|
||||
x = 0
|
||||
N = len(sentence)
|
||||
buf = ''
|
||||
while x<N:
|
||||
y = route[x][1]+1
|
||||
l_word = sentence[x:y]
|
||||
if re_eng.match(l_word) and len(l_word)==1:
|
||||
buf += l_word
|
||||
x =y
|
||||
else:
|
||||
if len(buf)>0:
|
||||
yield buf
|
||||
buf = ''
|
||||
yield l_word
|
||||
x =y
|
||||
if len(buf)>0:
|
||||
yield buf
|
||||
buf = ''
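The function above can be read as two steps: pick the most probable route through the DAG, then walk it while gluing single ASCII letters and digits together. The sketch below shows both under assumptions: `best_route` is a hypothetical stand-in for `calc` (whose body is not shown in this diff), and `log_freq` and `min_log_freq` are toy inputs rather than jieba's `FREQ` table.

```python
import re

ASCII_ALNUM = re.compile(r"[a-zA-Z0-9]")


def best_route(sentence, dag, log_freq, min_log_freq):
    """Dynamic programming from right to left over the DAG.

    route[i] = (best log-probability of sentence[i:], end index of the
    first word on that best path).
    """
    n = len(sentence)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (log_freq.get(sentence[i:end + 1], min_log_freq) + route[end + 1][0], end)
            for end in dag[i]
        )
    return route


def cut_no_hmm(sentence, dag, log_freq, min_log_freq):
    """Walk the best route; glue runs of single ASCII letters/digits together."""
    route = best_route(sentence, dag, log_freq, min_log_freq)
    x, buf = 0, ""
    while x < len(sentence):
        y = route[x][1] + 1
        word = sentence[x:y]
        if len(word) == 1 and ASCII_ALNUM.match(word):
            buf += word
        else:
            if buf:
                yield buf
                buf = ""
            yield word
        x = y
    if buf:
        yield buf
```

Without the HMM step, characters that form no dictionary word simply come out one by one, except for ASCII runs, which the buffer keeps together.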
|
||||
|
||||
def __cut_DAG(sentence):
|
||||
DAG = get_DAG(sentence)
|
||||
@ -221,21 +242,31 @@ def __cut_DAG(sentence):
|
||||
for elem in buf:
|
||||
yield elem
|
||||
|
||||
def cut(sentence,cut_all=False):
|
||||
def cut(sentence,cut_all=False,HMM=True):
|
||||
'''The main function that segments an entire sentence that contains
Chinese characters into separated words.
Parameter:
- sentence: The String to be segmented
- cut_all: Mode. True means full mode, False means accurate mode.
- HMM: Whether to use the Hidden Markov Model.
'''
|
||||
if isinstance(sentence, bytes):
|
||||
try:
|
||||
sentence = sentence.decode('utf-8')
|
||||
except UnicodeDecodeError:
|
||||
sentence = sentence.decode('gbk','ignore')
|
||||
|
||||
|
||||
re_han, re_skip = re.compile("([\u4E00-\u9FA5a-zA-Z0-9+#&\._]+)"), re.compile("(\r\n|\s)")
|
||||
|
||||
'''
|
||||
\\u4E00-\\u9FA5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
|
||||
\r\n|\s : whitespace characters. Will not be Handled.
|
||||
'''
|
||||
re_han, re_skip = re.compile(r"([\u4E00-\u9FA5a-zA-Z0-9+#&\._]+)", re.U), re.compile(r"(\r\n|\s)")
|
||||
if cut_all:
|
||||
re_han, re_skip = re.compile("([\u4E00-\u9FA5]+)"), re.compile("[^a-zA-Z0-9+#\n]")
|
||||
|
||||
re_han, re_skip = re.compile(r"([\u4E00-\u9FA5]+)", re.U), re.compile(r"[^a-zA-Z0-9+#\n]")
|
||||
blocks = re_han.split(sentence)
|
||||
cut_block = __cut_DAG
|
||||
if HMM:
|
||||
cut_block = __cut_DAG
|
||||
else:
|
||||
cut_block = __cut_DAG_NO_HMM
|
||||
if cut_all:
|
||||
cut_block = __cut_all
|
||||
for blk in blocks:
|
||||
@ -255,8 +286,8 @@ def cut(sentence,cut_all=False):
|
||||
else:
|
||||
yield x
|
||||
|
||||
def cut_for_search(sentence):
|
||||
words = cut(sentence)
|
||||
def cut_for_search(sentence,HMM=True):
|
||||
words = cut(sentence,HMM=HMM)
|
||||
for w in words:
|
||||
if len(w)>2:
|
||||
for i in range(len(w)-1):
|
||||
@ -272,8 +303,17 @@ def cut_for_search(sentence):
|
||||
|
||||
@require_initialized
|
||||
def load_userdict(f):
|
||||
''' Load a personalized dictionary to improve the detection rate.
Parameter:
- f : A plain text file containing words and their occurrences.
Structure of dict file:
word1 freq1 word_type1
word2 freq2 word_type2
...
Word type may be omitted
'''
|
||||
global trie,total,FREQ
|
||||
if isinstance(f, (str, )):
|
||||
if isinstance(f, str):
|
||||
f = open(f, 'rb')
|
||||
content = f.read().decode('utf-8')
|
||||
line_no = 0
|
||||
@ -282,6 +322,7 @@ def load_userdict(f):
|
||||
if line.rstrip()=='': continue
|
||||
tup =line.split(" ")
|
||||
word,freq = tup[0],tup[1]
|
||||
if freq.isdigit() is False: continue
|
||||
if line_no==1:
|
||||
word = word.replace('\ufeff',"") #remove bom flag if it exists
|
||||
if len(tup)==3:
|
||||
@ -308,6 +349,8 @@ __ref_cut_for_search = cut_for_search
|
||||
|
||||
def __lcut(sentence):
|
||||
return list(__ref_cut(sentence,False))
|
||||
def __lcut_no_hmm(sentence):
|
||||
return list(__ref_cut(sentence,False,False))
|
||||
def __lcut_all(sentence):
|
||||
return list(__ref_cut(sentence,True))
|
||||
def __lcut_for_search(sentence):
|
||||
@ -326,18 +369,21 @@ def enable_parallel(processnum=None):
|
||||
processnum = cpu_count()
|
||||
pool = Pool(processnum)
|
||||
|
||||
def pcut(sentence,cut_all=False):
|
||||
parts = re.compile(b'([\r\n]+)').split(sentence)
|
||||
def pcut(sentence,cut_all=False,HMM=True):
|
||||
parts = re.compile('([\r\n]+)').split(sentence)
|
||||
if cut_all:
|
||||
result = pool.map(__lcut_all,parts)
|
||||
result = pool.map(__lcut_all,parts)
|
||||
else:
|
||||
result = pool.map(__lcut,parts)
|
||||
if HMM:
|
||||
result = pool.map(__lcut,parts)
|
||||
else:
|
||||
result = pool.map(__lcut_no_hmm,parts)
|
||||
for r in result:
|
||||
for w in r:
|
||||
yield w
|
||||
|
||||
def pcut_for_search(sentence):
|
||||
parts = re.compile(b'([\r\n]+)').split(sentence)
|
||||
parts = re.compile('([\r\n]+)').split(sentence)
|
||||
result = pool.map(__lcut_for_search,parts)
|
||||
for r in result:
|
||||
for w in r:
|
||||
@ -359,7 +405,7 @@ def set_dictionary(dictionary_path):
|
||||
with DICT_LOCK:
|
||||
abs_path = os.path.normpath( os.path.join( os.getcwd(), dictionary_path ) )
|
||||
if not os.path.exists(abs_path):
|
||||
raise Exception("jieba: path does not exists:" + abs_path)
|
||||
raise Exception("jieba: path does not exist:" + abs_path)
|
||||
DICTIONARY = abs_path
|
||||
initialized = False
|
||||
|
||||
@ -368,18 +414,18 @@ def get_abs_path_dict():
|
||||
abs_path = os.path.join(_curpath,DICTIONARY)
|
||||
return abs_path
|
||||
|
||||
def tokenize(unicode_sentence,mode="default"):
|
||||
def tokenize(unicode_sentence,mode="default",HMM=True):
|
||||
#mode ("default" or "search")
|
||||
if not isinstance(unicode_sentence, str):
|
||||
raise Exception("jieba: the input parameter should unicode.")
|
||||
raise Exception("jieba: the input parameter should be str.")
|
||||
start = 0
|
||||
if mode=='default':
|
||||
for w in cut(unicode_sentence):
|
||||
for w in cut(unicode_sentence,HMM=HMM):
|
||||
width = len(w)
|
||||
yield (w,start,start+width)
|
||||
start+=width
|
||||
else:
|
||||
for w in cut(unicode_sentence):
|
||||
for w in cut(unicode_sentence,HMM=HMM):
|
||||
width = len(w)
|
||||
if len(w)>2:
|
||||
for i in range(len(w)-1):
|
||||
@ -393,3 +439,4 @@ def tokenize(unicode_sentence,mode="default"):
|
||||
yield (gram3,start+i,start+i+3)
|
||||
yield (w,start,start+width)
|
||||
start+=width
|
||||
|
||||
|
@ -1,6 +1,5 @@
|
||||
import jieba
|
||||
import os
|
||||
|
||||
try:
|
||||
from .analyzer import ChineseAnalyzer
|
||||
except ImportError:
|
||||
|
@ -1,6 +1,7 @@
|
||||
#encoding=utf-8
|
||||
from whoosh.analysis import RegexAnalyzer,LowercaseFilter,StopFilter
|
||||
from whoosh.analysis import Tokenizer,Token
|
||||
from whoosh.analysis import RegexAnalyzer,LowercaseFilter,StopFilter,StemFilter
|
||||
from whoosh.analysis import Tokenizer,Token
|
||||
from whoosh.lang.porter import stem
|
||||
|
||||
import jieba
|
||||
import re
|
||||
@ -13,7 +14,6 @@ STOP_WORDS = frozenset(('a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'can',
|
||||
|
||||
accepted_chars = re.compile(r"[\u4E00-\u9FA5]+")
|
||||
|
||||
|
||||
class ChineseTokenizer(Tokenizer):
|
||||
def __call__(self,text,**kargs):
|
||||
words = jieba.tokenize(text,mode="search")
|
||||
@ -30,5 +30,6 @@ class ChineseTokenizer(Tokenizer):
|
||||
token.endchar = stop_pos
|
||||
yield token
|
||||
|
||||
def ChineseAnalyzer(stoplist=STOP_WORDS,minsize=1):
|
||||
return ChineseTokenizer() | LowercaseFilter() | StopFilter(stoplist=stoplist,minsize=minsize)
|
||||
def ChineseAnalyzer(stoplist=STOP_WORDS,minsize=1,stemfn=stem,cachesize=50000):
|
||||
return ChineseTokenizer() | LowercaseFilter() | StopFilter(stoplist=stoplist,minsize=minsize)\
|
||||
|StemFilter(stemfn=stemfn, ignore=None,cachesize=cachesize)
|
||||
|
@ -62,6 +62,8 @@ X射线 3 n
|
||||
T恤衫 3 n
|
||||
T型台 3 n
|
||||
T台 3 n
|
||||
4S店 3 n
|
||||
4s店 3 n
|
||||
江南style 3 n
|
||||
江南Style 3 n
|
||||
1号店 3 n
|
||||
@ -76270,6 +76272,7 @@ T台 3 n
|
||||
吉林人民出版社 3 nt
|
||||
吉林大学 34 nt
|
||||
吉林工业大学 3 nt
|
||||
吉林 89 ns
|
||||
吉林市 90 ns
|
||||
吉林敖东 4 nr
|
||||
吉林省 424 ns
|
||||
|
@ -25,13 +25,13 @@ def load_model():
|
||||
with open(abs_path, mode='rb') as f:
|
||||
start_p = marshal.load(f)
|
||||
f.closed
|
||||
|
||||
|
||||
trans_p = {}
|
||||
abs_path = os.path.join(_curpath, PROB_TRANS_P)
|
||||
with open(abs_path, 'rb') as f:
|
||||
trans_p = marshal.load(f)
|
||||
f.closed
|
||||
|
||||
|
||||
emit_p = {}
|
||||
abs_path = os.path.join(_curpath, PROB_EMIT_P)
|
||||
with open(abs_path, 'rb') as f:
|
||||
@ -61,9 +61,9 @@ def viterbi(obs, states, start_p, trans_p, emit_p):
|
||||
V[t][y] =prob
|
||||
newpath[y] = path[state] + [y]
|
||||
path = newpath
|
||||
|
||||
|
||||
(prob, state) = max([(V[len(obs) - 1][y], y) for y in ('E','S')])
|
||||
|
||||
|
||||
return (prob, path[state])
|
||||
|
||||
|
||||
@ -91,9 +91,7 @@ def cut(sentence):
|
||||
sentence = sentence.decode('utf-8')
|
||||
except:
|
||||
sentence = sentence.decode('gbk','ignore')
|
||||
|
||||
re_han, re_skip = re.compile("([\u4E00-\u9FA5]+)"), re.compile("(\d+\.\d+|[a-zA-Z0-9]+)")
|
||||
|
||||
re_han, re_skip = re.compile(r"([\u4E00-\u9FA5]+)"), re.compile(r"(\d+\.\d+|[a-zA-Z0-9]+)")
|
||||
blocks = re_han.split(sentence)
|
||||
for blk in blocks:
|
||||
if re_han.match(blk):
|
||||
|
File diff suppressed because it is too large
@ -4,6 +4,7 @@ from . import viterbi
|
||||
import jieba
|
||||
import sys
|
||||
import marshal
|
||||
from functools import wraps
|
||||
|
||||
default_encoding = sys.getfilesystemencoding()
|
||||
|
||||
@ -26,19 +27,19 @@ def load_model(f_name,isJython=True):
|
||||
f.closed
|
||||
if not isJython:
|
||||
return result
|
||||
|
||||
|
||||
start_p = {}
|
||||
abs_path = os.path.join(_curpath, PROB_START_P)
|
||||
with open(abs_path, mode='rb') as f:
|
||||
start_p = marshal.load(f)
|
||||
f.closed
|
||||
|
||||
|
||||
trans_p = {}
|
||||
abs_path = os.path.join(_curpath, PROB_TRANS_P)
|
||||
with open(abs_path, 'rb') as f:
|
||||
trans_p = marshal.load(f)
|
||||
f.closed
|
||||
|
||||
|
||||
emit_p = {}
|
||||
abs_path = os.path.join(_curpath, PROB_EMIT_P)
|
||||
with open(abs_path, 'rb') as f:
|
||||
@ -60,8 +61,16 @@ else:
|
||||
char_state_tab_P, start_P, trans_P, emit_P = char_state_tab.P, prob_start.P, prob_trans.P, prob_emit.P
|
||||
word_tag_tab = load_model(jieba.get_abs_path_dict(),isJython=False)
|
||||
|
||||
if jieba.user_word_tag_tab:
|
||||
word_tag_tab.update(jieba.user_word_tag_tab)
|
||||
def makesure_userdict_loaded(fn):
|
||||
|
||||
@wraps(fn)
|
||||
def wrapped(*args,**kwargs):
|
||||
if len(jieba.user_word_tag_tab)>0:
|
||||
word_tag_tab.update(jieba.user_word_tag_tab)
|
||||
jieba.user_word_tag_tab = {}
|
||||
return fn(*args,**kwargs)
|
||||
|
||||
return wrapped
|
||||
|
||||
class pair(object):
|
||||
def __init__(self,word,flag):
|
||||
@ -98,15 +107,13 @@ def __cut(sentence):
|
||||
yield pair(sentence[next:], pos_list[next][1] )
|
||||
|
||||
def __cut_detail(sentence):
|
||||
|
||||
re_han, re_skip = re.compile("([\u4E00-\u9FA5]+)"), re.compile("([\.0-9]+|[a-zA-Z0-9]+)")
|
||||
re_eng,re_num = re.compile("[a-zA-Z0-9]+"), re.compile("[\.0-9]+")
|
||||
|
||||
re_han, re_skip = re.compile(r"([\u4E00-\u9FA5]+)"), re.compile(r"([\.0-9]+|[a-zA-Z0-9]+)")
|
||||
re_eng,re_num = re.compile(r"[a-zA-Z0-9]+"), re.compile(r"[\.0-9]+")
|
||||
blocks = re_han.split(sentence)
|
||||
for blk in blocks:
|
||||
if re_han.match(blk):
|
||||
for word in __cut(blk):
|
||||
yield word
|
||||
for word in __cut(blk):
|
||||
yield word
|
||||
else:
|
||||
tmp = re_skip.split(blk)
|
||||
for x in tmp:
|
||||
@ -118,10 +125,34 @@ def __cut_detail(sentence):
|
||||
else:
|
||||
yield pair(x,'x')
|
||||
|
||||
def __cut_DAG_NO_HMM(sentence):
|
||||
DAG = jieba.get_DAG(sentence)
|
||||
route ={}
|
||||
jieba.calc(sentence,DAG,0,route=route)
|
||||
x = 0
|
||||
N = len(sentence)
|
||||
buf =''
|
||||
re_eng = re.compile(r'[a-zA-Z0-9]',re.U)
|
||||
while x<N:
|
||||
y = route[x][1]+1
|
||||
l_word = sentence[x:y]
|
||||
if re_eng.match(l_word) and len(l_word)==1:
|
||||
buf += l_word
|
||||
x = y
|
||||
else:
|
||||
if len(buf)>0:
|
||||
yield pair(buf,'eng')
|
||||
buf = ''
|
||||
yield pair(l_word,word_tag_tab.get(l_word,'x'))
|
||||
x =y
|
||||
if len(buf)>0:
|
||||
yield pair(buf,'eng')
|
||||
buf = ''
|
||||
|
||||
def __cut_DAG(sentence):
|
||||
DAG = jieba.get_DAG(sentence)
|
||||
route ={}
|
||||
|
||||
|
||||
jieba.calc(sentence,DAG,0,route=route)
|
||||
|
||||
x = 0
|
||||
@ -161,21 +192,24 @@ def __cut_DAG(sentence):
|
||||
for elem in buf:
|
||||
yield pair(elem,word_tag_tab.get(elem,'x'))
|
||||
|
||||
def __cut_internal(sentence):
|
||||
def __cut_internal(sentence,HMM=True):
|
||||
if not isinstance(sentence, str):
|
||||
try:
|
||||
sentence = sentence.decode('utf-8')
|
||||
except:
|
||||
sentence = sentence.decode('gbk','ignore')
|
||||
|
||||
re_han, re_skip = re.compile("([\u4E00-\u9FA5a-zA-Z0-9+#&\._]+)"), re.compile("(\r\n|\s)")
|
||||
re_eng,re_num = re.compile("[a-zA-Z0-9]+"), re.compile("[\.0-9]+")
|
||||
|
||||
re_han, re_skip = re.compile(r"([\u4E00-\u9FA5a-zA-Z0-9+#&\._]+)"), re.compile(r"(\r\n|\s)")
|
||||
re_eng,re_num = re.compile(r"[a-zA-Z0-9]+"), re.compile(r"[\.0-9]+")
|
||||
blocks = re_han.split(sentence)
|
||||
if HMM:
|
||||
__cut_blk = __cut_DAG
|
||||
else:
|
||||
__cut_blk = __cut_DAG_NO_HMM
|
||||
|
||||
for blk in blocks:
|
||||
if re_han.match(blk):
|
||||
for word in __cut_DAG(blk):
|
||||
yield word
|
||||
for word in __cut_blk(blk):
|
||||
yield word
|
||||
else:
|
||||
tmp = re_skip.split(blk)
|
||||
for x in tmp:
|
||||
@ -192,14 +226,21 @@ def __cut_internal(sentence):
|
||||
|
||||
def __lcut_internal(sentence):
|
||||
return list(__cut_internal(sentence))
|
||||
def __lcut_internal_no_hmm(sentence):
|
||||
return list(__cut_internal(sentence,False))
|
||||
|
||||
def cut(sentence):
|
||||
|
||||
@makesure_userdict_loaded
|
||||
def cut(sentence,HMM=True):
|
||||
if (not hasattr(jieba,'pool')) or (jieba.pool==None):
|
||||
for w in __cut_internal(sentence):
|
||||
for w in __cut_internal(sentence,HMM=HMM):
|
||||
yield w
|
||||
else:
|
||||
parts = re.compile('([\r\n]+)').split(sentence)
|
||||
result = jieba.pool.map(__lcut_internal,parts)
|
||||
if HMM:
|
||||
result = jieba.pool.map(__lcut_internal,parts)
|
||||
else:
|
||||
result = jieba.pool.map(__lcut_internal_no_hmm,parts)
|
||||
for r in result:
|
||||
for w in r:
|
||||
yield w
|
||||
|
File diff suppressed because it is too large
178554
jieba/posseg/prob_emit.py
File diff suppressed because it is too large
@ -1,5 +1,6 @@
|
||||
import operator
|
||||
MIN_FLOAT=-3.14e100
|
||||
MIN_INF=float("-inf")
|
||||
|
||||
def get_top_states(t_state_v,K=4):
|
||||
items = t_state_v.items()
|
||||
@ -16,16 +17,18 @@ def viterbi(obs, states, start_p, trans_p, emit_p):
|
||||
for t in range(1,len(obs)):
|
||||
V.append({})
|
||||
mem_path.append({})
|
||||
prev_states = get_top_states(V[t-1])
|
||||
#prev_states = get_top_states(V[t-1])
|
||||
prev_states =[ x for x in mem_path[t-1].keys() if len(trans_p[x])>0 ]
|
||||
|
||||
prev_states_expect_next = set( (y for x in prev_states for y in trans_p[x].keys() ) )
|
||||
obs_states = states.get(obs[t],all_states)
|
||||
obs_states = set(obs_states) & set(prev_states_expect_next)
|
||||
|
||||
if len(obs_states)==0: obs_states = prev_states_expect_next
|
||||
if len(obs_states)==0: obs_states = all_states
|
||||
|
||||
for y in obs_states:
|
||||
(prob,state ) = max([(V[t-1][y0] + trans_p[y0].get(y,MIN_FLOAT) + emit_p[y].get(obs[t],MIN_FLOAT) ,y0) for y0 in prev_states])
|
||||
(prob,state ) = max([(V[t-1][y0] + trans_p[y0].get(y,MIN_INF) + emit_p[y].get(obs[t],MIN_FLOAT) ,y0) for y0 in prev_states])
|
||||
V[t][y] =prob
|
||||
mem_path[t][y] = state
|
||||
|
||||
|
2
setup.py
@ -1,6 +1,6 @@
|
||||
from distutils.core import setup
|
||||
setup(name='jieba',
|
||||
version='0.31',
|
||||
version='0.32',
|
||||
description='Chinese Words Segementation Utilities',
|
||||
author='Sun, Junyi',
|
||||
author_email='ccnusjy@gmail.com',
|
||||
|
@ -28,5 +28,3 @@ content = open(file_name, 'rb').read()
|
||||
tags = jieba.analyse.extract_tags(content, topK=topK)
|
||||
|
||||
print(",".join(tags))
|
||||
|
||||
|
||||
|
@ -109,8 +109,8 @@ class JiebaTestCase(unittest.TestCase):
|
||||
assert isinstance(result, types.GeneratorType), "Test DefaultCut Generator error"
|
||||
result = list(result)
|
||||
assert isinstance(result, list), "Test DefaultCut error on content: %s" % content
|
||||
print(" , ".join(result),file=sys.stderr)
|
||||
print("testDefaultCut",file=sys.stderr)
|
||||
print(" , ".join(result), file=sys.stderr)
|
||||
print("testDefaultCut", file=sys.stderr)
|
||||
|
||||
def testCutAll(self):
|
||||
for content in test_contents:
|
||||
@ -119,7 +119,7 @@ class JiebaTestCase(unittest.TestCase):
|
||||
result = list(result)
|
||||
assert isinstance(result, list), "Test CutAll error on content: %s" % content
|
||||
print(" , ".join(result), file=sys.stderr)
|
||||
print("testCutAll",file=sys.stderr)
|
||||
print("testCutAll", file=sys.stderr)
|
||||
|
||||
def testSetDictionary(self):
|
||||
jieba.set_dictionary("foobar.txt")
|
||||
@ -129,7 +129,7 @@ class JiebaTestCase(unittest.TestCase):
|
||||
result = list(result)
|
||||
assert isinstance(result, list), "Test SetDictionary error on content: %s" % content
|
||||
print(" , ".join(result), file=sys.stderr)
|
||||
print("testSetDictionary",file=sys.stderr)
|
||||
print("testSetDictionary", file=sys.stderr)
|
||||
|
||||
def testCutForSearch(self):
|
||||
for content in test_contents:
|
||||
@ -138,7 +138,7 @@ class JiebaTestCase(unittest.TestCase):
|
||||
result = list(result)
|
||||
assert isinstance(result, list), "Test CutForSearch error on content: %s" % content
|
||||
print(" , ".join(result), file=sys.stderr)
|
||||
print("testCutForSearch",file=sys.stderr)
|
||||
print("testCutForSearch", file=sys.stderr)
|
||||
|
||||
def testPosseg(self):
|
||||
import jieba.posseg as pseg
|
||||
@ -147,8 +147,8 @@ class JiebaTestCase(unittest.TestCase):
|
||||
assert isinstance(result, types.GeneratorType), "Test Posseg Generator error"
|
||||
result = list(result)
|
||||
assert isinstance(result, list), "Test Posseg error on content: %s" % content
|
||||
print(" , ".join([w.word + " / " + w.flag for w in result]),file=sys.stderr)
|
||||
print("testPosseg",file=sys.stderr)
|
||||
print(" , ".join([w.word + " / " + w.flag for w in result]), file=sys.stderr)
|
||||
print("testPosseg", file=sys.stderr)
|
||||
|
||||
def testTokenize(self):
|
||||
for content in test_contents:
|
||||
@ -158,7 +158,45 @@ class JiebaTestCase(unittest.TestCase):
             assert isinstance(result, list), "Test Tokenize error on content: %s" % content
             for tk in result:
                 print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]), file=sys.stderr)
-        print("testTokenize",file=sys.stderr)
+        print("testTokenize", file=sys.stderr)
+
+    def testDefaultCut_NOHMM(self):
+        for content in test_contents:
+            result = jieba.cut(content,HMM=False)
+            assert isinstance(result, types.GeneratorType), "Test DefaultCut Generator error"
+            result = list(result)
+            assert isinstance(result, list), "Test DefaultCut error on content: %s" % content
+            print(" , ".join(result), file=sys.stderr)
+        print("testDefaultCut_NOHMM", file=sys.stderr)
+
+    def testPosseg_NOHMM(self):
+        import jieba.posseg as pseg
+        for content in test_contents:
+            result = pseg.cut(content,HMM=False)
+            assert isinstance(result, types.GeneratorType), "Test Posseg Generator error"
+            result = list(result)
+            assert isinstance(result, list), "Test Posseg error on content: %s" % content
+            print(" , ".join([w.word + " / " + w.flag for w in result]), file=sys.stderr)
+        print("testPosseg_NOHMM", file=sys.stderr)
+
+    def testTokenize_NOHMM(self):
+        for content in test_contents:
+            result = jieba.tokenize(content,HMM=False)
+            assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
+            result = list(result)
+            assert isinstance(result, list), "Test Tokenize error on content: %s" % content
+            for tk in result:
+                print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]), file=sys.stderr)
+        print("testTokenize_NOHMM", file=sys.stderr)
+
+    def testCutForSearch_NOHMM(self):
+        for content in test_contents:
+            result = jieba.cut_for_search(content,HMM=False)
+            assert isinstance(result, types.GeneratorType), "Test CutForSearch Generator error"
+            result = list(result)
+            assert isinstance(result, list), "Test CutForSearch error on content: %s" % content
+            print(" , ".join(result), file=sys.stderr)
+        print("testCutForSearch_NOHMM", file=sys.stderr)

 if __name__ == "__main__":
     unittest.main()
9
test/test_bug.py
Normal file
@ -0,0 +1,9 @@
+#encoding=utf-8
+import sys
+sys.path.append("../")
+import jieba
+import jieba.posseg as pseg
+words=pseg.cut("又跛又啞")
+for w in words:
+    print(w.word,w.flag)
+
100
test/test_no_hmm.py
Normal file
@ -0,0 +1,100 @@
+#encoding=utf-8
+import sys
+sys.path.append("../")
+import jieba
+
+
+def cuttest(test_sent):
+    result = jieba.cut(test_sent,HMM=False)
+    print(" / ".join(result))
+
+
+if __name__ == "__main__":
+    cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。")
+    cuttest("我不喜欢日本和服。")
+    cuttest("雷猴回归人间。")
+    cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
+    cuttest("我需要廉租房")
+    cuttest("永和服装饰品有限公司")
+    cuttest("我爱北京天安门")
+    cuttest("abc")
+    cuttest("隐马尔可夫")
+    cuttest("雷猴是个好网站")
+    cuttest("“Microsoft”一词由“MICROcomputer(微型计算机)”和“SOFTware(软件)”两部分组成")
+    cuttest("草泥马和欺实马是今年的流行词汇")
+    cuttest("伊藤洋华堂总府店")
+    cuttest("中国科学院计算技术研究所")
+    cuttest("罗密欧与朱丽叶")
+    cuttest("我购买了道具和服装")
+    cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
+    cuttest("湖北省石首市")
+    cuttest("湖北省十堰市")
+    cuttest("总经理完成了这件事情")
+    cuttest("电脑修好了")
+    cuttest("做好了这件事情就一了百了了")
+    cuttest("人们审美的观点是不同的")
+    cuttest("我们买了一个美的空调")
+    cuttest("线程初始化时我们要注意")
+    cuttest("一个分子是由好多原子组织成的")
+    cuttest("祝你马到功成")
+    cuttest("他掉进了无底洞里")
+    cuttest("中国的首都是北京")
+    cuttest("孙君意")
+    cuttest("外交部发言人马朝旭")
+    cuttest("领导人会议和第四届东亚峰会")
+    cuttest("在过去的这五年")
+    cuttest("还需要很长的路要走")
+    cuttest("60周年首都阅兵")
+    cuttest("你好人们审美的观点是不同的")
+    cuttest("买水果然后来世博园")
+    cuttest("买水果然后去世博园")
+    cuttest("但是后来我才知道你是对的")
+    cuttest("存在即合理")
+    cuttest("的的的的的在的的的的就以和和和")
+    cuttest("I love你,不以为耻,反以为rong")
+    cuttest("因")
+    cuttest("")
+    cuttest("hello你好人们审美的观点是不同的")
+    cuttest("很好但主要是基于网页形式")
+    cuttest("hello你好人们审美的观点是不同的")
+    cuttest("为什么我不能拥有想要的生活")
+    cuttest("后来我才")
+    cuttest("此次来中国是为了")
+    cuttest("使用了它就可以解决一些问题")
+    cuttest(",使用了它就可以解决一些问题")
+    cuttest("其实使用了它就可以解决一些问题")
+    cuttest("好人使用了它就可以解决一些问题")
+    cuttest("是因为和国家")
+    cuttest("老年搜索还支持")
+    cuttest("干脆就把那部蒙人的闲法给废了拉倒!RT @laoshipukong : 27日,全国人大常委会第三次审议侵权责任法草案,删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
+    cuttest("大")
+    cuttest("")
+    cuttest("他说的确实在理")
+    cuttest("长春市长春节讲话")
+    cuttest("结婚的和尚未结婚的")
+    cuttest("结合成分子时")
+    cuttest("旅游和服务是最好的")
+    cuttest("这件事情的确是我的错")
+    cuttest("供大家参考指正")
+    cuttest("哈尔滨政府公布塌桥原因")
+    cuttest("我在机场入口处")
+    cuttest("邢永臣摄影报道")
+    cuttest("BP神经网络如何训练才能在分类时增加区分度?")
+    cuttest("南京市长江大桥")
+    cuttest("应一些使用者的建议,也为了便于利用NiuTrans用于SMT研究")
+    cuttest('长春市长春药店')
+    cuttest('邓颖超生前最喜欢的衣服')
+    cuttest('胡锦涛是热爱世界和平的政治局常委')
+    cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
+    cuttest('一次性交多少钱')
+    cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
+    cuttest('小和尚留了一个像大和尚一样的和尚头')
+    cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
+    cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
+    cuttest('AT&T是一件不错的公司,给你发offer了吗?')
+    cuttest('C++和c#是什么关系?11+122=133,是吗?PI=3.14159')
+    cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
+    cuttest('枪杆子中出政权')
+    cuttest('张三风同学走上了不归路')
+    cuttest('阿Q腰间挂着BB机手里拿着大哥大,说:我一般吃饭不AA制的。')
+    cuttest('在1号店能买到小S和大S八卦的书,还有3D电视。')
@ -95,3 +95,4 @@ if __name__ == "__main__":
     cuttest('AT&T是一件不错的公司,给你发offer了吗?')
     cuttest('C++和c#是什么关系?11+122=133,是吗?PI=3.14159')
     cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
+    cuttest('枪杆子中出政权')
98
test/test_pos_no_hmm.py
Normal file
@ -0,0 +1,98 @@
+#encoding=utf-8
+import sys
+sys.path.append("../")
+import jieba.posseg as pseg
+
+def cuttest(test_sent):
+    result = pseg.cut(test_sent,HMM=False)
+    for w in result:
+        print(w.word, "/", w.flag, ", ", end=' ')
+    print("")
+
+
+if __name__ == "__main__":
+    cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。")
+    cuttest("我不喜欢日本和服。")
+    cuttest("雷猴回归人间。")
+    cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
+    cuttest("我需要廉租房")
+    cuttest("永和服装饰品有限公司")
+    cuttest("我爱北京天安门")
+    cuttest("abc")
+    cuttest("隐马尔可夫")
+    cuttest("雷猴是个好网站")
+    cuttest("“Microsoft”一词由“MICROcomputer(微型计算机)”和“SOFTware(软件)”两部分组成")
+    cuttest("草泥马和欺实马是今年的流行词汇")
+    cuttest("伊藤洋华堂总府店")
+    cuttest("中国科学院计算技术研究所")
+    cuttest("罗密欧与朱丽叶")
+    cuttest("我购买了道具和服装")
+    cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
+    cuttest("湖北省石首市")
+    cuttest("湖北省十堰市")
+    cuttest("总经理完成了这件事情")
+    cuttest("电脑修好了")
+    cuttest("做好了这件事情就一了百了了")
+    cuttest("人们审美的观点是不同的")
+    cuttest("我们买了一个美的空调")
+    cuttest("线程初始化时我们要注意")
+    cuttest("一个分子是由好多原子组织成的")
+    cuttest("祝你马到功成")
+    cuttest("他掉进了无底洞里")
+    cuttest("中国的首都是北京")
+    cuttest("孙君意")
+    cuttest("外交部发言人马朝旭")
+    cuttest("领导人会议和第四届东亚峰会")
+    cuttest("在过去的这五年")
+    cuttest("还需要很长的路要走")
+    cuttest("60周年首都阅兵")
+    cuttest("你好人们审美的观点是不同的")
+    cuttest("买水果然后来世博园")
+    cuttest("买水果然后去世博园")
+    cuttest("但是后来我才知道你是对的")
+    cuttest("存在即合理")
+    cuttest("的的的的的在的的的的就以和和和")
+    cuttest("I love你,不以为耻,反以为rong")
+    cuttest("因")
+    cuttest("")
+    cuttest("hello你好人们审美的观点是不同的")
+    cuttest("很好但主要是基于网页形式")
+    cuttest("hello你好人们审美的观点是不同的")
+    cuttest("为什么我不能拥有想要的生活")
+    cuttest("后来我才")
+    cuttest("此次来中国是为了")
+    cuttest("使用了它就可以解决一些问题")
+    cuttest(",使用了它就可以解决一些问题")
+    cuttest("其实使用了它就可以解决一些问题")
+    cuttest("好人使用了它就可以解决一些问题")
+    cuttest("是因为和国家")
+    cuttest("老年搜索还支持")
+    cuttest("干脆就把那部蒙人的闲法给废了拉倒!RT @laoshipukong : 27日,全国人大常委会第三次审议侵权责任法草案,删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
+    cuttest("大")
+    cuttest("")
+    cuttest("他说的确实在理")
+    cuttest("长春市长春节讲话")
+    cuttest("结婚的和尚未结婚的")
+    cuttest("结合成分子时")
+    cuttest("旅游和服务是最好的")
+    cuttest("这件事情的确是我的错")
+    cuttest("供大家参考指正")
+    cuttest("哈尔滨政府公布塌桥原因")
+    cuttest("我在机场入口处")
+    cuttest("邢永臣摄影报道")
+    cuttest("BP神经网络如何训练才能在分类时增加区分度?")
+    cuttest("南京市长江大桥")
+    cuttest("应一些使用者的建议,也为了便于利用NiuTrans用于SMT研究")
+    cuttest('长春市长春药店')
+    cuttest('邓颖超生前最喜欢的衣服')
+    cuttest('胡锦涛是热爱世界和平的政治局常委')
+    cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
+    cuttest('一次性交多少钱')
+    cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
+    cuttest('小和尚留了一个像大和尚一样的和尚头')
+    cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
+    cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
+    cuttest('AT&T是一件不错的公司,给你发offer了吗?')
+    cuttest('C++和c#是什么关系?11+122=133,是吗?PI=3.14159')
+    cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
+    cuttest('枪杆子中出政权')
105
test/test_tokenize_no_hmm.py
Normal file
@ -0,0 +1,105 @@
+#encoding=utf-8
+import sys
+sys.path.append("../")
+import jieba
+
+g_mode="default"
+
+def cuttest(test_sent):
+    global g_mode
+    result = jieba.tokenize(test_sent,mode=g_mode,HMM=False)
+    for tk in result:
+        print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
+
+
+if __name__ == "__main__":
+    for m in ("default","search"):
+        g_mode = m
+        cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。")
+        cuttest("我不喜欢日本和服。")
+        cuttest("雷猴回归人间。")
+        cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
+        cuttest("我需要廉租房")
+        cuttest("永和服装饰品有限公司")
+        cuttest("我爱北京天安门")
+        cuttest("abc")
+        cuttest("隐马尔可夫")
+        cuttest("雷猴是个好网站")
+        cuttest("“Microsoft”一词由“MICROcomputer(微型计算机)”和“SOFTware(软件)”两部分组成")
+        cuttest("草泥马和欺实马是今年的流行词汇")
+        cuttest("伊藤洋华堂总府店")
+        cuttest("中国科学院计算技术研究所")
+        cuttest("罗密欧与朱丽叶")
+        cuttest("我购买了道具和服装")
+        cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
+        cuttest("湖北省石首市")
+        cuttest("湖北省十堰市")
+        cuttest("总经理完成了这件事情")
+        cuttest("电脑修好了")
+        cuttest("做好了这件事情就一了百了了")
+        cuttest("人们审美的观点是不同的")
+        cuttest("我们买了一个美的空调")
+        cuttest("线程初始化时我们要注意")
+        cuttest("一个分子是由好多原子组织成的")
+        cuttest("祝你马到功成")
+        cuttest("他掉进了无底洞里")
+        cuttest("中国的首都是北京")
+        cuttest("孙君意")
+        cuttest("外交部发言人马朝旭")
+        cuttest("领导人会议和第四届东亚峰会")
+        cuttest("在过去的这五年")
+        cuttest("还需要很长的路要走")
+        cuttest("60周年首都阅兵")
+        cuttest("你好人们审美的观点是不同的")
+        cuttest("买水果然后来世博园")
+        cuttest("买水果然后去世博园")
+        cuttest("但是后来我才知道你是对的")
+        cuttest("存在即合理")
+        cuttest("的的的的的在的的的的就以和和和")
+        cuttest("I love你,不以为耻,反以为rong")
+        cuttest("因")
+        cuttest("")
+        cuttest("hello你好人们审美的观点是不同的")
+        cuttest("很好但主要是基于网页形式")
+        cuttest("hello你好人们审美的观点是不同的")
+        cuttest("为什么我不能拥有想要的生活")
+        cuttest("后来我才")
+        cuttest("此次来中国是为了")
+        cuttest("使用了它就可以解决一些问题")
+        cuttest(",使用了它就可以解决一些问题")
+        cuttest("其实使用了它就可以解决一些问题")
+        cuttest("好人使用了它就可以解决一些问题")
+        cuttest("是因为和国家")
+        cuttest("老年搜索还支持")
+        cuttest("干脆就把那部蒙人的闲法给废了拉倒!RT @laoshipukong : 27日,全国人大常委会第三次审议侵权责任法草案,删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
+        cuttest("大")
+        cuttest("")
+        cuttest("他说的确实在理")
+        cuttest("长春市长春节讲话")
+        cuttest("结婚的和尚未结婚的")
+        cuttest("结合成分子时")
+        cuttest("旅游和服务是最好的")
+        cuttest("这件事情的确是我的错")
+        cuttest("供大家参考指正")
+        cuttest("哈尔滨政府公布塌桥原因")
+        cuttest("我在机场入口处")
+        cuttest("邢永臣摄影报道")
+        cuttest("BP神经网络如何训练才能在分类时增加区分度?")
+        cuttest("南京市长江大桥")
+        cuttest("应一些使用者的建议,也为了便于利用NiuTrans用于SMT研究")
+        cuttest('长春市长春药店')
+        cuttest('邓颖超生前最喜欢的衣服')
+        cuttest('胡锦涛是热爱世界和平的政治局常委')
+        cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
+        cuttest('一次性交多少钱')
+        cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
+        cuttest('小和尚留了一个像大和尚一样的和尚头')
+        cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
+        cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
+        cuttest('AT&T是一件不错的公司,给你发offer了吗?')
+        cuttest('C++和c#是什么关系?11+122=133,是吗?PI=3.14159')
+        cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
+        cuttest('枪杆子中出政权')
+        cuttest('张三风同学走上了不归路')
+        cuttest('阿Q腰间挂着BB机手里拿着大哥大,说:我一般吃饭不AA制的。')
+        cuttest('在1号店能买到小S和大S八卦的书。')
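Note on the tokenize tests: each token is printed as a (word, start, end) triple. As a rough illustration of how such offsets relate to a plain word sequence, the sketch below derives them by accumulating word lengths. `words_to_spans` is a hypothetical helper written for this note, not part of jieba, and it assumes the words concatenate back to the original sentence with no gaps or overlaps.

```python
def words_to_spans(words):
    """Yield (word, start, end) triples for consecutive words.

    Hypothetical helper for illustration only; it is not jieba's
    implementation. It assumes the words, joined in order, reproduce
    the original sentence, so each word starts where the previous ended.
    """
    start = 0
    for word in words:
        end = start + len(word)  # end offset is exclusive
        yield (word, start, end)
        start = end


if __name__ == "__main__":
    # A hand-written segmentation, not actual jieba output.
    for tk in words_to_spans(["我", "爱", "北京", "天安门"]):
        print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
```

The "search" mode exercised by the test file can emit overlapping tokens, which this simple accumulation does not model.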
@ -5,7 +5,7 @@ import jieba
 jieba.load_userdict("userdict.txt")
 import jieba.posseg as pseg

-test_sent = "李小福是创新办主任也是云计算方面的专家;"
+test_sent = "李小福是创新办主任也是云计算方面的专家; 什么是八一双鹿"
 test_sent += "例如我输入一个带“韩玉赏鉴”的标题,在自定义词库中也增加了此词为N类型"
 words = jieba.cut(test_sent)
 for w in words:
@ -14,7 +14,7 @@ if not os.path.exists("tmp"):
     os.mkdir("tmp")

 ix = create_in("tmp", schema) # for create new index
-#ix = open_dir("tmp", schema=schema) # for read only
+#ix = open_dir("tmp") # for read only
 writer = ix.writer()

 writer.add_document(
@ -3,4 +3,5 @@
 创新办 3 i
 easy_install 3 eng
 好用 300
-韩玉赏鉴 3 nz
+韩玉赏鉴 3 nz
+八一双鹿 3 nz
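Note on the dictionary entries in this hunk: they follow a `word frequency [tag]` layout, with the tag optional (compare `好用 300` with `创新办 3 i`). A minimal sketch of parsing such lines is given below. `parse_userdict_line` is a hypothetical helper for illustration, not jieba's own loader, and it assumes the fields are separated by whitespace.

```python
def parse_userdict_line(line):
    """Split one dictionary line into (word, freq, tag).

    Hypothetical helper, not jieba's loader. Assumes whitespace-separated
    fields: a word, an integer frequency, and an optional part-of-speech tag.
    """
    parts = line.split()
    if len(parts) == 2:
        word, freq = parts
        tag = None  # no tag given, as in "好用 300"
    elif len(parts) == 3:
        word, freq, tag = parts
    else:
        raise ValueError("expected 'word freq [tag]', got %r" % line)
    return word, int(freq), tag


if __name__ == "__main__":
    for entry in ["创新办 3 i", "好用 300", "八一双鹿 3 nz"]:
        print(parse_userdict_line(entry))
```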