Merge pull request #100 from ZoeyYoung/jieba3k

Jieba3k
Sun Junyi 2013-08-21 00:50:47 -07:00
commit d16727ba89
20 changed files with 308 additions and 139 deletions

.gitignore (vendored): 4 changes

@@ -164,3 +164,7 @@ pip-log.txt
*.log
test/tmp/*
#jython
*.class
MANIFEST

Changelog

@@ -1,3 +1,20 @@
2013-07-01: version 0.31
1. Reformatted the code indentation to follow the PEP 8 standard
2. Added support for the Jython interpreter; thanks @piaolingxue
3. Fixed a bug where mixed Chinese-English words starting with digits were not recognized
4. Refactored part of the code; thanks @chao78787
5. Parallel segmentation mode now detects the number of CPUs automatically and picks a suitable process count (a short sketch follows this entry); thanks @linkerlin
6. Fixed the incorrect dependency of the extract_tags method on the whoosh module introduced in version 0.30
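Item 5 in practice, as a minimal sketch: calling enable_parallel() with no argument is assumed to pick the process count via multiprocessing.cpu_count(), which is what the updated enable_parallel further down in this diff does. Raw bytes are passed because the parallel code path in this version splits its input with a bytes pattern, as test/parallel/test_file.py does.

```python
import jieba

jieba.enable_parallel()   # no argument: process count = detected CPU count

# bytes input, mirroring test/parallel/test_file.py
text = "我来到北京清华大学\n他来到了网易杭研大厦\n".encode("utf-8")
print("/ ".join(jieba.cut(text)))

jieba.disable_parallel()
```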
2013-07-01: version 0.30
==========================
1) Added the jieba.tokenize method, which returns each word's position in the original text
2) Added ChineseAnalyzer to support the whoosh search engine
3) Added more mixed Chinese-English words to the dictionary
4) Changed how some .py files are loaded so that py2exe and cx_Freeze can package jieba into an exe
2013-06-17: version 0.29.1
==========================
1) Optimized the viterbi algorithm code; segmentation speed improved by 15%
@@ -25,8 +42,8 @@
2013-04-27: version 0.28
========================
1) Added lazy loading of the dictionary; the dictionary path can be changed after 'import jieba'. Thanks hermanschaaf
2) The offending entry is now reported when the dictionary fails to load. Thanks neuront
3) Fixed a bug where a dictionary edited with vim failed to load. Thanks neuront
2013-04-22: version 0.27
========================
@@ -63,7 +80,7 @@
2012-11-28: version 0.22
========================
1) Added the jieba.cut_for_search method, which further splits "long words" on top of accurate mode; it is intended for search-engine indexing and has higher recall than accurate mode.
2) Started supporting Python 3.x (previously only Python 2.x was supported); from this version on there is a separate jieba3k branch.
2012-11-23: version 0.21
@@ -74,7 +91,7 @@
2012-11-06: version 0.20
========================
1) Added part-of-speech tagging
2012-10-25: version 0.19

LICENSE (new file): 20 lines

@@ -0,0 +1,20 @@
The MIT License (MIT)
Copyright (c) 2013 Sun Junyi
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

MANIFEST.in (new file): 2 lines

@@ -0,0 +1,2 @@
graft README.md
graft Changelog

README.md: 176 changes

@@ -14,9 +14,9 @@ jieba
Features
========
* Three segmentation modes are supported:
    * Accurate mode: tries to cut the sentence into the most precise segmentation, suitable for text analysis;
    * Full mode: scans out every word that could possibly form a word, very fast, but cannot resolve ambiguity;
    * Search-engine mode: on top of accurate mode, long words are cut again to improve recall, suitable for search-engine indexing (a short sketch follows this list).
* Traditional Chinese segmentation is supported
* Custom dictionaries are supported
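A minimal sketch of the three modes; the sample sentences are the ones used in the code example further down in this README:

```python
# -*- coding: utf-8 -*-
import jieba

sentence = "我来到北京清华大学"

print("Full Mode: " + "/ ".join(jieba.cut(sentence, cut_all=True)))       # full mode
print("Accurate Mode: " + "/ ".join(jieba.cut(sentence, cut_all=False)))  # accurate (default) mode
print("Search Mode: " + ", ".join(
    jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")))
```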
@@ -29,19 +29,31 @@ http://jiebademo.ap01.aws.af.cm/
(Powered by Appfog)

Demo site source code: https://github.com/fxsjy/jiebademo

-Python Version
-==============
-* The master branch currently supports only Python 2.x
-* The Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k

-Usage
-========
Installation under Python 2.x
===================
* Fully automatic installation: `easy_install jieba` or `pip install jieba`
* Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ , extract it, then run `python setup.py install`
* Manual installation: place the jieba directory in the current directory or in site-packages
* Load it with `import jieba`; the first import builds the Trie, which takes a few seconds

Installation under Python 3.x
====================
* The master branch currently supports only Python 2.x
* The Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k

      git clone https://github.com/fxsjy/jieba.git
      git checkout jieba3k
      python setup.py install

Jieba for Java
================
Author: piaolingxue
Repository: https://github.com/huaban/jieba-analysis
Algorithm
========
* A Trie-based structure enables efficient word-graph scanning, producing a directed acyclic graph (DAG) of all possible word formations over the Chinese characters in a sentence
@@ -76,13 +88,13 @@ Algorithm
Output:

[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Accurate Mode]: 我/ 来到/ 北京/ 清华大学

[New Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦    ("杭研" is not in the dictionary, but it is recognized by the Viterbi algorithm)

[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
Feature 2): Adding a custom dictionary
================
@@ -92,16 +104,16 @@ Output:
* The dictionary format is the same as that of `dict.txt`: one word per line; each line has three parts separated by spaces: the word, its frequency, and optionally its part of speech
* Example:
    * Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
    * Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py (a minimal sketch also follows this list)
    * Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
    * After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
* "Improving ambiguity resolution with a user-defined dictionary" --- https://github.com/fxsjy/jieba/issues/14
Feature 3): Keyword extraction
@@ -112,36 +124,80 @@ Output:
Code example (keyword extraction); a hedged sketch follows:

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
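A hedged sketch of keyword extraction in the spirit of test/extract_tags.py; the topK parameter (number of keywords to return) is assumed to be available on jieba.analyse.extract_tags in this version:

```python
# -*- coding: utf-8 -*-
import jieba.analyse

text = "小明硕士毕业于中国科学院计算所,后在日本京都大学深造"

# extract the top 5 keywords ranked by TF-IDF weight
tags = jieba.analyse.extract_tags(text, topK=5)
print(", ".join(tags))
```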
Feature 4): Part-of-speech tagging
================
* Tags each word of the segmented sentence with its part of speech, using ICTCLAS-compatible tags
* Usage example:
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for w in words:
...     print w.word, w.flag
...
我 r
爱 v
北京 ns
天安门 ns
Feature 5): Parallel segmentation
==================
* Principle: split the target text by line, hand the lines to several Python processes to segment in parallel, then merge the results, giving a considerable speed-up
* Built on Python's standard multiprocessing module; Windows is currently not supported
* Usage:
    * `jieba.enable_parallel(4)`  # enable parallel mode; the argument is the number of worker processes
    * `jieba.disable_parallel()`  # disable parallel mode
* Example (a sketch also follows this list):
    https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
* Result: on a 4-core 3.4 GHz Linux machine, accurate-mode segmentation of the complete works of Jin Yong reached 1 MB/s, 3.3 times the single-process speed.
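Roughly what test/parallel/test_file.py (updated later in this diff) does; the input file path comes from the command line and should point to a large UTF-8 text file:

```python
import sys
import time
import jieba

jieba.enable_parallel()                 # process count auto-detected

with open(sys.argv[1], "rb") as f:      # the file to segment
    content = f.read()

t1 = time.time()
words = "/ ".join(jieba.cut(content))
tm_cost = time.time() - t1

with open("1.log", "wb") as log_f:
    log_f.write(words.encode("utf-8"))

print("cost", tm_cost)
print("speed", len(content) / tm_cost, "bytes/second")
```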
Feature 6): Tokenize, returning each word's start and end position in the original text
============================================
* Note: the input must be unicode
* Default mode
```python
result = jieba.tokenize('永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
```
```
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限公司 start: 6 end:10
```
* Search mode
```python
result = jieba.tokenize('永和服装饰品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
```
```
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限 start: 6 end:8
word 公司 start: 8 end:10
word 有限公司 start: 6 end:10
```
Feature 7): ChineseAnalyzer for the Whoosh search engine
============================================
* Import: `from jieba.analyse import ChineseAnalyzer`
* Usage example (a hedged sketch follows): https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
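A hedged sketch in the spirit of test/test_whoosh.py: only ChineseAnalyzer itself comes from jieba; the schema fields, the index directory name and the sample document are illustrative assumptions.

```python
# -*- coding: utf-8 -*-
import os
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in
from whoosh.qparser import QueryParser
from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()

# hypothetical schema: a title plus a body field tokenized by jieba
schema = Schema(title=ID(stored=True),
                content=TEXT(stored=True, analyzer=analyzer))

if not os.path.exists("tmp_idx"):
    os.mkdir("tmp_idx")
ix = create_in("tmp_idx", schema)

writer = ix.writer()
writer.add_document(title="1", content="我爱北京天安门")
writer.commit()

with ix.searcher() as searcher:
    parser = QueryParser("content", schema=ix.schema)
    for hit in searcher.search(parser.parse("北京")):
        print(hit["content"])
```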
Other dictionaries
========
1. A dictionary file with a smaller memory footprint
@@ -182,14 +238,14 @@ jieba uses lazy loading; "import jieba" does not immediately trigger loading of the dictionary
FAQ
=========
1) How is the model data generated? https://github.com/fxsjy/jieba/issues/7
2) What license is this library under? https://github.com/fxsjy/jieba/issues/2
For more questions, see: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed
Change Log
==========
-http://www.oschina.net/p/jieba/news#list
https://github.com/fxsjy/jieba/blob/master/Changelog
jieba
========
@@ -224,30 +280,30 @@ Function 1): cut
Code example: segmentation
==========

    #encoding=utf-8
    import jieba

    seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
    print("Full Mode:", "/ ".join(seg_list))  # full mode

    seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
    print("Default Mode:", "/ ".join(seg_list))  # default mode

    seg_list = jieba.cut("他来到了网易杭研大厦")
    print(", ".join(seg_list))

    seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode
    print(", ".join(seg_list))
Output:

    [Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
    [Accurate Mode]: 我/ 来到/ 北京/ 清华大学
    [Unknown Words Recognize] 他, 来到, 了, 网易, 杭研, 大厦    (In this case, "杭研" is not in the dictionary, but is identified by the Viterbi algorithm)
    [Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
@@ -259,13 +315,13 @@ Function 2): Add a custom dictionary
* The dictionary format is the same as that of `analyse/idf.txt`: one word per line; each line is divided into two parts, the first is the word itself, the other is the word frequency, separated by a space
* Example:

    云计算 5
    李小福 2
    创新办 3

    Before:  李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
    After loading the custom dictionary:  李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
Function 3): Keyword Extraction
================
@@ -275,7 +331,7 @@ Function 3): Keyword Extraction
Code sample (keyword extraction):

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
Using Other Dictionaries
========
@@ -296,10 +352,10 @@ Initialization
By default, Jieba employs lazy loading to only build the trie once it is necessary. This takes 1-3 seconds once, after which it is not initialized again. If you want to initialize Jieba manually, you can call:

    import jieba
    jieba.initialize()  # (optional)

You can also specify the dictionary (not supported before version 0.28):

    jieba.set_dictionary('data/dict.txt.big')
Segmentation speed

jieba/__init__.py

@@ -1,10 +1,9 @@
-from __future__ import with_statement
-import math
-import pprint
__version__ = '0.31'
__license__ = 'MIT'
import re
import os
import sys
from . import finalseg
import time
@@ -29,7 +28,7 @@ def gen_trie(f_name):
    trie = {}
    ltotal = 0.0
    with open(f_name, 'rb') as f:
        lineno = 0
        for line in f.read().rstrip().decode('utf-8').split('\n'):
            lineno += 1
            try:
@@ -39,7 +38,7 @@ def gen_trie(f_name):
                ltotal+=freq
                p = trie
                for c in word:
-                   if not c in p:
                    if c not in p:
                        p[c] ={}
                    p = p[c]
                p['']='' #ending flag
@@ -124,7 +123,7 @@ def __cut_all(sentence):
    for k,L in dag.items():
        if len(L)==1 and k>old_j:
            yield sentence[k:L[0]+1]
            old_j = L[0]
        else:
            for j in L:
                if j>k:
@@ -150,7 +149,7 @@ def get_DAG(sentence):
            if c in p:
                p = p[c]
                if '' in p:
-                   if not i in DAG:
                    if i not in DAG:
                        DAG[i]=[]
                    DAG[i].append(j)
                j+=1
@@ -163,7 +162,7 @@ def get_DAG(sentence):
            i+=1
            j=i
    for i in range(len(sentence)):
-       if not i in DAG:
        if i not in DAG:
            DAG[i] =[i]
    return DAG
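For reference, a hedged sketch of what the DAG built by get_DAG looks like from the caller's side; get_DAG and initialize are the module-level functions shown in this file, and the sample sentence is the one used in the README:

```python
import jieba

jieba.initialize()
sentence = "我来到北京清华大学"

# get_DAG maps each character index i to the end indices j for which
# sentence[i:j+1] is either a dictionary word or the single character itself.
dag = jieba.get_DAG(sentence)
for i, ends in sorted(dag.items()):
    print(i, [sentence[i:j + 1] for j in ends])
```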
@@ -186,7 +185,7 @@ def __cut_DAG(sentence):
                    yield buf
                    buf=''
                else:
-                   if not (buf in FREQ):
                    if (buf not in FREQ):
                        regognized = finalseg.cut(buf)
                        for t in regognized:
                            yield t
@@ -194,14 +193,14 @@ def __cut_DAG(sentence):
                        for elem in buf:
                            yield elem
                        buf=''
            yield l_word
            x =y
    if len(buf)>0:
        if len(buf)==1:
            yield buf
        else:
-           if not (buf in FREQ):
            if (buf not in FREQ):
                regognized = finalseg.cut(buf)
                for t in regognized:
                    yield t
@@ -210,7 +209,7 @@ def __cut_DAG(sentence):
                    yield elem
def cut(sentence,cut_all=False):
-   if( type(sentence) is bytes):
    if isinstance(sentence, bytes):
        try:
            sentence = sentence.decode('utf-8')
        except UnicodeDecodeError:
@@ -227,8 +226,9 @@ def cut(sentence,cut_all=False):
    if cut_all:
        cut_block = __cut_all
    for blk in blocks:
        if len(blk)==0:
            continue
        if re_han.match(blk):
-           #pprint.pprint(__cut_DAG(blk))
            for word in cut_block(blk):
                yield word
        else:
@@ -284,7 +284,7 @@ def add_word(word, freq, tag=None):
        user_word_tag_tab[word] = tag.strip()
    p = trie
    for c in word:
-       if not c in p:
        if c not in p:
            p[c] = {}
        p = p[c]
    p[''] = '' # ending flag
@@ -299,19 +299,23 @@ def __lcut_all(sentence):
def __lcut_for_search(sentence):
    return list(__ref_cut_for_search(sentence))

@require_initialized
-def enable_parallel(processnum):
def enable_parallel(processnum=None):
    global pool,cut,cut_for_search
    if os.name=='nt':
-       raise Exception("parallel mode only supports posix system")
        raise Exception("jieba: parallel mode only supports posix system")
    if sys.version_info[0]==2 and sys.version_info[1]<6:
        raise Exception("jieba: the parallel feature needs Python version>2.5 ")
-   from multiprocessing import Pool
    from multiprocessing import Pool,cpu_count
    if processnum==None:
        processnum = cpu_count()
    pool = Pool(processnum)

    def pcut(sentence,cut_all=False):
        parts = re.compile(b'([\r\n]+)').split(sentence)
        if cut_all:
            result = pool.map(__lcut_all,parts)
        else:
            result = pool.map(__lcut,parts)
        for r in result:
@@ -341,7 +345,7 @@ def set_dictionary(dictionary_path):
    with DICT_LOCK:
        abs_path = os.path.normpath( os.path.join( os.getcwd(), dictionary_path ) )
        if not os.path.exists(abs_path):
-           raise Exception("path does not exists:" + abs_path)
            raise Exception("jieba: path does not exists:" + abs_path)
        DICTIONARY = abs_path
        initialized = False
@@ -353,8 +357,8 @@ def get_abs_path_dict():
def tokenize(unicode_sentence,mode="default"):
    #mode ("default" or "search")
    if not isinstance(unicode_sentence, str):
-       raise Exception("jieba: the input parameter should string.")
        raise Exception("jieba: the input parameter should unicode.")
    start = 0
    if mode=='default':
        for w in cut(unicode_sentence):
            width = len(w)

jieba/analyse/__init__.py

@@ -2,9 +2,9 @@ import jieba
import os
try:
    from analyzer import ChineseAnalyzer
except ImportError:
    pass

_curpath=os.path.normpath( os.path.join( os.getcwd(), os.path.dirname(__file__) ) )
f_name = os.path.join(_curpath,"idf.txt")

jieba/analyse/analyzer.py

@@ -1,6 +1,6 @@
#encoding=utf-8
from whoosh.analysis import RegexAnalyzer,LowercaseFilter,StopFilter
from whoosh.analysis import Tokenizer,Token
import jieba
import re
@@ -31,4 +31,4 @@ class ChineseTokenizer(Tokenizer):
            yield token

def ChineseAnalyzer(stoplist=STOP_WORDS,minsize=1):
    return ChineseTokenizer() | LowercaseFilter() | StopFilter(stoplist=stoplist,minsize=minsize)

jieba/finalseg/__init__.py

@@ -1,12 +1,15 @@
-from math import log
-from . import prob_start
-from . import prob_trans
-from . import prob_emit
import re
import os
import marshal
import sys

MIN_FLOAT=-3.14e100

PROB_START_P = "prob_start.p"
PROB_TRANS_P = "prob_trans.p"
PROB_EMIT_P = "prob_emit.p"

PrevStatus = {
    'B':('E','S'),
    'M':('M','B'),
@@ -14,6 +17,35 @@ PrevStatus = {
    'E':('B','M')
}

def load_model():
    _curpath=os.path.normpath( os.path.join( os.getcwd(), os.path.dirname(__file__) ) )

    start_p = {}
    abs_path = os.path.join(_curpath, PROB_START_P)
    with open(abs_path, mode='rb') as f:
        start_p = marshal.load(f)
    f.closed

    trans_p = {}
    abs_path = os.path.join(_curpath, PROB_TRANS_P)
    with open(abs_path, 'rb') as f:
        trans_p = marshal.load(f)
    f.closed

    emit_p = {}
    abs_path = os.path.join(_curpath, PROB_EMIT_P)
    with file(abs_path, 'rb') as f:
        emit_p = marshal.load(f)
    f.closed

    return start_p, trans_p, emit_p

if sys.platform.startswith("java"):
    start_P, trans_P, emit_P = load_model()
else:
    import prob_start,prob_trans,prob_emit
    start_P, trans_P, emit_P = prob_start.P, prob_trans.P, prob_emit.P
def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{}] #tabular
    path = {}
@@ -29,14 +61,15 @@ def viterbi(obs, states, start_p, trans_p, emit_p):
            V[t][y] =prob
            newpath[y] = path[state] + [y]
        path = newpath
    (prob, state) = max([(V[len(obs) - 1][y], y) for y in ('E','S')])
    return (prob, path[state])

def __cut(sentence):
-   prob, pos_list = viterbi(sentence,('B','M','E','S'), prob_start.P, prob_trans.P, prob_emit.P)
    global emit_P
    prob, pos_list = viterbi(sentence,('B','M','E','S'), start_P, trans_P, emit_P)
    begin, next = 0,0
    #print pos_list, sentence
    for i,char in enumerate(sentence):

jieba/finalseg/prob_emit.p (new binary file, not shown)

jieba/finalseg/prob_start.p (new binary file, not shown)

jieba/finalseg/prob_trans.p (new binary file, not shown)

jieba/posseg/__init__.py

@@ -3,29 +3,62 @@ import os
from . import viterbi
import jieba
import sys
-from . import prob_start
-from . import prob_trans
-from . import prob_emit
-from . import char_state_tab
import marshal

default_encoding = sys.getfilesystemencoding()

PROB_START_P = "prob_start.p"
PROB_TRANS_P = "prob_trans.p"
PROB_EMIT_P = "prob_emit.p"
CHAR_STATE_TAB_P = "char_state_tab.p"

-def load_model(f_name):
-   prob_p_path = os.path.join(_curpath,f_name)
-   if f_name.endswith(".py"):
-       return eval(open(prob_p_path,"rb").read())
-   else:
-       result = {}
def load_model(f_name,isJython=True):
    _curpath=os.path.normpath( os.path.join( os.getcwd(), os.path.dirname(__file__) ) )

    result = {}
    with file(f_name, "rb") as f:
        for line in open(f_name,"rb"):
            line = line.strip()
            if line=="":continue
            line = line.decode("utf-8")
            word, _, tag = line.split(" ")
            result[word]=tag
    f.closed
    if not isJython:
        return result

    start_p = {}
    abs_path = os.path.join(_curpath, PROB_START_P)
    with open(abs_path, mode='rb') as f:
        start_p = marshal.load(f)
    f.closed

    trans_p = {}
    abs_path = os.path.join(_curpath, PROB_TRANS_P)
    with open(abs_path, 'rb') as f:
        trans_p = marshal.load(f)
    f.closed

    emit_p = {}
    abs_path = os.path.join(_curpath, PROB_EMIT_P)
    with file(abs_path, 'rb') as f:
        emit_p = marshal.load(f)
    f.closed

    state = {}
    abs_path = os.path.join(_curpath, CHAR_STATE_TAB_P)
    with file(abs_path, 'rb') as f:
        state = marshal.load(f)
    f.closed

    return state, start_p, trans_p, emit_p, result

-word_tag_tab = load_model(jieba.get_abs_path_dict())
if sys.platform.startswith("java"):
    char_state_tab_P, start_P, trans_P, emit_P, word_tag_tab = load_model(jieba.get_abs_path_dict())
else:
    import char_state_tab, prob_start, prob_trans, prob_emit
    char_state_tab_P, start_P, trans_P, emit_P = char_state_tab.P, prob_start.P, prob_trans.P, prob_emit.P
    word_tag_tab = load_model(jieba.get_abs_path_dict(),isJython=False)

if jieba.user_word_tag_tab:
    word_tag_tab.update(jieba.user_word_tag_tab)
@@ -48,7 +81,7 @@ class pair(object):
        return self.__unicode__().encode(arg)

def __cut(sentence):
-   prob, pos_list = viterbi.viterbi(sentence,char_state_tab.P, prob_start.P, prob_trans.P, prob_emit.P)
    prob, pos_list = viterbi.viterbi(sentence,char_state_tab_P, start_P, trans_P, emit_P)
    begin, next = 0,0
    for i,char in enumerate(sentence):
@@ -88,7 +121,7 @@ def __cut_detail(sentence):
def __cut_DAG(sentence):
    DAG = jieba.get_DAG(sentence)
    route ={}
    jieba.calc(sentence,DAG,0,route=route)
    x = 0
@@ -105,7 +138,7 @@ def __cut_DAG(sentence):
                    yield pair(buf,word_tag_tab.get(buf,'x'))
                    buf=''
                else:
-                   if not (buf in jieba.FREQ):
                    if (buf not in jieba.FREQ):
                        regognized = __cut_detail(buf)
                        for t in regognized:
                            yield t
@@ -120,7 +153,7 @@ def __cut_DAG(sentence):
        if len(buf)==1:
            yield pair(buf,word_tag_tab.get(buf,'x'))
        else:
-           if not (buf in jieba.FREQ):
            if (buf not in jieba.FREQ):
                regognized = __cut_detail(buf)
                for t in regognized:
                    yield t
@@ -129,7 +162,7 @@ def __cut_DAG(sentence):
            yield pair(elem,word_tag_tab.get(elem,'x'))
def __cut_internal(sentence):
-   if not ( type(sentence) is str):
    if not isinstance(sentence, str):
        try:
            sentence = sentence.decode('utf-8')
        except:
@@ -166,7 +199,7 @@ def cut(sentence):
            yield w
    else:
        parts = re.compile('([\r\n]+)').split(sentence)
        result = jieba.pool.map(__lcut_internal,parts)
        for r in result:
            for w in r:
                yield w

(new binary file, not shown; likely jieba/posseg/char_state_tab.p)

jieba/posseg/prob_emit.p (new binary file, not shown)

jieba/posseg/prob_start.p (new binary file, not shown)

jieba/posseg/prob_trans.p (new binary file, not shown)

setup.py

@@ -1,6 +1,6 @@
from distutils.core import setup
setup(name='jieba',
-     version='0.29.1',
      version='0.31',
      description='Chinese Words Segementation Utilities',
      author='Sun, Junyi',
      author_email='ccnusjy@gmail.com',

test/parallel/test_file.py

@@ -2,18 +2,20 @@ import sys,time
import sys
sys.path.append("../../")
import jieba

-jieba.enable_parallel(4)
jieba.enable_parallel()

url = sys.argv[1]
-content = open(url,"rb").read()
-words = list(jieba.cut(content))
with open(url,"rb") as content:
    content = content.read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1
print('cost',tm_cost)
print('speed' , len(content)/tm_cost, " bytes/second")

-log_f = open("1.log","wb")
-for w in words:
-    log_f.write(w.encode("utf-8"))
with open("1.log","wb") as log_f:
    log_f.write(words.encode('utf-8'))

test/test_file.py

@@ -5,17 +5,15 @@ import jieba
jieba.initialize()

url = sys.argv[1]
-content = open(url,"rb").read()
-words = list(jieba.cut(content))
with open(url,"rb") as content:
    content = content.read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1

-log_f = open("1.log","wb")
-log_f.write(bytes("/ ".join(words),'utf-8'))

print('cost',tm_cost)
print('speed' , len(content)/tm_cost, " bytes/second")

with open("1.log","wb") as log_f:
    log_f.write(words.encode('utf-8'))