Merge pull request #100 from ZoeyYoung/jieba3k

Jieba3k
Sun Junyi 2013-08-21 00:50:47 -07:00
commit d16727ba89
20 changed files with 308 additions and 139 deletions

.gitignore (vendored): 4 changes

@@ -164,3 +164,7 @@ pip-log.txt
*.log
test/tmp/*
#jython
*.class
MANIFEST

Changelog

@@ -1,3 +1,20 @@
2013-07-01: version 0.31
1. Reformatted the code indentation to follow the PEP 8 standard
2. Added support for the Jython interpreter; thanks @piaolingxue
3. Fixed a bug where mixed Chinese-English words starting with digits were not recognized
4. Refactored part of the code; thanks @chao78787
5. Parallel segmentation mode now detects the number of CPUs automatically and picks a suitable process count (a short sketch follows this entry); thanks @linkerlin
6. Fixed the incorrect dependency of the extract_tags method on the whoosh module introduced in version 0.30
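Item 5 in practice, as a minimal sketch: calling enable_parallel() with no argument is assumed to pick the process count via multiprocessing.cpu_count(), which is what the updated enable_parallel further down in this diff does. Raw bytes are passed because the parallel code path in this version splits its input with a bytes pattern, as test/parallel/test_file.py does.

```python
import jieba

jieba.enable_parallel()   # no argument: process count = detected CPU count

# bytes input, mirroring test/parallel/test_file.py
text = "我来到北京清华大学\n他来到了网易杭研大厦\n".encode("utf-8")
print("/ ".join(jieba.cut(text)))

jieba.disable_parallel()
```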
2013-07-01: version 0.30
==========================
1) Added the jieba.tokenize method, which returns each word's position in the original text
2) Added ChineseAnalyzer to support the whoosh search engine
3) Added more mixed Chinese-English words to the dictionary
4) Changed how some .py files are loaded so that py2exe and cx_Freeze can package jieba into an exe
2013-06-17: version 0.29.1
==========================
1) Optimized the viterbi algorithm code; segmentation speed improved by 15%
@@ -25,8 +42,8 @@
2013-04-27: version 0.28
========================
1) Added lazy loading of the dictionary; the dictionary path can be changed after 'import jieba'. Thanks hermanschaaf
2) The offending entry is now reported when the dictionary fails to load. Thanks neuront
3) Fixed a bug where a dictionary edited with vim failed to load. Thanks neuront
2013-04-22: version 0.27
========================
@@ -63,7 +80,7 @@
2012-11-28: version 0.22
========================
1) Added the jieba.cut_for_search method, which further splits "long words" on top of accurate mode; it is intended for search-engine indexing and has higher recall than accurate mode.
2) Started supporting Python 3.x (previously only Python 2.x was supported); from this version on there is a separate jieba3k branch.
2012-11-23: version 0.21
@@ -74,7 +91,7 @@
2012-11-06: version 0.20
========================
1) Added part-of-speech tagging
2012-10-25: version 0.19

LICENSE (new file): 20 lines

@@ -0,0 +1,20 @@
The MIT License (MIT)
Copyright (c) 2013 Sun Junyi
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

MANIFEST.in (new file): 2 lines

@@ -0,0 +1,2 @@
graft README.md
graft Changelog

README.md: 176 changes

@@ -14,9 +14,9 @@ jieba
Features
========
* Three segmentation modes are supported:
    * Accurate mode: tries to cut the sentence into the most precise segmentation, suitable for text analysis;
    * Full mode: scans out every word that could possibly form a word, very fast, but cannot resolve ambiguity;
    * Search-engine mode: on top of accurate mode, long words are cut again to improve recall, suitable for search-engine indexing (a short sketch follows this list).
* Traditional Chinese segmentation is supported
* Custom dictionaries are supported
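A minimal sketch of the three modes; the sample sentences are the ones used in the code example further down in this README:

```python
# -*- coding: utf-8 -*-
import jieba

sentence = "我来到北京清华大学"

print("Full Mode: " + "/ ".join(jieba.cut(sentence, cut_all=True)))       # full mode
print("Accurate Mode: " + "/ ".join(jieba.cut(sentence, cut_all=False)))  # accurate (default) mode
print("Search Mode: " + ", ".join(
    jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")))
```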
@@ -29,19 +29,31 @@ http://jiebademo.ap01.aws.af.cm/
(Powered by Appfog)

Demo site source code: https://github.com/fxsjy/jiebademo

-Python Version
-==============
-* The master branch currently supports only Python 2.x
-* The Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k

-Usage
-========
Installation under Python 2.x
===================
* Fully automatic installation: `easy_install jieba` or `pip install jieba`
* Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ , extract it, then run `python setup.py install`
* Manual installation: place the jieba directory in the current directory or in site-packages
* Load it with `import jieba`; the first import builds the Trie, which takes a few seconds

Installation under Python 3.x
====================
* The master branch currently supports only Python 2.x
* The Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k

      git clone https://github.com/fxsjy/jieba.git
      git checkout jieba3k
      python setup.py install

Jieba for Java
================
Author: piaolingxue
Repository: https://github.com/huaban/jieba-analysis
Algorithm
========
* A Trie-based structure enables efficient word-graph scanning, producing a directed acyclic graph (DAG) of all possible word formations over the Chinese characters in a sentence
@@ -76,13 +88,13 @@ Algorithm
Output:

[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Accurate Mode]: 我/ 来到/ 北京/ 清华大学

[New Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦    ("杭研" is not in the dictionary, but it is recognized by the Viterbi algorithm)

[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
Feature 2): Adding a custom dictionary
================
@@ -92,16 +104,16 @@ Output:
* The dictionary format is the same as that of `dict.txt`: one word per line; each line has three parts separated by spaces: the word, its frequency, and optionally its part of speech
* Example:
    * Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
    * Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py (a minimal sketch also follows this list)
    * Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
    * After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
* "Improving ambiguity resolution with a user-defined dictionary" --- https://github.com/fxsjy/jieba/issues/14
Feature 3): Keyword extraction
@@ -112,36 +124,80 @@ Output:
Code example (keyword extraction); a hedged sketch follows:

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
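A hedged sketch of keyword extraction in the spirit of test/extract_tags.py; the topK parameter (number of keywords to return) is assumed to be available on jieba.analyse.extract_tags in this version:

```python
# -*- coding: utf-8 -*-
import jieba.analyse

text = "小明硕士毕业于中国科学院计算所,后在日本京都大学深造"

# extract the top 5 keywords ranked by TF-IDF weight
tags = jieba.analyse.extract_tags(text, topK=5)
print(", ".join(tags))
```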
Feature 4): Part-of-speech tagging
================
* Tags each word of the segmented sentence with its part of speech, using ICTCLAS-compatible tags
* Usage example:
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for w in words:
...     print w.word, w.flag
...
我 r
爱 v
北京 ns
天安门 ns
Feature 5): Parallel segmentation
==================
* Principle: split the target text by line, hand the lines to several Python processes to segment in parallel, then merge the results, giving a considerable speed-up
* Built on Python's standard multiprocessing module; Windows is currently not supported
* Usage:
    * `jieba.enable_parallel(4)`  # enable parallel mode; the argument is the number of worker processes
    * `jieba.disable_parallel()`  # disable parallel mode
* Example (a sketch also follows this list):
    https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
* Result: on a 4-core 3.4 GHz Linux machine, accurate-mode segmentation of the complete works of Jin Yong reached 1 MB/s, 3.3 times the single-process speed.
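Roughly what test/parallel/test_file.py (updated later in this diff) does; the input file path comes from the command line and should point to a large UTF-8 text file:

```python
import sys
import time
import jieba

jieba.enable_parallel()                 # process count auto-detected

with open(sys.argv[1], "rb") as f:      # the file to segment
    content = f.read()

t1 = time.time()
words = "/ ".join(jieba.cut(content))
tm_cost = time.time() - t1

with open("1.log", "wb") as log_f:
    log_f.write(words.encode("utf-8"))

print("cost", tm_cost)
print("speed", len(content) / tm_cost, "bytes/second")
```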
Feature 6): Tokenize, returning each word's start and end position in the original text
============================================
* Note: the input must be unicode
* Default mode
```python
result = jieba.tokenize('永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
```
```
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限公司 start: 6 end:10
```
* Search mode
```python
result = jieba.tokenize('永和服装饰品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
```
```
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限 start: 6 end:8
word 公司 start: 8 end:10
word 有限公司 start: 6 end:10
```
Feature 7): ChineseAnalyzer for the Whoosh search engine
============================================
* Import: `from jieba.analyse import ChineseAnalyzer`
* Usage example (a hedged sketch follows): https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
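A hedged sketch in the spirit of test/test_whoosh.py: only ChineseAnalyzer itself comes from jieba; the schema fields, the index directory name and the sample document are illustrative assumptions.

```python
# -*- coding: utf-8 -*-
import os
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in
from whoosh.qparser import QueryParser
from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()

# hypothetical schema: a title plus a body field tokenized by jieba
schema = Schema(title=ID(stored=True),
                content=TEXT(stored=True, analyzer=analyzer))

if not os.path.exists("tmp_idx"):
    os.mkdir("tmp_idx")
ix = create_in("tmp_idx", schema)

writer = ix.writer()
writer.add_document(title="1", content="我爱北京天安门")
writer.commit()

with ix.searcher() as searcher:
    parser = QueryParser("content", schema=ix.schema)
    for hit in searcher.search(parser.parse("北京")):
        print(hit["content"])
```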
Other dictionaries
========
1. A dictionary file with a smaller memory footprint
@@ -182,14 +238,14 @@ jieba uses lazy loading; "import jieba" does not immediately trigger loading of the dictionary
FAQ
=========
1) How is the model data generated? https://github.com/fxsjy/jieba/issues/7
2) What license is this library under? https://github.com/fxsjy/jieba/issues/2
For more questions, see: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed
Change Log
==========
-http://www.oschina.net/p/jieba/news#list
https://github.com/fxsjy/jieba/blob/master/Changelog
jieba
========
@@ -224,30 +280,30 @@ Function 1): cut
Code example: segmentation
==========

    #encoding=utf-8
    import jieba

    seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
    print("Full Mode:", "/ ".join(seg_list))  # full mode

    seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
    print("Default Mode:", "/ ".join(seg_list))  # default mode

    seg_list = jieba.cut("他来到了网易杭研大厦")
    print(", ".join(seg_list))

    seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode
    print(", ".join(seg_list))
Output:

    [Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
    [Accurate Mode]: 我/ 来到/ 北京/ 清华大学
    [Unknown Words Recognize] 他, 来到, 了, 网易, 杭研, 大厦    (In this case, "杭研" is not in the dictionary, but is identified by the Viterbi algorithm)
    [Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
@@ -259,13 +315,13 @@ Function 2): Add a custom dictionary
* The dictionary format is the same as that of `analyse/idf.txt`: one word per line; each line is divided into two parts, the first is the word itself, the other is the word frequency, separated by a space
* Example:

    云计算 5
    李小福 2
    创新办 3

    Before:  李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
    After loading the custom dictionary:  李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
Function 3): Keyword Extraction
================
@@ -275,7 +331,7 @@ Function 3): Keyword Extraction
Code sample (keyword extraction):

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
Using Other Dictionaries
========
@@ -296,10 +352,10 @@ Initialization
By default, Jieba employs lazy loading to only build the trie once it is necessary. This takes 1-3 seconds once, after which it is not initialized again. If you want to initialize Jieba manually, you can call:

    import jieba
    jieba.initialize()  # (optional)

You can also specify the dictionary (not supported before version 0.28):

    jieba.set_dictionary('data/dict.txt.big')
Segmentation speed

jieba/__init__.py

@@ -1,10 +1,9 @@
-from __future__ import with_statement
-import math
-import pprint
__version__ = '0.31'
__license__ = 'MIT'
import re
import os
import sys
from . import finalseg
import time
@@ -29,7 +28,7 @@ def gen_trie(f_name):
    trie = {}
    ltotal = 0.0
    with open(f_name, 'rb') as f:
        lineno = 0
        for line in f.read().rstrip().decode('utf-8').split('\n'):
            lineno += 1
            try:
@@ -39,7 +38,7 @@ def gen_trie(f_name):
                ltotal+=freq
                p = trie
                for c in word:
-                   if not c in p:
                    if c not in p:
                        p[c] ={}
                    p = p[c]
                p['']='' #ending flag
@@ -124,7 +123,7 @@ def __cut_all(sentence):
    for k,L in dag.items():
        if len(L)==1 and k>old_j:
            yield sentence[k:L[0]+1]
            old_j = L[0]
        else:
            for j in L:
                if j>k:
@@ -150,7 +149,7 @@ def get_DAG(sentence):
            if c in p:
                p = p[c]
                if '' in p:
-                   if not i in DAG:
                    if i not in DAG:
                        DAG[i]=[]
                    DAG[i].append(j)
                j+=1
@@ -163,7 +162,7 @@ def get_DAG(sentence):
            i+=1
            j=i
    for i in range(len(sentence)):
-       if not i in DAG:
        if i not in DAG:
            DAG[i] =[i]
    return DAG
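For reference, a hedged sketch of what the DAG built by get_DAG looks like from the caller's side; get_DAG and initialize are the module-level functions shown in this file, and the sample sentence is the one used in the README:

```python
import jieba

jieba.initialize()
sentence = "我来到北京清华大学"

# get_DAG maps each character index i to the end indices j for which
# sentence[i:j+1] is either a dictionary word or the single character itself.
dag = jieba.get_DAG(sentence)
for i, ends in sorted(dag.items()):
    print(i, [sentence[i:j + 1] for j in ends])
```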
@@ -186,7 +185,7 @@ def __cut_DAG(sentence):
                    yield buf
                    buf=''
                else:
-                   if not (buf in FREQ):
                    if (buf not in FREQ):
                        regognized = finalseg.cut(buf)
                        for t in regognized:
                            yield t
@@ -194,14 +193,14 @@ def __cut_DAG(sentence):
                        for elem in buf:
                            yield elem
                        buf=''
            yield l_word
            x =y
    if len(buf)>0:
        if len(buf)==1:
            yield buf
        else:
-           if not (buf in FREQ):
            if (buf not in FREQ):
                regognized = finalseg.cut(buf)
                for t in regognized:
                    yield t
@@ -210,7 +209,7 @@ def __cut_DAG(sentence):
                    yield elem
def cut(sentence,cut_all=False):
-   if( type(sentence) is bytes):
    if isinstance(sentence, bytes):
        try:
            sentence = sentence.decode('utf-8')
        except UnicodeDecodeError:
@@ -227,8 +226,9 @@ def cut(sentence,cut_all=False):
    if cut_all:
        cut_block = __cut_all
    for blk in blocks:
        if len(blk)==0:
            continue
        if re_han.match(blk):
-           #pprint.pprint(__cut_DAG(blk))
            for word in cut_block(blk):
                yield word
        else:
@@ -284,7 +284,7 @@ def add_word(word, freq, tag=None):
        user_word_tag_tab[word] = tag.strip()
    p = trie
    for c in word:
-       if not c in p:
        if c not in p:
            p[c] = {}
        p = p[c]
    p[''] = '' # ending flag
@@ -299,19 +299,23 @@ def __lcut_all(sentence):
def __lcut_for_search(sentence):
    return list(__ref_cut_for_search(sentence))

@require_initialized
-def enable_parallel(processnum):
def enable_parallel(processnum=None):
    global pool,cut,cut_for_search
    if os.name=='nt':
-       raise Exception("parallel mode only supports posix system")
        raise Exception("jieba: parallel mode only supports posix system")
    if sys.version_info[0]==2 and sys.version_info[1]<6:
        raise Exception("jieba: the parallel feature needs Python version>2.5 ")
-   from multiprocessing import Pool
    from multiprocessing import Pool,cpu_count
    if processnum==None:
        processnum = cpu_count()
    pool = Pool(processnum)

    def pcut(sentence,cut_all=False):
        parts = re.compile(b'([\r\n]+)').split(sentence)
        if cut_all:
            result = pool.map(__lcut_all,parts)
        else:
            result = pool.map(__lcut,parts)
        for r in result:
@@ -341,7 +345,7 @@ def set_dictionary(dictionary_path):
    with DICT_LOCK:
        abs_path = os.path.normpath( os.path.join( os.getcwd(), dictionary_path ) )
        if not os.path.exists(abs_path):
-           raise Exception("path does not exists:" + abs_path)
            raise Exception("jieba: path does not exists:" + abs_path)
        DICTIONARY = abs_path
        initialized = False
@@ -353,8 +357,8 @@ def get_abs_path_dict():
def tokenize(unicode_sentence,mode="default"):
    #mode ("default" or "search")
    if not isinstance(unicode_sentence, str):
-       raise Exception("jieba: the input parameter should string.")
        raise Exception("jieba: the input parameter should unicode.")
    start = 0
    if mode=='default':
        for w in cut(unicode_sentence):
            width = len(w)

jieba/analyse/__init__.py

@@ -2,9 +2,9 @@ import jieba
import os
try:
    from analyzer import ChineseAnalyzer
except ImportError:
    pass

_curpath=os.path.normpath( os.path.join( os.getcwd(), os.path.dirname(__file__) ) )
f_name = os.path.join(_curpath,"idf.txt")

jieba/analyse/analyzer.py

@@ -1,6 +1,6 @@
#encoding=utf-8
from whoosh.analysis import RegexAnalyzer,LowercaseFilter,StopFilter
from whoosh.analysis import Tokenizer,Token
import jieba
import re
@@ -31,4 +31,4 @@ class ChineseTokenizer(Tokenizer):
            yield token

def ChineseAnalyzer(stoplist=STOP_WORDS,minsize=1):
    return ChineseTokenizer() | LowercaseFilter() | StopFilter(stoplist=stoplist,minsize=minsize)

jieba/finalseg/__init__.py

@@ -1,12 +1,15 @@
-from math import log
-from . import prob_start
-from . import prob_trans
-from . import prob_emit
import re
import os
import marshal
import sys

MIN_FLOAT=-3.14e100

PROB_START_P = "prob_start.p"
PROB_TRANS_P = "prob_trans.p"
PROB_EMIT_P = "prob_emit.p"

PrevStatus = {
    'B':('E','S'),
    'M':('M','B'),
@@ -14,6 +17,35 @@ PrevStatus = {
    'E':('B','M')
}

def load_model():
    _curpath=os.path.normpath( os.path.join( os.getcwd(), os.path.dirname(__file__) ) )

    start_p = {}
    abs_path = os.path.join(_curpath, PROB_START_P)
    with open(abs_path, mode='rb') as f:
        start_p = marshal.load(f)
    f.closed

    trans_p = {}
    abs_path = os.path.join(_curpath, PROB_TRANS_P)
    with open(abs_path, 'rb') as f:
        trans_p = marshal.load(f)
    f.closed

    emit_p = {}
    abs_path = os.path.join(_curpath, PROB_EMIT_P)
    with file(abs_path, 'rb') as f:
        emit_p = marshal.load(f)
    f.closed

    return start_p, trans_p, emit_p

if sys.platform.startswith("java"):
    start_P, trans_P, emit_P = load_model()
else:
    import prob_start,prob_trans,prob_emit
    start_P, trans_P, emit_P = prob_start.P, prob_trans.P, prob_emit.P
def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{}] #tabular
    path = {}
@@ -29,14 +61,15 @@ def viterbi(obs, states, start_p, trans_p, emit_p):
            V[t][y] =prob
            newpath[y] = path[state] + [y]
        path = newpath
    (prob, state) = max([(V[len(obs) - 1][y], y) for y in ('E','S')])
    return (prob, path[state])

def __cut(sentence):
-   prob, pos_list = viterbi(sentence,('B','M','E','S'), prob_start.P, prob_trans.P, prob_emit.P)
    global emit_P
    prob, pos_list = viterbi(sentence,('B','M','E','S'), start_P, trans_P, emit_P)
    begin, next = 0,0
    #print pos_list, sentence
    for i,char in enumerate(sentence):

jieba/finalseg/prob_emit.p (new binary file, not shown)

jieba/finalseg/prob_start.p (new binary file, not shown)

jieba/finalseg/prob_trans.p (new binary file, not shown)

jieba/posseg/__init__.py

@@ -3,29 +3,62 @@ import os
from . import viterbi
import jieba
import sys
-from . import prob_start
-from . import prob_trans
-from . import prob_emit
-from . import char_state_tab
import marshal

default_encoding = sys.getfilesystemencoding()

PROB_START_P = "prob_start.p"
PROB_TRANS_P = "prob_trans.p"
PROB_EMIT_P = "prob_emit.p"
CHAR_STATE_TAB_P = "char_state_tab.p"

-def load_model(f_name):
-   prob_p_path = os.path.join(_curpath,f_name)
-   if f_name.endswith(".py"):
-       return eval(open(prob_p_path,"rb").read())
-   else:
-       result = {}
def load_model(f_name,isJython=True):
    _curpath=os.path.normpath( os.path.join( os.getcwd(), os.path.dirname(__file__) ) )

    result = {}
    with file(f_name, "rb") as f:
        for line in open(f_name,"rb"):
            line = line.strip()
            if line=="":continue
            line = line.decode("utf-8")
            word, _, tag = line.split(" ")
            result[word]=tag
    f.closed
    if not isJython:
        return result

    start_p = {}
    abs_path = os.path.join(_curpath, PROB_START_P)
    with open(abs_path, mode='rb') as f:
        start_p = marshal.load(f)
    f.closed

    trans_p = {}
    abs_path = os.path.join(_curpath, PROB_TRANS_P)
    with open(abs_path, 'rb') as f:
        trans_p = marshal.load(f)
    f.closed

    emit_p = {}
    abs_path = os.path.join(_curpath, PROB_EMIT_P)
    with file(abs_path, 'rb') as f:
        emit_p = marshal.load(f)
    f.closed

    state = {}
    abs_path = os.path.join(_curpath, CHAR_STATE_TAB_P)
    with file(abs_path, 'rb') as f:
        state = marshal.load(f)
    f.closed

    return state, start_p, trans_p, emit_p, result

-word_tag_tab = load_model(jieba.get_abs_path_dict())
if sys.platform.startswith("java"):
    char_state_tab_P, start_P, trans_P, emit_P, word_tag_tab = load_model(jieba.get_abs_path_dict())
else:
    import char_state_tab, prob_start, prob_trans, prob_emit
    char_state_tab_P, start_P, trans_P, emit_P = char_state_tab.P, prob_start.P, prob_trans.P, prob_emit.P
    word_tag_tab = load_model(jieba.get_abs_path_dict(),isJython=False)

if jieba.user_word_tag_tab:
    word_tag_tab.update(jieba.user_word_tag_tab)
@@ -48,7 +81,7 @@ class pair(object):
        return self.__unicode__().encode(arg)

def __cut(sentence):
-   prob, pos_list = viterbi.viterbi(sentence,char_state_tab.P, prob_start.P, prob_trans.P, prob_emit.P)
    prob, pos_list = viterbi.viterbi(sentence,char_state_tab_P, start_P, trans_P, emit_P)
    begin, next = 0,0
    for i,char in enumerate(sentence):
@@ -88,7 +121,7 @@ def __cut_detail(sentence):
def __cut_DAG(sentence):
    DAG = jieba.get_DAG(sentence)
    route ={}
    jieba.calc(sentence,DAG,0,route=route)
    x = 0
@@ -105,7 +138,7 @@ def __cut_DAG(sentence):
                    yield pair(buf,word_tag_tab.get(buf,'x'))
                    buf=''
                else:
-                   if not (buf in jieba.FREQ):
                    if (buf not in jieba.FREQ):
                        regognized = __cut_detail(buf)
                        for t in regognized:
                            yield t
@@ -120,7 +153,7 @@ def __cut_DAG(sentence):
        if len(buf)==1:
            yield pair(buf,word_tag_tab.get(buf,'x'))
        else:
-           if not (buf in jieba.FREQ):
            if (buf not in jieba.FREQ):
                regognized = __cut_detail(buf)
                for t in regognized:
                    yield t
@@ -129,7 +162,7 @@ def __cut_DAG(sentence):
            yield pair(elem,word_tag_tab.get(elem,'x'))
def __cut_internal(sentence):
-   if not ( type(sentence) is str):
    if not isinstance(sentence, str):
        try:
            sentence = sentence.decode('utf-8')
        except:
@@ -166,7 +199,7 @@ def cut(sentence):
            yield w
    else:
        parts = re.compile('([\r\n]+)').split(sentence)
        result = jieba.pool.map(__lcut_internal,parts)
        for r in result:
            for w in r:
                yield w

(new binary file, not shown; likely jieba/posseg/char_state_tab.p)

jieba/posseg/prob_emit.p (new binary file, not shown)

jieba/posseg/prob_start.p (new binary file, not shown)

jieba/posseg/prob_trans.p (new binary file, not shown)

setup.py

@@ -1,6 +1,6 @@
from distutils.core import setup
setup(name='jieba',
-     version='0.29.1',
      version='0.31',
      description='Chinese Words Segementation Utilities',
      author='Sun, Junyi',
      author_email='ccnusjy@gmail.com',

test/parallel/test_file.py

@@ -2,18 +2,20 @@ import sys,time
import sys
sys.path.append("../../")
import jieba

-jieba.enable_parallel(4)
jieba.enable_parallel()

url = sys.argv[1]
-content = open(url,"rb").read()
-words = list(jieba.cut(content))
with open(url,"rb") as content:
    content = content.read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1
print('cost',tm_cost)
print('speed' , len(content)/tm_cost, " bytes/second")

-log_f = open("1.log","wb")
-for w in words:
-    log_f.write(w.encode("utf-8"))
with open("1.log","wb") as log_f:
    log_f.write(words.encode('utf-8'))

test/test_file.py

@@ -5,17 +5,15 @@ import jieba
jieba.initialize()

url = sys.argv[1]
-content = open(url,"rb").read()
-words = list(jieba.cut(content))
with open(url,"rb") as content:
    content = content.read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1

-log_f = open("1.log","wb")
-log_f.write(bytes("/ ".join(words),'utf-8'))

print('cost',tm_cost)
print('speed' , len(content)/tm_cost, " bytes/second")

with open("1.log","wb") as log_f:
    log_f.write(words.encode('utf-8'))