Update README.md update paddle link. (#817 )

fix setup.py in python2.7
update version: 0.42
2025-07-10 00:01:33 +08:00 · 2020-02-15 16:33:35 +08:00 · 2020-01-20 22:22:34 +08:00 · 2020-01-13 21:24:45 +08:00 · 2020-01-13 21:03:38 +08:00 · 2020-01-13 20:53:43 +08:00
59 changed files with 23213 additions and 911 deletions
--- a/37
+++ b/37
@ -1,3 +1,40 @@
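 2020-1-20: version 0.42.1
 1. Fixed setup.py failing to work under Python 2.7 (issue #809)
 2020-1-13: version 0.42
 1. Fixed a core dump on empty strings in paddle mode @JesseyXujin
 2. Fixed dropped characters when segmenting in cut_all mode @fxsjy
 3. Improved detection of the paddle installation @vissssa
 2020-1-8: version 0.41
 1. Made paddle mode easier to enable
 2. Fixed cut_all mode not supporting mixed Chinese-English words
 2019-12-25: version 0.40
 1. Added a paddle-based deep learning segmentation mode (use_paddle=True); by @JesseyXujin, @xyzhou-puck
 2. Fixed add_word on a custom Tokenizer instance pointing to the global tokenizer; by @linhx13
 3. Fixed an import bug in the whoosh test case; by @ZhengZixiang
 4. Fixed custom dictionaries not supporting words that contain "-"; by @JimCurryWang
 2017-08-28: version 0.39
 1. del_word can now force a word to be split apart; by @gumblex, @fxsjy
 2. Fixed segmentation of percentages; by @fxsjy
 3. Fixed a bug with HMM=False in multiprocessing mode; by @huntzhan
 2015-12-16: version 0.38
 1. The default dictionary is loaded through pkg_resources, which allows running on platforms such as Spark; by @gumblex
 2. Extended the recognized Unicode range of Chinese characters to [\u4E00-\u9FD5]; by @gumblex
 3. Keyword extraction can return POS tags; fixed the problem of using pairs produced by posseg as dict keys; by @jerryday
 4. Fixed load_userdict failing to recognize user dictionary entries that contain spaces and other special characters; by @gumblex
 5. Command-line segmentation can return POS tags; by @gumblex
 2015-06-27: version 0.37
 1. Refactored the code: the tokenizer is wrapped in a class and can be instantiated; by @gumblex (https://github.com/fxsjy/jieba/commit/94840a734c32cfece05c0c3ec236ffc3d36b4ae6)
 2. Fixed a bug in cut_for_search and improved posseg; by @gumblex
 3. Fixed a posseg bug introduced in 0.36; by @wangbin
 4. Fixed a bug in load_userdict exception handling; by @gip0
 5. Fixed a cross-filesystem bug when generating the binary dictionary cache file; customization is supported; by @gumblex
 2015-03-20: version 0.36
 1. The code is compatible with both python2 and python3; several performance optimizations; by @gumblex
 2. Solved the automatic probability calculation for user-added words, making segmentation more accurate; by @gumblex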
--- a/README.md
+++ b/README.md
@ -9,24 +9,15 @@ jieba
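 Features
 ========
-* Supports three segmentation modes:
+* Supports four segmentation modes:
    * Accurate mode, which tries to cut the sentence as precisely as possible and suits text analysis;
    * Full mode, which scans out every word that can be formed in the sentence; it is very fast but cannot resolve ambiguity;
    * Search engine mode, which builds on accurate mode and cuts long words again to improve recall; it suits search engine segmentation.
-
+    * paddle mode, which uses the PaddlePaddle deep learning framework to train a sequence labeling (bidirectional GRU) network model for segmentation. It also supports POS tagging. paddle mode requires paddlepaddle-tiny: `pip install paddlepaddle-tiny==1.6.1`. paddle mode is currently supported in jieba v0.40 and later; for versions below v0.40, upgrade jieba with `pip install jieba --upgrade`. [PaddlePaddle website](https://www.paddlepaddle.org.cn/)
 * Supports Traditional Chinese segmentation
 * Supports custom dictionaries
 * MIT license
 Online demo
 =========
 http://jiebademo.ap01.aws.af.cm/
 (Powered by Appfog)
 Website code: https://github.com/fxsjy/jiebademo
 Installation
 =======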
@ -36,6 +27,7 @@ http://jiebademo.ap01.aws.af.cm/
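 * Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ , unpack it, and run `python setup.py install`
 * Manual installation: place the jieba directory in the current directory or in the site-packages directory
 * Import it with `import jieba`
 * To use segmentation and POS tagging in paddle mode, first install paddlepaddle-tiny: `pip install paddlepaddle-tiny==1.6.1`.
 Algorithm
 ========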
@ -45,19 +37,27 @@ http://jiebademo.ap01.aws.af.cm/
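 Main Functions
 =======
-1) : Segmentation
+1. Segmentation
 --------
-* `jieba.cut` accepts three input parameters: the string to segment; `cut_all`, which controls whether full mode is used; and `HMM`, which controls whether the HMM model is used
+* `jieba.cut` accepts four input parameters: the string to segment; `cut_all`, which controls whether full mode is used; `HMM`, which controls whether the HMM model is used; and `use_paddle`, which controls whether paddle mode is used for segmentation. paddle mode is loaded lazily: the enable_paddle interface installs paddlepaddle-tiny and imports the related code;
 * `jieba.cut_for_search` accepts two parameters: the string to segment, and whether to use the HMM model. It suits segmentation for building a search engine's inverted index, and its granularity is fairly fine
 * The string to segment can be a unicode string, a UTF-8 string, or a GBK string. Note: passing GBK strings directly is not recommended, because they may be unpredictably mis-decoded as UTF-8
-* `jieba.cut` and `jieba.cut_for_search` both return an iterable generator; use a for loop to get each segmented word (unicode), or use list(jieba.cut(...)) to convert the result to a list
+* `jieba.cut` and `jieba.cut_for_search` both return an iterable generator; use a for loop to get each segmented word (unicode), or use
 * `jieba.lcut` and `jieba.lcut_for_search` return a list directly
 * `jieba.Tokenizer(dictionary=DEFAULT_DICT)` creates a custom tokenizer, which lets you use different dictionaries at the same time. `jieba.dt` is the default tokenizer, and all global segmentation functions are mapped to it.
-Code example (segmentation)
+Code example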
 ```python
 # encoding=utf-8
 import jieba
 jieba.enable_paddle()  # enable paddle mode; supported since version 0.40, not in earlier versions
 strs=["我来到北京清华大学","乒乓球拍卖完了","中国科学技术大学"]
 for str in strs:
    seg_list = jieba.cut(str,use_paddle=True)  # use paddle mode
    print("Paddle Mode: " + '/'.join(list(seg_list)))
 seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
 print("Full Mode: " + "/ ".join(seg_list))  # full mode
@ -81,15 +81,26 @@ print(", ".join(seg_list))
    【搜索引擎模式】： 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
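-2) : Add a custom dictionary
+2. Add a custom dictionary
 ----------------
 ### Load dictionary
 * Developers can specify their own custom dictionary to include words that are missing from the jieba dictionary. Although jieba can recognize new words, adding them yourself ensures higher accuracy
-* Usage: jieba.load_userdict(file_name) # file_name is the path of the custom dictionary
+* Usage: jieba.load_userdict(file_name) # file_name is a file-like object or the path of the custom dictionary
-* The dictionary format is the same as `dict.txt`: one word per line; each line has three parts, the word, then the word frequency (optional), and finally the POS tag (optional), separated by spaces
+* The dictionary format is the same as `dict.txt`: one word per line; each line has three parts: the word, the word frequency (optional), and the POS tag (optional), separated by spaces and always in that order. If `file_name` is a path or a file opened in binary mode, the file must be UTF-8 encoded.
-* The word frequency can be omitted; a computed frequency that guarantees the word can be segmented out is used
+* When the word frequency is omitted, an automatically computed frequency that guarantees the word can be segmented out is used.
 **For example:**
 ```
 创新办 3 i
 云计算 5
 凱特琳 nz
 台中
 ```
 * Changing the `tmp_dir` and `cache_file` attributes of a tokenizer (`jieba.dt` by default) sets the folder and the file name of the cache file, respectively, which helps on restricted file systems.
 * Example: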
@ -128,12 +139,18 @@ print(", ".join(seg_list))
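 * "Enhancing ambiguity correction with a user-defined dictionary" --- https://github.com/fxsjy/jieba/issues/14
-3) : Keyword extraction
+3. Keyword extraction
 -------------
-* jieba.analyse.extract_tags(sentence,topK,withWeight) # requires `import jieba.analyse` first
+### Keyword extraction based on the TF-IDF algorithm
 `import jieba.analyse`
 * jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
  * sentence is the text to extract keywords from
  * topK is the number of keywords with the highest TF/IDF weights to return; the default is 20
  * withWeight controls whether the keyword weights are returned as well; the default is False
  * allowPOS keeps only words with the given POS tags; the default is empty, meaning no filtering
 * jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path is the IDF frequency file
 Code example (keyword extraction)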
@ -155,44 +172,38 @@ https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
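 * Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_with_weight.py
-#### Keyword extraction based on the TextRank algorithm
+### Keyword extraction based on the TextRank algorithm
 * jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) can be used directly; the interface is the same, but note that it filters by POS by default.
 * jieba.analyse.TextRank() creates a custom TextRank instance
 Paper: [TextRank: Bringing Order into Texts](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)
-##### Basic idea:
+#### Basic idea:
 1. Segment the text from which keywords are to be extracted
-2. Build a graph from the co-occurrence relations between words within a fixed window size (I chose 5; it can be adjusted)
+2. Build a graph from the co-occurrence relations between words within a fixed window size (5 by default; adjust it through the span attribute)
 3. Compute the PageRank of the nodes in the graph; note that the graph is undirected and weighted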
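The three steps above can be sketched in plain Python. This is a minimal illustration and not jieba's own implementation: the function name `textrank_sketch`, the damping factor of 0.85, the fixed 10 iterations, and the pre-segmented input list are all assumptions made for the example.

```python
from collections import defaultdict


def textrank_sketch(words, span=5, damping=0.85, iterations=10):
    """Rank words by PageRank over an undirected, weighted co-occurrence graph."""
    # Step 2: count co-occurrences inside a sliding window of `span` words.
    weights = defaultdict(float)
    for i, left in enumerate(words):
        for right in words[i + 1:i + span]:
            if left != right:
                weights[(left, right)] += 1.0

    # Undirected graph: store every edge in both directions.
    graph = defaultdict(dict)
    for (left, right), weight in weights.items():
        graph[left][right] = graph[left].get(right, 0.0) + weight
        graph[right][left] = graph[right].get(left, 0.0) + weight

    # Step 3: run the weighted PageRank update a fixed number of times.
    scores = {node: 1.0 / len(graph) for node in graph}
    out_sum = {node: sum(edges.values()) for node, edges in graph.items()}
    for _ in range(iterations):
        for node in sorted(graph):
            incoming = sum(
                weight / out_sum[neighbor] * scores[neighbor]
                for neighbor, weight in graph[node].items()
            )
            scores[node] = (1 - damping) + damping * incoming
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Step 1 is assumed to be done already: `tokens` stands in for segmented text.
tokens = ["城市", "商业", "收入", "城市", "置业", "收入", "城市", "增资"]
ranking = textrank_sketch(tokens)
for word, score in ranking:
    print(word, round(score, 3))
```

Words that co-occur with many neighbors inside the window accumulate the highest scores, which is why the most connected word in the toy input ranks first.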
-##### Basic usage:
+#### Usage example:
 jieba.analyse.textrank(raw_text)
-##### Example result:
+See [test/demo.py](https://github.com/fxsjy/jieba/blob/master/test/demo.py)
 Example result from `__main__`:
-```
+4. POS tagging
 吉林 1.0
 欧亚 0.864834432786
 置业 0.553465925497
 实现 0.520660869531
 收入 0.379699688954
 增资 0.355086023683
 子公司 0.349758490263
 全资 0.308537396283
 城市 0.306103738053
 商业 0.304837414946
 ```
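 4) : POS tagging
 -----------
-* Tags the POS of each word after sentence segmentation, using labels compatible with ictclas
+* `jieba.posseg.POSTokenizer(tokenizer=None)` creates a custom tokenizer; the `tokenizer` parameter specifies the `jieba.Tokenizer` used internally. `jieba.posseg.dt` is the default POS tagging tokenizer.
 * Tags the POS of each word after sentence segmentation, using labels compatible with ictclas.
 * Besides jieba's default segmentation mode, POS tagging is also available in paddle mode. paddle mode is loaded lazily: enable_paddle() installs paddlepaddle-tiny and imports the related code;
 * Usage example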
 ```pycon
 >>> import jieba
 >>> import jieba.posseg as pseg
->>> words = pseg.cut("我爱北京天安门")
+>>> words = pseg.cut("我爱北京天安门") # jieba default mode
->>> for w in words:
+>>> jieba.enable_paddle() # enable paddle mode; supported since version 0.40, not in earlier versions
-...    print('%s %s' % (w.word, w.flag))
+>>> words = pseg.cut("我爱北京天安门",use_paddle=True) # paddle mode
 >>> for word, flag in words:
 ...    print('%s %s' % (word, flag))
 ...
 我 r
 爱 v
@ -200,10 +211,25 @@ jieba.analyse.textrank(raw_text)
 天安门 ns
 ```
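-5) : Parallel segmentation
+The POS tags used in paddle mode are listed below:
 The table below lists the POS tags and proper-noun category tags used in paddle mode: 24 POS tags (lowercase) and 4 proper-noun category tags (uppercase).
 | Tag  | Meaning           | Tag  | Meaning              | Tag  | Meaning             | Tag  | Meaning      |
 | ---- | ----------------- | ---- | -------------------- | ---- | ------------------- | ---- | ------------ |
 | n    | common noun       | f    | locative noun        | s    | place noun          | t    | time         |
 | nr   | person name       | ns   | place name           | nt   | organization name   | nw   | work title   |
 | nz   | other proper noun | v    | common verb          | vd   | adverbial verb      | vn   | nominal verb |
 | a    | adjective         | ad   | adverbial adjective  | an   | nominal adjective   | d    | adverb       |
 | m    | numeral           | q    | measure word         | r    | pronoun             | p    | preposition  |
 | c    | conjunction       | u    | particle             | xc   | other function word | w    | punctuation  |
 | PER  | person name       | LOC  | place name           | ORG  | organization name   | TIME | time         |
 5. Parallel segmentation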
 -----------
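-* Principle: split the target text by line, distribute the lines to multiple python processes for parallel segmentation, and then merge the results, which gives a considerable speedup
+* Principle: split the target text by line, distribute the lines to multiple Python processes for parallel segmentation, and then merge the results, which gives a considerable speedup
-* Based on python's built-in multiprocessing module; windows is not supported yet
+* Based on python's built-in multiprocessing module; Windows is not supported yet
 * Usage:
    * `jieba.enable_parallel(4)` # enable parallel segmentation; the argument is the number of parallel processes
    * `jieba.disable_parallel()` # disable parallel segmentation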
@ -212,8 +238,9 @@ jieba.analyse.textrank(raw_text)
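 * Benchmark: on a 4-core 3.4GHz Linux machine, accurate-mode segmentation of the complete works of Jin Yong reached 1MB/s, which is 3.3 times the speed of the single-process version.
 * **Note**: parallel segmentation supports only the default tokenizers `jieba.dt` and `jieba.posseg.dt`.
-6) : Tokenize: return the start position of words in the original text
+6. Tokenize: return the start and end positions of words in the original text
 ----------------------------------
 * Note that the input only accepts unicode
 * Default mode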
@ -250,15 +277,15 @@ word 有限公司            start: 6                end:10
 ```
-7) : ChineseAnalyzer for the Whoosh search engine
+7. ChineseAnalyzer for the Whoosh search engine
 --------------------------------------------
 * Import: `from jieba.analyse import ChineseAnalyzer`
 * Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
-8) : Command-line segmentation
+8. Command-line segmentation
 -------------------
-Usage example: `cat news.txt | python -m jieba > cut_result.txt`
+Usage example: `python -m jieba news.txt > cut_result.txt`
 Command-line options (translated):
@ -274,10 +301,13 @@ word 有限公司            start: 6                end:10
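      -d [DELIM], --delimiter [DELIM]
                            use DELIM to separate words instead of the default ' / '.
                            If DELIM is not given, a single space is used.
      -p [DELIM], --pos [DELIM]
                            enable POS tagging; if DELIM is given, it separates each
                            word from its POS tag, otherwise _ is used
      -D DICT, --dict DICT  use DICT instead of the default dictionary
      -u USER_DICT, --user-dict USER_DICT
                            use USER_DICT as an additional dictionary, together with the default or custom dictionary
-      -a, --cut-all         full mode segmentation
+      -a, --cut-all         full mode segmentation (POS tagging is not supported)
      -n, --no-hmm          do not use the Hidden Markov Model
      -q, --quiet           do not print loading messages to STDERR
      -V, --version         show version information and exit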
@ -287,8 +317,6 @@ word 有限公司            start: 6                end:10
 Output of the `--help` option:
    $> python -m jieba --help
    usage: python -m jieba [options] filename
    Jieba command line interface.
    positional arguments:
@ -299,21 +327,24 @@ word 有限公司            start: 6                end:10
      -d [DELIM], --delimiter [DELIM]
                            use DELIM instead of ' / ' for word delimiter; or a
                            space if it is used without DELIM
      -p [DELIM], --pos [DELIM]
                            enable POS tagging; if DELIM is specified, use DELIM
                            instead of '_' for POS delimiter
      -D DICT, --dict DICT  use DICT as dictionary
      -u USER_DICT, --user-dict USER_DICT
                            use USER_DICT together with the default dictionary or
                            DICT (if specified)
-      -a, --cut-all         full pattern cutting
+      -a, --cut-all         full pattern cutting (ignored with POS tagging)
      -n, --no-hmm          don't use the Hidden Markov Model
      -q, --quiet           don't print loading messages to stderr
      -V, --version         show program's version number and exit
    If no filename specified, use STDIN instead.
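-Change in the module initialization mechanism: lazy load (since version 0.28)
+Lazy loading
-------------------------------------------
+------------
-jieba uses lazy loading: "import jieba" does not load the dictionary immediately; the dictionary is loaded and the prefix dictionary is built only when needed. You can also initialize jieba manually if you want to.
+jieba uses lazy loading: `import jieba` and `jieba.Tokenizer()` do not load the dictionary immediately; the dictionary is loaded and the prefix dictionary is built only when needed. You can also initialize jieba manually if you want to.
    import jieba
    jieba.initialize()  # manual initialization (optional)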
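The lazy-loading pattern described here can be illustrated with a small stand-in class. This is only a sketch: `LazyTokenizer`, its `freq` method, and the frequency values are hypothetical, while the `initialize` / `check_initialized` split and the lock mirror the shape of the `Tokenizer` code shown later in this diff.

```python
import threading


class LazyTokenizer(object):
    """Hypothetical stand-in that builds its dictionary on first use."""

    def __init__(self, entries):
        self._entries = entries      # source data; nothing is built yet
        self.lock = threading.RLock()
        self.initialized = False
        self.FREQ = {}

    def initialize(self):
        with self.lock:
            if self.initialized:     # another caller already did the work
                return
            self.FREQ = dict(self._entries)
            self.initialized = True

    def check_initialized(self):
        if not self.initialized:
            self.initialize()

    def freq(self, word):
        self.check_initialized()     # the first lookup triggers the load
        return self.FREQ.get(word, 0)


tokenizer = LazyTokenizer([("北京", 34488), ("清华大学", 3)])  # made-up frequencies
before = tokenizer.initialized
count = tokenizer.freq("北京")
after = tokenizer.initialized
print(before, count, after)
```

Constructing the object is cheap; the cost of building the dictionary is paid once, on the first call that needs it, and the lock keeps concurrent first calls from building it twice.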
@ -348,6 +379,11 @@ https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
 Author: yanyiwu
 URL: https://github.com/yanyiwu/cppjieba
 Jieba Rust version
 ----------------
 Author: messense, MnO2
 URL: https://github.com/messense/jieba-rs
 Jieba Node.js version
 ----------------
 Author: yanyiwu
@ -368,6 +404,33 @@ https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
 Author: yanyiwu
 URL: https://github.com/yanyiwu/iosjieba
 Jieba PHP version
 ----------------
 Author: fukuball
 URL: https://github.com/fukuball/jieba-php
 Jieba .NET(C#) version
 ----------------
 Author: anderscui
 URL: https://github.com/anderscui/jieba.NET/
 Jieba Go version
 ----------------
 + Author: wangbin URL: https://github.com/wangbin/jiebago
 + Author: yanyiwu URL: https://github.com/yanyiwu/gojieba
 Jieba Android version
 ------------------
 + Author: Dongliang.W  URL: https://github.com/452896915/jieba-android
 Related links
 =========
 * https://github.com/baidu/lac   Baidu Chinese lexical analysis system (segmentation + POS tagging + proper nouns)
 * https://github.com/baidu/AnyQ  Baidu FAQ automatic question answering system
 * https://github.com/baidu/Senta Baidu sentiment analysis system
 System integration
 ========
 1. Solr: https://github.com/sing1ee/jieba-solr
@ -455,12 +518,15 @@ Algorithm
 Main Functions
 ==============
-1) : Cut
+1. Cut
 --------
 * The `jieba.cut` function accepts three input parameters: the first parameter is the string to be cut; the second parameter is `cut_all`, controlling the cut mode; the third parameter is to control whether to use the Hidden Markov Model.
 * `jieba.cut_for_search` accepts two parameters: the string to be cut; whether to use the Hidden Markov Model. This will cut the sentence into short words suitable for search engines.
 * The input string can be a unicode/str object, or a str/bytes object which is encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
-* `jieba.cut` and `jieba.cut_for_search` returns an generator, from which you can use a `for` loop to get the segmentation result (in unicode), or `list(jieba.cut( ... ))` to create a list.
+* `jieba.cut` and `jieba.cut_for_search` return a generator, from which you can use a `for` loop to get the segmentation result (in unicode).
 * `jieba.lcut` and `jieba.lcut_for_search` returns a list.
 * `jieba.Tokenizer(dictionary=DEFAULT_DICT)` creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. `jieba.dt` is the default Tokenizer, to which almost all global functions are mapped.
 **Code example: segmentation**
@ -492,15 +558,29 @@ Output:
    [Search Engine Mode]： 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
-2) : Add a custom dictionary
+2. Add a custom dictionary
 ----------------------------
-###　Load dictionary
+### Load dictionary
-* Developers can specify their own custom dictionary to be included in the jieba default dictionary. Jieba is able to identify new words, but adding your own new words can ensure a higher accuracy.
+* Developers can specify their own custom dictionary to be included in the jieba default dictionary. Jieba is able to identify new words, but adding your own new words can ensure higher accuracy.
-* Usage： `jieba.load_userdict(file_name) # file_name is the path of the custom dictionary`
+* Usage： `jieba.load_userdict(file_name)` # file_name is a file-like object or the path of the custom dictionary
-* The dictionary format is the same as that of `analyse/idf.txt`: one word per line; each line is divided into two parts, the first is the word itself, the other is the word frequency, separated by a space
+* The dictionary format is the same as that of `dict.txt`: one word per line; each line is divided into three parts separated by a space: word, word frequency, POS tag. If `file_name` is a path or a file opened in binary mode, the dictionary must be UTF-8 encoded.
-* Example：
+* The word frequency and the POS tag can each be omitted. When the word frequency is omitted, a suitable value is filled in automatically.
 **For example:**
 ```
 创新办 3 i
 云计算 5
 凱特琳 nz
 台中
 ```
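As an illustration of the format, the example lines above can be parsed with the `re_userdict` pattern that appears in the `jieba/init.py` part of this diff. The helper `parse_userdict_line` is a hypothetical name introduced for this sketch, and it assumes non-empty input lines; it is not how `load_userdict` itself is written.

```python
import re

# Pattern copied from this diff: word, optional frequency, optional POS tag.
re_userdict = re.compile(r'^(.+?)( [0-9]+)?( [a-z]+)?$', re.U)


def parse_userdict_line(line):
    """Split one dictionary line into (word, freq, tag); omitted parts are None."""
    word, freq, tag = re_userdict.match(line.strip()).groups()
    return word, int(freq) if freq else None, tag.strip() if tag else None


sample = ["创新办 3 i", "云计算 5", "凱特琳 nz", "台中"]
entries = [parse_userdict_line(line) for line in sample]
for entry in entries:
    print(entry)
```

Because the frequency group only matches digits and the tag group only matches lowercase letters, a line such as `凱特琳 nz` is read as a word plus a tag with no frequency.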
 * Change a Tokenizer's `tmp_dir` and `cache_file` to specify the path of the cache file, for use on a restricted file system.
 * Example:
        云计算 5
        李小福 2
@ -535,12 +615,16 @@ Example:
 「/台中/」/正确/应该/不会/被/切开
 ```
-3) : Keyword Extraction
+3. Keyword Extraction
 -----------------------
-* `jieba.analyse.extract_tags(sentence,topK,withWeight) # needs to first import jieba.analyse`
+`import jieba.analyse`
 * `jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())`
  * `sentence`: the text to extract keywords from
  * `topK`: the number of keywords with the highest TF/IDF weights to return. The default value is 20
  * `withWeight`: whether to return the TF/IDF weights together with the keywords. The default value is False
  * `allowPOS`: keep only words with the given POS tags. Empty for no filtering.
 * `jieba.analyse.TFIDF(idf_path=None)` creates a new TFIDF instance, `idf_path` specifies IDF file path.
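To show what a TF/IDF weight means for these parameters, here is a minimal sketch that works on an already segmented word list. It is not jieba's implementation: `extract_tags_sketch`, the toy IDF table, and the median-IDF fallback for unknown words are assumptions made for illustration.

```python
from collections import Counter


def extract_tags_sketch(words, idf, topK=20, withWeight=False):
    """Rank words by term frequency times inverse document frequency."""
    # Words missing from the IDF table fall back to the median IDF
    # (an assumption of this sketch).
    median_idf = sorted(idf.values())[len(idf) // 2]
    counts = Counter(words)
    total = sum(counts.values())
    weights = {
        word: count / total * idf.get(word, median_idf)
        for word, count in counts.items()
    }
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    if withWeight:
        return ranked[:topK]
    return [word for word, _ in ranked[:topK]]


toy_idf = {"的": 0.5, "分词": 6.0, "中文": 4.0}  # hypothetical IDF values
tokens = ["中文", "分词", "的", "分词", "的", "的"]
top = extract_tags_sketch(tokens, toy_idf, topK=2)
print(top)
```

The most frequent token in the toy input is a function word with a low IDF, so it drops out of the top two even though it appears three times.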
 Example (keyword extraction)
@ -560,10 +644,15 @@ Developers can specify their own custom stop words corpus in jieba keyword extra
 There's also a [TextRank](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) implementation available.
-Use: `jieba.analyse.textrank(raw_text)`.
+Use: `jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))`
-4) : Part of Speech Tagging
+Note that it filters POS by default.
-----------
+
 `jieba.analyse.TextRank()` creates a new TextRank instance.
 4. Part of Speech Tagging
 -------------------------
 * `jieba.posseg.POSTokenizer(tokenizer=None)` creates a new customized Tokenizer. `tokenizer` specifies the jieba.Tokenizer to internally use. `jieba.posseg.dt` is the default POSTokenizer.
 * Tags the POS of each word after segmentation, using labels compatible with ictclas.
 * Example:
@ -579,8 +668,8 @@ Use: `jieba.analyse.textrank(raw_text)`.
 天安门 ns
 ```
-5) : Parallel Processing
+5. Parallel Processing
-----------
+----------------------
 * Principle: Split target text by line, assign the lines into multiple Python processes, and then merge the results, which is considerably faster.
 * Based on the multiprocessing module of Python.
 * Usage:
@ -592,8 +681,10 @@ Use: `jieba.analyse.textrank(raw_text)`.
 * Result: On a four-core 3.4GHz Linux machine, accurate-mode segmentation of the Complete Works of Jin Yong reaches 1MB/s, which is 3.3 times as fast as the single-process version.
-6) : Tokenize: return words with position
+* **Note** that parallel processing supports only default tokenizers, `jieba.dt` and `jieba.posseg.dt`.
----------------------------------
+
 6. Tokenize: return words with position
 ----------------------------------------
 * The input must be unicode
 * Default mode
@ -629,17 +720,15 @@ word 有限公司            start: 6                end:10
 ```
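The positions in the output above follow directly from the word lengths. A minimal sketch, assuming the sentence has already been segmented into consecutive words (the `segments` list and the helper name `tokenize_sketch` are illustrative, not jieba's API):

```python
def tokenize_sketch(words):
    """Yield (word, start, end) for consecutive words; `end` is exclusive."""
    start = 0
    for word in words:
        end = start + len(word)
        yield word, start, end
        start = end


segments = ["永和", "服装", "饰品", "有限公司"]  # assumed segmentation result
spans = list(tokenize_sketch(segments))
for word, start, end in spans:
    print("word %s\t\t start: %d \t\t end:%d" % (word, start, end))
```

Offsets are counted in characters of the unicode string, which is one reason the input has to be unicode rather than encoded bytes.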
-7) : ChineseAnalyzer for Whoosh
+7. ChineseAnalyzer for Whoosh
--------------------------------------------
+-------------------------------
 * `from jieba.analyse import ChineseAnalyzer`
 * Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
-8) : Command Line Interface
+8. Command Line Interface
-------------------
+--------------------------------
    $> python -m jieba --help
    usage: python -m jieba [options] filename
    Jieba command line interface.
    positional arguments:
@ -650,11 +739,14 @@ word 有限公司            start: 6                end:10
      -d [DELIM], --delimiter [DELIM]
                            use DELIM instead of ' / ' for word delimiter; or a
                            space if it is used without DELIM
      -p [DELIM], --pos [DELIM]
                            enable POS tagging; if DELIM is specified, use DELIM
                            instead of '_' for POS delimiter
      -D DICT, --dict DICT  use DICT as dictionary
      -u USER_DICT, --user-dict USER_DICT
                            use USER_DICT together with the default dictionary or
                            DICT (if specified)
-      -a, --cut-all         full pattern cutting
+      -a, --cut-all         full pattern cutting (ignored with POS tagging)
      -n, --no-hmm          don't use the Hidden Markov Model
      -q, --quiet           don't print loading messages to stderr
      -V, --version         show program's version number and exit
@ -674,7 +766,8 @@ You can also specify the dictionary (not supported before version 0.28) :
 Using Other Dictionaries
-========
+===========================
 It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:
 1. A smaller dictionary for a smaller memory footprint:
--- a/jieba/init.py
+++ b/jieba/init.py
@ -1,52 +1,81 @@
 from __future__ import absolute_import, unicode_literals
-__version__ = '0.36'
+
 __version__ = '0.42.1'
 __license__ = 'MIT'
 import re
 import os
 import sys
 import time
 import tempfile
 import marshal
-from math import log
+import re
 import tempfile
 import threading
-from functools import wraps
+import time
 import logging
 from hashlib import md5
-from ._compat import *
+from math import log
 from . import finalseg
 from ._compat import *
-DICTIONARY = "dict.txt"
+if os.name == 'nt':
-DICT_LOCK = threading.RLock()
+    from shutil import move as _replace_file
-FREQ = {}  # to be initialized
+else:
-total = 0
+    _replace_file = os.rename
 user_word_tag_tab = {}
 initialized = False
 pool = None
 tmp_dir = None
-_curpath = os.path.normpath(
+_get_abs_path = lambda path: os.path.normpath(os.path.join(os.getcwd(), path))
-    os.path.join(os.getcwd(), os.path.dirname(__file__)))
+
 DEFAULT_DICT = None
 DEFAULT_DICT_NAME = "dict.txt"
 log_console = logging.StreamHandler(sys.stderr)
-logger = logging.getLogger(__name__)
+default_logger = logging.getLogger(__name__)
-logger.setLevel(logging.DEBUG)
+default_logger.setLevel(logging.DEBUG)
-logger.addHandler(log_console)
+default_logger.addHandler(log_console)
 DICT_WRITING = {}
 pool = None
 re_userdict = re.compile('^(.+?)( [0-9]+)?( [a-z]+)?$', re.U)
 re_eng = re.compile('[a-zA-Z0-9]', re.U)
 # \u4E00-\u9FD5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
 # \r\n|\s : whitespace characters. Will not be handled.
 # re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
 # Adding "-" symbol in re_han_default
 re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)
 re_skip_default = re.compile("(\r\n|\s)", re.U)
 def setLogLevel(log_level):
-    global logger
+    default_logger.setLevel(log_level)
    logger.setLevel(log_level)
-def gen_pfdict(f_name):
+class Tokenizer(object):
    def __init__(self, dictionary=DEFAULT_DICT):
        self.lock = threading.RLock()
        if dictionary == DEFAULT_DICT:
            self.dictionary = dictionary
        else:
            self.dictionary = _get_abs_path(dictionary)
        self.FREQ = {}
        self.total = 0
        self.user_word_tag_tab = {}
        self.initialized = False
        self.tmp_dir = None
        self.cache_file = None
    def __repr__(self):
        return '<Tokenizer dictionary=%r>' % self.dictionary
    @staticmethod
    def gen_pfdict(f):
        lfreq = {}
        ltotal = 0
-    with open(f_name, 'rb') as f:
+        f_name = resolve_filename(f)
-        lineno = 0
+        for lineno, line in enumerate(f, 1):
        for line in f.read().rstrip().decode('utf-8').splitlines():
            lineno += 1
            try:
                line = line.strip().decode('utf-8')
                word, freq = line.split(' ')[:2]
                freq = int(freq)
                lfreq[word] = freq
@ -55,109 +84,109 @@ def gen_pfdict(f_name):
                    wfrag = word[:ch + 1]
                    if wfrag not in lfreq:
                        lfreq[wfrag] = 0
-            except ValueError as e:
+            except ValueError:
-                logger.debug('%s at line %s %s' % (f_name, lineno, line))
+                raise ValueError(
-                raise e
+                    'invalid dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
        f.close()
        return lfreq, ltotal
    def initialize(self, dictionary=None):
        if dictionary:
            abs_path = _get_abs_path(dictionary)
            if self.dictionary == abs_path and self.initialized:
                return
            else:
                self.dictionary = abs_path
                self.initialized = False
        else:
            abs_path = self.dictionary
-def initialize(dictionary=None):
+        with self.lock:
-    global FREQ, total, initialized, DICTIONARY, DICT_LOCK, tmp_dir
+            try:
-    if not dictionary:
+                with DICT_WRITING[abs_path]:
-        dictionary = DICTIONARY
+                    pass
-    with DICT_LOCK:
+            except KeyError:
-        if initialized:
+                pass
            if self.initialized:
                return
-        abs_path = os.path.join(_curpath, dictionary)
+            default_logger.debug("Building prefix dict from %s ..." % (abs_path or 'the default dictionary'))
        logger.debug("Building prefix dict from %s ..." % abs_path)
            t1 = time.time()
            if self.cache_file:
                cache_file = self.cache_file
            # default dictionary
-        if abs_path == os.path.join(_curpath, "dict.txt"):
+            elif abs_path == DEFAULT_DICT:
-            cache_file = os.path.join(tmp_dir if tmp_dir else tempfile.gettempdir(),"jieba.cache")
+                cache_file = "jieba.cache"
-        else:  # custom dictionary
+            # custom dictionary
-            cache_file = os.path.join(tmp_dir if tmp_dir else tempfile.gettempdir(),"jieba.u%s.cache" % md5(
+            else:
-                abs_path.encode('utf-8', 'replace')).hexdigest())
+                cache_file = "jieba.u%s.cache" % md5(
                    abs_path.encode('utf-8', 'replace')).hexdigest()
            cache_file = os.path.join(
                self.tmp_dir or tempfile.gettempdir(), cache_file)
            # prevent absolute path in self.cache_file
            tmpdir = os.path.dirname(cache_file)
            load_from_cache_fail = True
-        if os.path.isfile(cache_file) and os.path.getmtime(cache_file) > os.path.getmtime(abs_path):
+            if os.path.isfile(cache_file) and (abs_path == DEFAULT_DICT or
-            logger.debug("Loading model from cache %s" % cache_file)
+                                               os.path.getmtime(cache_file) > os.path.getmtime(abs_path)):
                default_logger.debug(
                    "Loading model from cache %s" % cache_file)
                try:
                    with open(cache_file, 'rb') as cf:
-                    FREQ, total = marshal.load(cf)
+                        self.FREQ, self.total = marshal.load(cf)
                    load_from_cache_fail = False
                except Exception:
                    load_from_cache_fail = True
            if load_from_cache_fail:
-            FREQ, total = gen_pfdict(abs_path)
+                wlock = DICT_WRITING.get(abs_path, threading.RLock())
-            logger.debug("Dumping model to file cache %s" % cache_file)
+                DICT_WRITING[abs_path] = wlock
                with wlock:
                    self.FREQ, self.total = self.gen_pfdict(self.get_dict_file())
                    default_logger.debug(
                        "Dumping model to file cache %s" % cache_file)
                    try:
-                fd, fpath = tempfile.mkstemp()
+                        # prevent moving across different filesystems
                        fd, fpath = tempfile.mkstemp(dir=tmpdir)
                        with os.fdopen(fd, 'wb') as temp_cache_file:
-                    marshal.dump((FREQ, total), temp_cache_file)
+                            marshal.dump(
-                if os.name == 'nt':
+                                (self.FREQ, self.total), temp_cache_file)
-                    from shutil import move as replace_file
+                        _replace_file(fpath, cache_file)
                else:
                    replace_file = os.rename
                replace_file(fpath, cache_file)
                    except Exception:
-                logger.exception("Dump cache file failed.")
+                        default_logger.exception("Dump cache file failed.")
-        initialized = True
+                try:
                    del DICT_WRITING[abs_path]
                except KeyError:
                    pass
-        logger.debug("Loading model cost %s seconds." % (time.time() - t1))
+            self.initialized = True
-        logger.debug("Prefix dict has been built succesfully.")
+            default_logger.debug(
                "Loading model cost %.3f seconds." % (time.time() - t1))
            default_logger.debug("Prefix dict has been built successfully.")
    def check_initialized(self):
        if not self.initialized:
            self.initialize()
-def require_initialized(fn):
+    def calc(self, sentence, DAG, route):
    @wraps(fn)
    def wrapped(*args, **kwargs):
        global initialized
        if initialized:
            return fn(*args, **kwargs)
        else:
            initialize(DICTIONARY)
            return fn(*args, **kwargs)
    return wrapped
 def __cut_all(sentence):
    dag = get_DAG(sentence)
    old_j = -1
    for k, L in iteritems(dag):
        if len(L) == 1 and k > old_j:
            yield sentence[k:L[0] + 1]
            old_j = L[0]
        else:
            for j in L:
                if j > k:
                    yield sentence[k:j + 1]
                    old_j = j
 def calc(sentence, DAG, route):
        N = len(sentence)
        route[N] = (0, 0)
-    logtotal = log(total)
+        logtotal = log(self.total)
        for idx in xrange(N - 1, -1, -1):
-        route[idx] = max((log(FREQ.get(sentence[idx:x + 1]) or 1) -
+            route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
                              logtotal + route[x + 1][0], x) for x in DAG[idx])
-
+    def get_DAG(self, sentence):
-@require_initialized
+        self.check_initialized()
 def get_DAG(sentence):
    global FREQ
        DAG = {}
        N = len(sentence)
        for k in xrange(N):
            tmplist = []
            i = k
            frag = sentence[k]
-        while i < N and frag in FREQ:
+            while i < N and frag in self.FREQ:
-            if FREQ[frag]:
+                if self.FREQ[frag]:
                    tmplist.append(i)
                i += 1
                frag = sentence[k:i + 1]
@ -166,13 +195,38 @@ def get_DAG(sentence):
            DAG[k] = tmplist
        return DAG
-re_eng = re.compile('[a-zA-Z0-9]', re.U)
+    def __cut_all(self, sentence):
        dag = self.get_DAG(sentence)
        old_j = -1
        eng_scan = 0
        eng_buf = u''
        for k, L in iteritems(dag):
            if eng_scan == 1 and not re_eng.match(sentence[k]):
                eng_scan = 0
                yield eng_buf
            if len(L) == 1 and k > old_j:
                word = sentence[k:L[0] + 1]
                if re_eng.match(word):
                    if eng_scan == 0:
                        eng_scan = 1
                        eng_buf = word
                    else:
                        eng_buf += word
                if eng_scan == 0:
                    yield word
                old_j = L[0]
            else:
                for j in L:
                    if j > k:
                        yield sentence[k:j + 1]
                        old_j = j
        if eng_scan == 1:
            yield eng_buf
-
+    def __cut_DAG_NO_HMM(self, sentence):
-def __cut_DAG_NO_HMM(sentence):
+        DAG = self.get_DAG(sentence)
    DAG = get_DAG(sentence)
        route = {}
-    calc(sentence, DAG, route)
+        self.calc(sentence, DAG, route)
        x = 0
        N = len(sentence)
        buf = ''
@ -192,11 +246,10 @@ def __cut_DAG_NO_HMM(sentence):
            yield buf
            buf = ''
-
+    def __cut_DAG(self, sentence):
-def __cut_DAG(sentence):
+        DAG = self.get_DAG(sentence)
    DAG = get_DAG(sentence)
        route = {}
-    calc(sentence, DAG, route=route)
+        self.calc(sentence, DAG, route)
        x = 0
        buf = ''
        N = len(sentence)
@ -211,7 +264,7 @@ def __cut_DAG(sentence):
                        yield buf
                        buf = ''
                    else:
-                    if not FREQ.get(buf):
+                        if not self.FREQ.get(buf):
                            recognized = finalseg.cut(buf)
                            for t in recognized:
                                yield t
@ -225,7 +278,7 @@ def __cut_DAG(sentence):
        if buf:
            if len(buf) == 1:
                yield buf
-        elif not FREQ.get(buf):
+            elif not self.FREQ.get(buf):
                recognized = finalseg.cut(buf)
                for t in recognized:
                    yield t
@ -233,40 +286,38 @@ def __cut_DAG(sentence):
                for elem in buf:
                    yield elem
-re_han_default = re.compile("([\u4E00-\u9FA5a-zA-Z0-9+#&\._]+)", re.U)
-re_skip_default = re.compile("(\r\n|\s)", re.U)
-re_han_cut_all = re.compile("([\u4E00-\u9FA5]+)", re.U)
-re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)
-def cut(sentence, cut_all=False, HMM=True):
-    '''
+    def cut(self, sentence, cut_all=False, HMM=True, use_paddle=False):
+        """
        The main function that segments an entire sentence that contains
-    Chinese characters into seperated words.
+        Chinese characters into separated words.
        Parameter:
            - sentence: The str(unicode) to be segmented.
            - cut_all: Model type. True for full pattern, False for accurate pattern.
            - HMM: Whether to use the Hidden Markov Model.
-    '''
+        """
+        is_paddle_installed = check_paddle_install['is_paddle_installed']
        sentence = strdecode(sentence)
-
-    # \u4E00-\u9FA5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
-    # \r\n|\s : whitespace characters. Will not be handled.
-
-    if cut_all:
-        re_han = re_han_cut_all
-        re_skip = re_skip_cut_all
-    else:
+        if use_paddle and is_paddle_installed:
+            # if sentence is null, it will raise core exception in paddle.
+            if sentence is None or len(sentence) == 0:
+                return
+            import jieba.lac_small.predict as predict
+            results = predict.get_sent(sentence)
+            for sent in results:
+                if sent is None:
+                    continue
+                yield sent
+            return
        re_han = re_han_default
        re_skip = re_skip_default
-    blocks = re_han.split(sentence)
        if cut_all:
-        cut_block = __cut_all
+            cut_block = self.__cut_all
        elif HMM:
-        cut_block = __cut_DAG
+            cut_block = self.__cut_DAG
        else:
-        cut_block = __cut_DAG_NO_HMM
+            cut_block = self.__cut_DAG_NO_HMM
+        blocks = re_han.split(sentence)
        for blk in blocks:
            if not blk:
                continue
@ -284,33 +335,56 @@ def cut(sentence, cut_all=False, HMM=True):
                    else:
                        yield x
-
-def cut_for_search(sentence, HMM=True):
+    def cut_for_search(self, sentence, HMM=True):
        """
        Finer segmentation for search engines.
        """
-    words = cut(sentence, HMM=HMM)
+        words = self.cut(sentence, HMM=HMM)
        for w in words:
            if len(w) > 2:
                for i in xrange(len(w) - 1):
                    gram2 = w[i:i + 2]
-                if FREQ.get(gram2):
+                    if self.FREQ.get(gram2):
                        yield gram2
            if len(w) > 3:
                for i in xrange(len(w) - 2):
                    gram3 = w[i:i + 3]
-                if FREQ.get(gram3):
+                    if self.FREQ.get(gram3):
                        yield gram3
            yield w
    def lcut(self, *args, **kwargs):
        return list(self.cut(*args, **kwargs))
-@require_initialized
+    def lcut_for_search(self, *args, **kwargs):
-def load_userdict(f):
+        return list(self.cut_for_search(*args, **kwargs))
    _lcut = lcut
    _lcut_for_search = lcut_for_search
    def _lcut_no_hmm(self, sentence):
        return self.lcut(sentence, False, False)
    def _lcut_all(self, sentence):
        return self.lcut(sentence, True)
    def _lcut_for_search_no_hmm(self, sentence):
        return self.lcut_for_search(sentence, False)
    def get_dict_file(self):
        if self.dictionary == DEFAULT_DICT:
            return get_module_res(DEFAULT_DICT_NAME)
        else:
            return open(self.dictionary, 'rb')
    def load_userdict(self, f):
        '''
        Load a personalized dictionary to improve the detection rate.
        Parameter:
            - f : A plain text file that contains words and their occurrences.
                  Can be a file-like object, or the path of the dictionary file,
                  whose encoding must be utf-8.
        Structure of dict file:
        word1 freq1 word_type1
@ -318,56 +392,57 @@ def load_userdict(f):
        ...
        Word type may be ignored
        '''
        self.check_initialized()
        if isinstance(f, string_types):
            f_name = f
            f = open(f, 'rb')
-    content = f.read().decode('utf-8').lstrip('\ufeff')
+        else:
-    line_no = 0
+            f_name = resolve_filename(f)
-    for line in content.splitlines():
+        for lineno, ln in enumerate(f, 1):
            line = ln.strip()
            if not isinstance(line, text_type):
                try:
-            line_no += 1
+                    line = line.decode('utf-8').lstrip('\ufeff')
-            line = line.strip()
+                except UnicodeDecodeError:
                    raise ValueError('dictionary file %s must be utf-8' % f_name)
            if not line:
                continue
-            tup = line.split(" ")
+            # match won't be None because there's at least one character
-            add_word(*tup)
+            word, freq, tag = re_userdict.match(line).groups()
-        except Exception as e:
+            if freq is not None:
-            logger.debug('%s at line %s %s' % (f_name, lineno, line))
+                freq = freq.strip()
-            raise e
+            if tag is not None:
                tag = tag.strip()
            self.add_word(word, freq, tag)
-
-@require_initialized
-def add_word(word, freq=None, tag=None):
+    def add_word(self, word, freq=None, tag=None):
        """
        Add a word to dictionary.
        freq and tag can be omitted; freq defaults to a calculated value
        that ensures the word can be cut out.
        """
-    global FREQ, total, user_word_tag_tab
+        self.check_initialized()
        word = strdecode(word)
-    if freq is None:
-        freq = suggest_freq(word, False)
-    else:
-        freq = int(freq)
-    FREQ[word] = freq
-    total += freq
-    if tag is not None:
-        user_word_tag_tab[word] = tag
+        freq = int(freq) if freq is not None else self.suggest_freq(word, False)
+        self.FREQ[word] = freq
+        self.total += freq
+        if tag:
+            self.user_word_tag_tab[word] = tag
        for ch in xrange(len(word)):
            wfrag = word[:ch + 1]
-        if wfrag not in FREQ:
+            if wfrag not in self.FREQ:
-            FREQ[wfrag] = 0
+                self.FREQ[wfrag] = 0
        if freq == 0:
            finalseg.add_force_split(word)
-
-def del_word(word):
+    def del_word(self, word):
        """
        Convenient function for deleting a word.
        """
-    add_word(word, 0)
+        self.add_word(word, 0)
-
-@require_initialized
-def suggest_freq(segment, tune=False):
+    def suggest_freq(self, segment, tune=False):
        """
        Suggest word frequency to force the characters in a word to be
        joined or split.
@ -380,101 +455,25 @@ def suggest_freq(segment, tune=False):
        Note that HMM may affect the final result. If the result doesn't change,
        set HMM=False.
        """
-    ftotal = float(total)
+        self.check_initialized()
        ftotal = float(self.total)
        freq = 1
        if isinstance(segment, string_types):
            word = segment
-        for seg in cut(word, HMM=False):
+            for seg in self.cut(word, HMM=False):
-            freq *= FREQ.get(seg, 1) / ftotal
+                freq *= self.FREQ.get(seg, 1) / ftotal
-        freq = max(int(freq*total) + 1, FREQ.get(word, 1))
+            freq = max(int(freq * self.total) + 1, self.FREQ.get(word, 1))
        else:
            segment = tuple(map(strdecode, segment))
            word = ''.join(segment)
            for seg in segment:
-            freq *= FREQ.get(seg, 1) / ftotal
+                freq *= self.FREQ.get(seg, 1) / ftotal
-        freq = min(int(freq*total), FREQ.get(word, 0))
+            freq = min(int(freq * self.total), self.FREQ.get(word, 0))
        if tune:
-        add_word(word, freq)
+            self.add_word(word, freq)
        return freq
-
+    def tokenize(self, unicode_sentence, mode="default", HMM=True):
 __ref_cut = cut
 __ref_cut_for_search = cut_for_search
 def __lcut(sentence):
    return list(__ref_cut(sentence, False))
 def __lcut_no_hmm(sentence):
    return list(__ref_cut(sentence, False, False))
 def __lcut_all(sentence):
    return list(__ref_cut(sentence, True))
 def __lcut_for_search(sentence):
    return list(__ref_cut_for_search(sentence))
@require_initialized
 def enable_parallel(processnum=None):
    global pool, cut, cut_for_search
    if os.name == 'nt':
        raise Exception("jieba: parallel mode only supports posix system")
    from multiprocessing import Pool, cpu_count
    if processnum is None:
        processnum = cpu_count()
    pool = Pool(processnum)
    def pcut(sentence, cut_all=False, HMM=True):
        parts = strdecode(sentence).splitlines(True)
        if cut_all:
            result = pool.map(__lcut_all, parts)
        elif HMM:
            result = pool.map(__lcut, parts)
        else:
            result = pool.map(__lcut_no_hmm, parts)
        for r in result:
            for w in r:
                yield w
    def pcut_for_search(sentence):
        parts = strdecode(sentence).splitlines(True)
        result = pool.map(__lcut_for_search, parts)
        for r in result:
            for w in r:
                yield w
    cut = pcut
    cut_for_search = pcut_for_search
 def disable_parallel():
    global pool, cut, cut_for_search
    if pool:
        pool.close()
        pool = None
    cut = __ref_cut
    cut_for_search = __ref_cut_for_search
 def set_dictionary(dictionary_path):
    global initialized, DICTIONARY
    with DICT_LOCK:
        abs_path = os.path.normpath(os.path.join(os.getcwd(), dictionary_path))
        if not os.path.isfile(abs_path):
            raise Exception("jieba: file does not exist: " + abs_path)
        DICTIONARY = abs_path
        initialized = False
 def get_abs_path_dict():
    return os.path.join(_curpath, DICTIONARY)
 def tokenize(unicode_sentence, mode="default", HMM=True):
        """
        Tokenize a sentence and yield tuples of (word, start, end)
@ -484,25 +483,137 @@ def tokenize(unicode_sentence, mode="default", HMM=True):
            - HMM: whether to use the Hidden Markov Model.
        """
        if not isinstance(unicode_sentence, text_type):
-        raise Exception("jieba: the input parameter should be unicode.")
+            raise ValueError("jieba: the input parameter should be unicode.")
        start = 0
        if mode == 'default':
-        for w in cut(unicode_sentence, HMM=HMM):
+            for w in self.cut(unicode_sentence, HMM=HMM):
                width = len(w)
                yield (w, start, start + width)
                start += width
        else:
-        for w in cut(unicode_sentence, HMM=HMM):
+            for w in self.cut(unicode_sentence, HMM=HMM):
                width = len(w)
                if len(w) > 2:
                    for i in xrange(len(w) - 1):
                        gram2 = w[i:i + 2]
-                    if FREQ.get(gram2):
+                        if self.FREQ.get(gram2):
                            yield (gram2, start + i, start + i + 2)
                if len(w) > 3:
                    for i in xrange(len(w) - 2):
                        gram3 = w[i:i + 3]
-                    if FREQ.get(gram3):
+                        if self.FREQ.get(gram3):
                            yield (gram3, start + i, start + i + 3)
                yield (w, start, start + width)
                start += width
    def set_dictionary(self, dictionary_path):
        with self.lock:
            abs_path = _get_abs_path(dictionary_path)
            if not os.path.isfile(abs_path):
                raise Exception("jieba: file does not exist: " + abs_path)
            self.dictionary = abs_path
            self.initialized = False
 # default Tokenizer instance
 dt = Tokenizer()
 # global functions
 get_FREQ = lambda k, d=None: dt.FREQ.get(k, d)
 add_word = dt.add_word
 calc = dt.calc
 cut = dt.cut
 lcut = dt.lcut
 cut_for_search = dt.cut_for_search
 lcut_for_search = dt.lcut_for_search
 del_word = dt.del_word
 get_DAG = dt.get_DAG
 get_dict_file = dt.get_dict_file
 initialize = dt.initialize
 load_userdict = dt.load_userdict
 set_dictionary = dt.set_dictionary
 suggest_freq = dt.suggest_freq
 tokenize = dt.tokenize
 user_word_tag_tab = dt.user_word_tag_tab
 def _lcut_all(s):
    return dt._lcut_all(s)
 def _lcut(s):
    return dt._lcut(s)
 def _lcut_no_hmm(s):
    return dt._lcut_no_hmm(s)
 def _lcut_for_search(s):
    return dt._lcut_for_search(s)
 def _lcut_for_search_no_hmm(s):
    return dt._lcut_for_search_no_hmm(s)
 def _pcut(sentence, cut_all=False, HMM=True):
    parts = strdecode(sentence).splitlines(True)
    if cut_all:
        result = pool.map(_lcut_all, parts)
    elif HMM:
        result = pool.map(_lcut, parts)
    else:
        result = pool.map(_lcut_no_hmm, parts)
    for r in result:
        for w in r:
            yield w
 def _pcut_for_search(sentence, HMM=True):
    parts = strdecode(sentence).splitlines(True)
    if HMM:
        result = pool.map(_lcut_for_search, parts)
    else:
        result = pool.map(_lcut_for_search_no_hmm, parts)
    for r in result:
        for w in r:
            yield w
 def enable_parallel(processnum=None):
    """
    Change the module's `cut` and `cut_for_search` functions to the
    parallel version.
    Note that this only works with dt; custom Tokenizer
    instances are not supported.
    """
    global pool, dt, cut, cut_for_search
    from multiprocessing import cpu_count
    if os.name == 'nt':
        raise NotImplementedError(
            "jieba: parallel mode only supports posix system")
    else:
        from multiprocessing import Pool
    dt.check_initialized()
    if processnum is None:
        processnum = cpu_count()
    pool = Pool(processnum)
    cut = _pcut
    cut_for_search = _pcut_for_search
 def disable_parallel():
    global pool, dt, cut, cut_for_search
    if pool:
        pool.close()
        pool = None
    cut = dt.cut
    cut_for_search = dt.cut_for_search
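
The search-mode branch of `tokenize` above (and, without the offsets, `cut_for_search`) can be sketched on its own. This is a minimal illustration rather than the library code: `tokenize_search` and `toy_freq` are hypothetical names, the word list stands in for the output of `self.cut(...)`, and the dict stands in for `self.FREQ`.

```python
def tokenize_search(words, freq):
    """Yield (token, start, end) tuples for already-segmented words.

    Words longer than 2 characters also emit every 2-gram found in `freq`;
    words longer than 3 characters also emit every 3-gram found in `freq`.
    The full word is always emitted last, then the offset advances.
    """
    start = 0
    for w in words:
        width = len(w)
        if width > 2:
            for i in range(width - 1):
                gram2 = w[i:i + 2]
                if freq.get(gram2):
                    yield (gram2, start + i, start + i + 2)
        if width > 3:
            for i in range(width - 2):
                gram3 = w[i:i + 3]
                if freq.get(gram3):
                    yield (gram3, start + i, start + i + 3)
        yield (w, start, start + width)
        start += width


# Toy frequency table (assumed values, not taken from jieba's dictionary).
toy_freq = {"中国": 10, "科学": 8, "学院": 6, "科学院": 5}
tokens = list(tokenize_search(["中国科学院", "很", "大"], toy_freq))
for token in tokens:
    print(token)
```

Sub-grams are reported with offsets relative to the whole sentence, so overlapping tokens such as `学院` and `科学院` can both be indexed by a search engine.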
--- a/jieba/main.py
+++ b/jieba/main.py
@ -8,12 +8,14 @@ parser = ArgumentParser(usage="%s -m jieba [options] filename" % sys.executable,
 parser.add_argument("-d", "--delimiter", metavar="DELIM", default=' / ',
                    nargs='?', const=' ',
                    help="use DELIM instead of ' / ' for word delimiter; or a space if it is used without DELIM")
 parser.add_argument("-p", "--pos", metavar="DELIM", nargs='?', const='_',
                    help="enable POS tagging; if DELIM is specified, use DELIM instead of '_' for POS delimiter")
 parser.add_argument("-D", "--dict", help="use DICT as dictionary")
 parser.add_argument("-u", "--user-dict",
                    help="use USER_DICT together with the default dictionary or DICT (if specified)")
 parser.add_argument("-a", "--cut-all",
                    action="store_true", dest="cutall", default=False,
-                    help="full pattern cutting")
+                    help="full pattern cutting (ignored with POS tagging)")
 parser.add_argument("-n", "--no-hmm", dest="hmm", action="store_false",
                    default=True, help="don't use the Hidden Markov Model")
 parser.add_argument("-q", "--quiet", action="store_true", default=False,
@ -26,6 +28,15 @@ args = parser.parse_args()
 if args.quiet:
    jieba.setLogLevel(60)
 if args.pos:
    import jieba.posseg
    posdelim = args.pos
    def cutfunc(sentence, _, HMM=True):
        for w, f in jieba.posseg.cut(sentence, HMM):
            yield w + posdelim + f
 else:
    cutfunc = jieba.cut
 delim = text_type(args.delimiter)
 cutall = args.cutall
 hmm = args.hmm
@ -41,7 +52,7 @@ if args.user_dict:
 ln = fp.readline()
 while ln:
    l = ln.rstrip('\r\n')
-    result = delim.join(jieba.cut(ln.rstrip('\r\n'), cutall, hmm))
+    result = delim.join(cutfunc(ln.rstrip('\r\n'), cutall, hmm))
    if PY2:
        result = result.encode(default_encoding)
    print(result)
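
The `cutfunc` adapter in this hunk gives POS tagging the same call shape as `jieba.cut`, which is why the cut-all flag is accepted but ignored. A rough sketch of that idea follows; `make_cutfunc` and `fake_tagger` are hypothetical stand-ins, and the tagger only imitates the `(sentence, HMM)` call used with `jieba.posseg.cut` above.

```python
def make_cutfunc(tagger, posdelim="_"):
    """Wrap a (word, flag) tagger so it can be called like cut(sentence, cut_all, HMM)."""
    def cutfunc(sentence, _cut_all, HMM=True):
        # The second positional argument is ignored: POS tagging has no full mode.
        for word, flag in tagger(sentence, HMM):
            yield word + posdelim + flag
    return cutfunc


def fake_tagger(sentence, HMM=True):
    """Stand-in tagger: splits on whitespace and assigns made-up flags."""
    for word in sentence.split():
        yield word, ("n" if len(word) > 1 else "x")


cutfunc = make_cutfunc(fake_tagger, "_")
result = " / ".join(cutfunc("我 爱 北京", True, True))
print(result)
```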
--- a/jieba/_compat.py
+++ b/jieba/_compat.py
@ -1,6 +1,56 @@
 # -*- coding: utf-8 -*-
 import logging
 import os
 import sys
 log_console = logging.StreamHandler(sys.stderr)
 default_logger = logging.getLogger(__name__)
 default_logger.setLevel(logging.DEBUG)
 def setLogLevel(log_level):
    default_logger.setLevel(log_level)
 check_paddle_install = {'is_paddle_installed': False}
 try:
    import pkg_resources
    get_module_res = lambda *res: pkg_resources.resource_stream(__name__,
                                                                os.path.join(*res))
 except ImportError:
    get_module_res = lambda *res: open(os.path.normpath(os.path.join(
        os.getcwd(), os.path.dirname(__file__), *res)), 'rb')
 def enable_paddle():
    try:
        import paddle
    except ImportError:
        default_logger.debug("Installing paddle-tiny, please wait a minute......")
        os.system("pip install paddlepaddle-tiny")
        try:
            import paddle
        except ImportError:
            default_logger.debug(
                "Import paddle error, please use command to install: pip install paddlepaddle-tiny==1.6.1."
                "Now, back to jieba basic cut......")
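    # NOTE: this compares version strings lexicographically, so '1.10.0' < '1.6.1' is True;
    # `paddle` is also unbound here if both imports above failed.
    if paddle.__version__ < '1.6.1':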
        default_logger.debug("Find your own paddle version doesn't satisfy the minimum requirement (1.6.1), "
                             "please install paddle tiny by 'pip install --upgrade paddlepaddle-tiny', "
                             "or upgrade paddle full version by "
                             "'pip install --upgrade paddlepaddle (-gpu for GPU version)' ")
    else:
        try:
            import jieba.lac_small.predict as predict
            default_logger.debug("Paddle enabled successfully......")
            check_paddle_install['is_paddle_installed'] = True
        except ImportError:
            default_logger.debug("Import error, cannot find paddle.fluid and jieba.lac_small.predict module. "
                                 "Now, back to jieba basic cut......")
 PY2 = sys.version_info[0] == 2
 default_encoding = sys.getfilesystemencoding()
@ -22,6 +72,7 @@ else:
    itervalues = lambda d: iter(d.values())
    iteritems = lambda d: iter(d.items())
 def strdecode(sentence):
    if not isinstance(sentence, text_type):
        try:
@ -29,3 +80,10 @@ def strdecode(sentence):
        except UnicodeDecodeError:
            sentence = sentence.decode('gbk', 'ignore')
    return sentence
 def resolve_filename(f):
    try:
        return f.name
    except AttributeError:
        return repr(f)
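
`enable_paddle` above checks the minimum version with a plain string comparison, which orders versions character by character. One possible alternative is to compare numeric tuples; the sketch below is an assumption about how that could look, not what the commit does, and `version_tuple` and `meets_minimum` are hypothetical helper names.

```python
import re


def version_tuple(version):
    """Turn '1.10.0' into (1, 10, 0). Parsing stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="1.6.1"):
    """Return True when `installed` is at least `minimum`, compared numerically."""
    return version_tuple(installed) >= version_tuple(minimum)


print("1.10.0" < "1.6.1")       # lexicographic string comparison
print(meets_minimum("1.10.0"))  # numeric tuple comparison
```

The two printed values differ for `1.10.0`, which is the case a string comparison gets wrong.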
--- a/jieba/analyse/init.py
+++ b/jieba/analyse/init.py
@ -1,103 +1,18 @@
 #encoding=utf-8
 from __future__ import absolute_import
-import jieba
+from .tfidf import TFIDF
-import jieba.posseg
+from .textrank import TextRank
 import os
 from operator import itemgetter
 from .textrank import textrank
 try:
    from .analyzer import ChineseAnalyzer
 except ImportError:
    pass
-_curpath = os.path.normpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+default_tfidf = TFIDF()
-abs_path = os.path.join(_curpath, "idf.txt")
+default_textrank = TextRank()
-STOP_WORDS = set((
+extract_tags = tfidf = default_tfidf.extract_tags
-    "the","of","is","and","to","in","that","we","for","an","are",
+set_idf_path = default_tfidf.set_idf_path
-    "by","be","as","on","with","can","if","from","which","you","it",
+textrank = default_textrank.extract_tags
    "this","then","at","have","all","not","one","has","or","that"
 ))
 class IDFLoader:
    def __init__(self):
        self.path = ""
        self.idf_freq = {}
        self.median_idf = 0.0
    def set_new_path(self, new_idf_path):
        if self.path != new_idf_path:
            content = open(new_idf_path, 'rb').read().decode('utf-8')
            idf_freq = {}
            lines = content.rstrip('\n').split('\n')
            for line in lines:
                word, freq = line.split(' ')
                idf_freq[word] = float(freq)
            median_idf = sorted(idf_freq.values())[len(idf_freq)//2]
            self.idf_freq = idf_freq
            self.median_idf = median_idf
            self.path = new_idf_path
    def get_idf(self):
        return self.idf_freq, self.median_idf
 idf_loader = IDFLoader()
 idf_loader.set_new_path(abs_path)
 def set_idf_path(idf_path):
    new_abs_path = os.path.normpath(os.path.join(os.getcwd(), idf_path))
    if not os.path.exists(new_abs_path):
        raise Exception("jieba: path does not exist: " + new_abs_path)
    idf_loader.set_new_path(new_abs_path)
 def set_stop_words(stop_words_path):
-    global STOP_WORDS
+    default_tfidf.set_stop_words(stop_words_path)
-    abs_path = os.path.normpath(os.path.join(os.getcwd(), stop_words_path))
+    default_textrank.set_stop_words(stop_words_path)
    if not os.path.exists(abs_path):
        raise Exception("jieba: path does not exist: " + abs_path)
    content = open(abs_path,'rb').read().decode('utf-8')
    lines = content.replace("\r", "").split('\n')
    for line in lines:
        STOP_WORDS.add(line)
 def extract_tags(sentence, topK=20, withWeight=False, allowPOS=[]):
    """
    Extract keywords from sentence using TF-IDF algorithm.
    Parameter:
        - topK: return how many top keywords. `None` for all possible words.
        - withWeight: if True, return a list of (word, weight);
                      if False, return a list of words.
        - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr'].
                    if the POS of w is not in this list,it will be filtered.
    """
    global STOP_WORDS, idf_loader
    idf_freq, median_idf = idf_loader.get_idf()
    if allowPOS:
        allowPOS = frozenset(allowPOS)
        words = jieba.posseg.cut(sentence)
    else:
        words = jieba.cut(sentence)
    freq = {}
    for w in words:
        if allowPOS:
            if w.flag not in allowPOS:
                continue
            else:
                w = w.word
        if len(w.strip()) < 2 or w.lower() in STOP_WORDS:
            continue
        freq[w] = freq.get(w, 0.0) + 1.0
    total = sum(freq.values())
    for k in freq:
        freq[k] *= idf_freq.get(k, median_idf) / total
    if withWeight:
        tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
    else:
        tags = sorted(freq, key=freq.__getitem__, reverse=True)
    if topK:
        return tags[:topK]
    else:
        return tags
--- a/jieba/analyse/analyzer.py
+++ b/jieba/analyse/analyzer.py
@ -13,9 +13,11 @@ STOP_WORDS = frozenset(('a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'can',
                        'to', 'us', 'we', 'when', 'will', 'with', 'yet',
                        'you', 'your', '的', '了', '和'))
-accepted_chars = re.compile(r"[\u4E00-\u9FA5]+")
+accepted_chars = re.compile(r"[\u4E00-\u9FD5]+")
 class ChineseTokenizer(Tokenizer):
    def __call__(self, text, **kargs):
        words = jieba.tokenize(text, mode="search")
        token = Token()
@ -28,6 +30,7 @@ class ChineseTokenizer(Tokenizer):
            token.endchar = stop_pos
            yield token
 def ChineseAnalyzer(stoplist=STOP_WORDS, minsize=1, stemfn=stem, cachesize=50000):
    return (ChineseTokenizer() | LowercaseFilter() |
            StopFilter(stoplist=stoplist, minsize=minsize) |
--- a/jieba/analyse/textrank.py
+++ b/jieba/analyse/textrank.py
@ -3,9 +3,10 @@
 from __future__ import absolute_import, unicode_literals
 import sys
 import collections
 from operator import itemgetter
-import jieba.posseg as pseg
+from collections import defaultdict
 import jieba.posseg
 from .tfidf import KeywordExtractor
 from .._compat import *
@ -13,7 +14,7 @@ class UndirectWeightedGraph:
    d = 0.85
    def __init__(self):
-        self.graph = collections.defaultdict(list)
+        self.graph = defaultdict(list)
    def addEdge(self, start, end, weight):
        # use a tuple (start, end, weight) instead of an Edge object
@ -21,8 +22,8 @@ class UndirectWeightedGraph:
        self.graph[end].append((end, start, weight))
    def rank(self):
-        ws = collections.defaultdict(float)
+        ws = defaultdict(float)
-        outSum = collections.defaultdict(float)
+        outSum = defaultdict(float)
        wsdef = 1.0 / (len(self.graph) or 1.0)
        for n, out in self.graph.items():
@ -43,7 +44,7 @@ class UndirectWeightedGraph:
        for w in itervalues(ws):
            if w < min_rank:
                min_rank = w
-            elif w > max_rank:
+            if w > max_rank:
                max_rank = w
        for n, w in ws.items():
@ -53,7 +54,19 @@ class UndirectWeightedGraph:
        return ws
-def textrank(sentence, topK=10, withWeight=False, allowPOS=['ns', 'n', 'vn', 'v']):
+class TextRank(KeywordExtractor):
    def __init__(self):
        self.tokenizer = self.postokenizer = jieba.posseg.dt
        self.stop_words = self.STOP_WORDS.copy()
        self.pos_filt = frozenset(('ns', 'n', 'vn', 'v'))
        self.span = 5
    def pairfilter(self, wp):
        return (wp.flag in self.pos_filt and len(wp.word.strip()) >= 2
                and wp.word.lower() not in self.stop_words)
    def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False):
        """
        Extract keywords from sentence using TextRank algorithm.
        Parameter:
@ -62,20 +75,24 @@ def textrank(sentence, topK=10, withWeight=False, allowPOS=['ns', 'n', 'vn', 'v'
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
                        if the POS of w is not in this list, it will be filtered.
            - withFlag: if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """
-    pos_filt = frozenset(allowPOS)
+        self.pos_filt = frozenset(allowPOS)
        g = UndirectWeightedGraph()
-    cm = collections.defaultdict(int)
-    span = 5
-    words = list(pseg.cut(sentence))
-    for i in xrange(len(words)):
-        if words[i].flag in pos_filt:
-            for j in xrange(i + 1, i + span):
+        cm = defaultdict(int)
+        words = tuple(self.tokenizer.cut(sentence))
+        for i, wp in enumerate(words):
+            if self.pairfilter(wp):
+                for j in xrange(i + 1, i + self.span):
                    if j >= len(words):
                        break
-                if words[j].flag not in pos_filt:
+                    if not self.pairfilter(words[j]):
                        continue
-                cm[(words[i].word, words[j].word)] += 1
+                    if allowPOS and withFlag:
+                        cm[(wp, words[j])] += 1
+                    else:
+                        cm[(wp.word, words[j].word)] += 1
        for terms, w in cm.items():
            g.addEdge(terms[0], terms[1], w)
@ -84,12 +101,10 @@ def textrank(sentence, topK=10, withWeight=False, allowPOS=['ns', 'n', 'vn', 'v'
            tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
        else:
            tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)
        if topK:
            return tags[:topK]
        else:
            return tags
-if __name__ == '__main__':
-    s = "此外，公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元，增资后，吉林欧亚置业注册资本由7000万元增加到5亿元。吉林欧亚置业主要经营范围为房地产开发及百货零售等业务。目前在建吉林欧亚城市商业综合体项目。2013年，实现营业收入0万元，实现净利润-139.13万元。"
-    for x, w in textrank(s, withWeight=True):
-        print('%s %s' % (x, w))
+    extract_tags = textrank
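
The windowed co-occurrence counting that feeds `UndirectWeightedGraph` in the `textrank` method above can be sketched in isolation. This is a simplified sketch under stated assumptions: `cooccurrence_counts` is a hypothetical name, the input is a list of plain strings rather than POS-tagged pairs, and the `keep` predicate stands in for `pairfilter`.

```python
from collections import defaultdict


def cooccurrence_counts(words, keep, span=5):
    """Count ordered pairs (w_i, w_j) with i < j < i + span where both words pass `keep`."""
    cm = defaultdict(int)
    for i, w in enumerate(words):
        if not keep(w):
            continue
        for j in range(i + 1, i + span):
            if j >= len(words):
                break
            if not keep(words[j]):
                continue
            cm[(w, words[j])] += 1
    return cm


words = ["吉林", "欧亚", "置业", "的", "吉林", "欧亚"]
# Assumed filter: keep words of at least two characters.
counts = cooccurrence_counts(words, keep=lambda w: len(w) >= 2, span=3)
for pair, weight in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(pair, weight)
```

Each counted pair then becomes one weighted edge, as in the `g.addEdge(terms[0], terms[1], w)` loop in the diff.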
--- a/jieba/analyse/tfidf.py
+++ b/jieba/analyse/tfidf.py
@ -0,0 +1,116 @@
 # encoding=utf-8
 from __future__ import absolute_import
 import os
 import jieba
 import jieba.posseg
 from operator import itemgetter
 _get_module_path = lambda path: os.path.normpath(os.path.join(os.getcwd(),
                                                 os.path.dirname(__file__), path))
 _get_abs_path = jieba._get_abs_path
 DEFAULT_IDF = _get_module_path("idf.txt")
 class KeywordExtractor(object):
    STOP_WORDS = set((
        "the", "of", "is", "and", "to", "in", "that", "we", "for", "an", "are",
        "by", "be", "as", "on", "with", "can", "if", "from", "which", "you", "it",
        "this", "then", "at", "have", "all", "not", "one", "has", "or", "that"
    ))
    def set_stop_words(self, stop_words_path):
        abs_path = _get_abs_path(stop_words_path)
        if not os.path.isfile(abs_path):
            raise Exception("jieba: file does not exist: " + abs_path)
        content = open(abs_path, 'rb').read().decode('utf-8')
        for line in content.splitlines():
            self.stop_words.add(line)
    def extract_tags(self, *args, **kwargs):
        raise NotImplementedError
 class IDFLoader(object):
    def __init__(self, idf_path=None):
        self.path = ""
        self.idf_freq = {}
        self.median_idf = 0.0
        if idf_path:
            self.set_new_path(idf_path)
    def set_new_path(self, new_idf_path):
        if self.path != new_idf_path:
            self.path = new_idf_path
            content = open(new_idf_path, 'rb').read().decode('utf-8')
            self.idf_freq = {}
            for line in content.splitlines():
                word, freq = line.strip().split(' ')
                self.idf_freq[word] = float(freq)
            self.median_idf = sorted(
                self.idf_freq.values())[len(self.idf_freq) // 2]
    def get_idf(self):
        return self.idf_freq, self.median_idf
 class TFIDF(KeywordExtractor):
    def __init__(self, idf_path=None):
        self.tokenizer = jieba.dt
        self.postokenizer = jieba.posseg.dt
        self.stop_words = self.STOP_WORDS.copy()
        self.idf_loader = IDFLoader(idf_path or DEFAULT_IDF)
        self.idf_freq, self.median_idf = self.idf_loader.get_idf()
    def set_idf_path(self, idf_path):
        new_abs_path = _get_abs_path(idf_path)
        if not os.path.isfile(new_abs_path):
            raise Exception("jieba: file does not exist: " + new_abs_path)
        self.idf_loader.set_new_path(new_abs_path)
        self.idf_freq, self.median_idf = self.idf_loader.get_idf()
    def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False):
        """
        Extract keywords from sentence using TF-IDF algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr'].
                        if the POS of w is not in this list, it will be filtered.
            - withFlag: only works when allowPOS is not empty.
                        if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """
        if allowPOS:
            allowPOS = frozenset(allowPOS)
            words = self.postokenizer.cut(sentence)
        else:
            words = self.tokenizer.cut(sentence)
        freq = {}
        for w in words:
            if allowPOS:
                if w.flag not in allowPOS:
                    continue
                elif not withFlag:
                    w = w.word
            wc = w.word if allowPOS and withFlag else w
            if len(wc.strip()) < 2 or wc.lower() in self.stop_words:
                continue
            freq[w] = freq.get(w, 0.0) + 1.0
        total = sum(freq.values())
        for k in freq:
            kw = k.word if allowPOS and withFlag else k
            freq[k] *= self.idf_freq.get(kw, self.median_idf) / total
        if withWeight:
            tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
        else:
            tags = sorted(freq, key=freq.__getitem__, reverse=True)
        if topK:
            return tags[:topK]
        else:
            return tags
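The weighting in `extract_tags` can be sketched without jieba itself. The snippet below is a minimal illustration, not the library's code: `tfidf_rank` and its parameters are hypothetical names, and the input is assumed to be an already segmented list of words.

```python
from operator import itemgetter


def tfidf_rank(words, idf_freq, median_idf, stop_words=frozenset(), top_k=None):
    """Rank words by term frequency times IDF (hypothetical helper)."""
    freq = {}
    for w in words:
        # Skip tokens shorter than two characters and stop words.
        if len(w.strip()) < 2 or w.lower() in stop_words:
            continue
        freq[w] = freq.get(w, 0.0) + 1.0
    total = sum(freq.values())
    # Words missing from the IDF table fall back to the median IDF.
    weights = {w: c * idf_freq.get(w, median_idf) / total
               for w, c in freq.items()}
    ranked = sorted(weights.items(), key=itemgetter(1), reverse=True)
    return ranked[:top_k] if top_k else ranked
```

As in the method above, a falsy `top_k` returns every ranked word.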
--- a/jieba/finalseg/init.py
+++ b/jieba/finalseg/init.py
@ -1,8 +1,8 @@
 from __future__ import absolute_import, unicode_literals
 import re
 import os
 import marshal
 import sys
 import pickle
 from .._compat import *
 MIN_FLOAT = -3.14e100
@ -19,26 +19,11 @@ PrevStatus = {
    'E': 'BM'
 }
-
+Force_Split_Words = set([])
 def load_model():
-    _curpath = os.path.normpath(
+    start_p = pickle.load(get_module_res("finalseg", PROB_START_P))
-        os.path.join(os.getcwd(), os.path.dirname(__file__)))
+    trans_p = pickle.load(get_module_res("finalseg", PROB_TRANS_P))
-
+    emit_p = pickle.load(get_module_res("finalseg", PROB_EMIT_P))
    start_p = {}
    abs_path = os.path.join(_curpath, PROB_START_P)
    with open(abs_path, 'rb') as f:
        start_p = marshal.load(f)
    trans_p = {}
    abs_path = os.path.join(_curpath, PROB_TRANS_P)
    with open(abs_path, 'rb') as f:
        trans_p = marshal.load(f)
    emit_p = {}
    abs_path = os.path.join(_curpath, PROB_EMIT_P)
    with open(abs_path, 'rb') as f:
        emit_p = marshal.load(f)
    return start_p, trans_p, emit_p
 if sys.platform.startswith("java"):
@ -89,17 +74,25 @@ def __cut(sentence):
    if nexti < len(sentence):
        yield sentence[nexti:]
-re_han = re.compile("([\u4E00-\u9FA5]+)")
+re_han = re.compile("([\u4E00-\u9FD5]+)")
-re_skip = re.compile("(\d+\.\d+|[a-zA-Z0-9]+)")
+re_skip = re.compile("([a-zA-Z0-9]+(?:\.\d+)?%?)")
 def add_force_split(word):
    global Force_Split_Words
    Force_Split_Words.add(word)
 def cut(sentence):
    sentence = strdecode(sentence)
    blocks = re_han.split(sentence)
    for blk in blocks:
        if re_han.match(blk):
            for word in __cut(blk):
                if word not in Force_Split_Words:
                    yield word
                else:
                    for c in word:
                        yield c
        else:
            tmp = re_skip.split(blk)
            for x in tmp:
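The effect of the two regex changes in this hunk can be checked in isolation. The patterns below are copied from the old and new sides of the diff; `skip_tokens` is a hypothetical helper added only for this illustration.

```python
import re

# Old and new patterns, copied from the hunk above.
old_han = re.compile("([\u4E00-\u9FA5]+)")
new_han = re.compile("([\u4E00-\u9FD5]+)")
old_skip = re.compile(r"(\d+\.\d+|[a-zA-Z0-9]+)")
new_skip = re.compile(r"([a-zA-Z0-9]+(?:\.\d+)?%?)")


def skip_tokens(pattern, text):
    """Split text on the pattern and drop the empty pieces."""
    return [piece for piece in pattern.split(text) if piece]


# The old pattern separates the percent sign from the number;
# the new one keeps a percentage together as a single token.
old_pieces = skip_tokens(old_skip, "3.5%")
new_pieces = skip_tokens(new_skip, "3.5%")

# U+9FA6 lies just beyond the old upper bound of the Han range.
matched_by_old = old_han.match("\u9fa6") is not None
matched_by_new = new_han.match("\u9fa6") is not None
```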
--- a/jieba/finalseg/prob_emit.p
+++ b/jieba/finalseg/prob_emit.p
--- a/jieba/finalseg/prob_start.p
+++ b/jieba/finalseg/prob_start.p
--- a/jieba/finalseg/prob_trans.p
+++ b/jieba/finalseg/prob_trans.p
--- a/jieba/lac_small/init.py
+++ b/jieba/lac_small/init.py
--- a/jieba/lac_small/creator.py
+++ b/jieba/lac_small/creator.py
@ -0,0 +1,46 @@
 # -*- coding: UTF-8 -*-
 #   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
 Define the function to create lexical analysis model and model's data reader
 """
 import sys
 import os
 import math
 import paddle
 import paddle.fluid as fluid
 from paddle.fluid.initializer import NormalInitializer
 import jieba.lac_small.nets as nets
 def create_model(vocab_size, num_labels, mode='train'):
    """create lac model"""
    # model's input data
    words = fluid.data(name='words', shape=[-1, 1], dtype='int64', lod_level=1)
    targets = fluid.data(
        name='targets', shape=[-1, 1], dtype='int64', lod_level=1)
    # for inference process
    if mode == 'infer':
        crf_decode = nets.lex_net(
            words, vocab_size, num_labels, for_infer=True, target=None)
        return {
            "feed_list": [words],
            "words": words,
            "crf_decode": crf_decode,
        }
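    # only the inference graph is defined here; `ret` was never assigned
    raise ValueError("create_model only supports mode='infer', got %r" % mode)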
--- a/jieba/lac_small/model_baseline/crfw
+++ b/jieba/lac_small/model_baseline/crfw
--- a/jieba/lac_small/model_baseline/fc_0.b_0
+++ b/jieba/lac_small/model_baseline/fc_0.b_0
--- a/jieba/lac_small/model_baseline/fc_0.w_0
+++ b/jieba/lac_small/model_baseline/fc_0.w_0
--- a/jieba/lac_small/model_baseline/fc_1.b_0
+++ b/jieba/lac_small/model_baseline/fc_1.b_0
--- a/jieba/lac_small/model_baseline/fc_1.w_0
+++ b/jieba/lac_small/model_baseline/fc_1.w_0
--- a/jieba/lac_small/model_baseline/fc_2.b_0
+++ b/jieba/lac_small/model_baseline/fc_2.b_0
--- a/jieba/lac_small/model_baseline/fc_2.w_0
+++ b/jieba/lac_small/model_baseline/fc_2.w_0
--- a/jieba/lac_small/model_baseline/fc_3.b_0
+++ b/jieba/lac_small/model_baseline/fc_3.b_0
--- a/jieba/lac_small/model_baseline/fc_3.w_0
+++ b/jieba/lac_small/model_baseline/fc_3.w_0
--- a/jieba/lac_small/model_baseline/fc_4.b_0
+++ b/jieba/lac_small/model_baseline/fc_4.b_0
--- a/jieba/lac_small/model_baseline/fc_4.w_0
+++ b/jieba/lac_small/model_baseline/fc_4.w_0
--- a/jieba/lac_small/model_baseline/gru_0.b_0
+++ b/jieba/lac_small/model_baseline/gru_0.b_0
--- a/jieba/lac_small/model_baseline/gru_0.w_0
+++ b/jieba/lac_small/model_baseline/gru_0.w_0
--- a/jieba/lac_small/model_baseline/gru_1.b_0
+++ b/jieba/lac_small/model_baseline/gru_1.b_0
--- a/jieba/lac_small/model_baseline/gru_1.w_0
+++ b/jieba/lac_small/model_baseline/gru_1.w_0
--- a/jieba/lac_small/model_baseline/gru_2.b_0
+++ b/jieba/lac_small/model_baseline/gru_2.b_0
--- a/jieba/lac_small/model_baseline/gru_2.w_0
+++ b/jieba/lac_small/model_baseline/gru_2.w_0
--- a/jieba/lac_small/model_baseline/gru_3.b_0
+++ b/jieba/lac_small/model_baseline/gru_3.b_0
--- a/jieba/lac_small/model_baseline/gru_3.w_0
+++ b/jieba/lac_small/model_baseline/gru_3.w_0
--- a/jieba/lac_small/model_baseline/word_emb
+++ b/jieba/lac_small/model_baseline/word_emb
--- a/jieba/lac_small/nets.py
+++ b/jieba/lac_small/nets.py
@ -0,0 +1,122 @@
 #   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
 The function lex_net(args) defines the lexical analysis network structure
 """
 import sys
 import os
 import math
 import paddle.fluid as fluid
 from paddle.fluid.initializer import NormalInitializer
 def lex_net(word, vocab_size, num_labels, for_infer=True, target=None):
    """
    define the lexical analysis network structure
    word: stores the input of the model
    for_infer: a boolean value, indicating if the model to be created is for training or predicting.
    return:
        for infer: return the prediction
        otherwise: return the prediction
    """
    word_emb_dim=128
    grnn_hidden_dim=128
    bigru_num=2
    emb_lr = 1.0
    crf_lr = 1.0
    init_bound = 0.1
    IS_SPARSE = True
    def _bigru_layer(input_feature):
        """
        define the bidirectional gru layer
        """
        pre_gru = fluid.layers.fc(
            input=input_feature,
            size=grnn_hidden_dim * 3,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound),
                regularizer=fluid.regularizer.L2DecayRegularizer(
                    regularization_coeff=1e-4)))
        gru = fluid.layers.dynamic_gru(
            input=pre_gru,
            size=grnn_hidden_dim,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound),
                regularizer=fluid.regularizer.L2DecayRegularizer(
                    regularization_coeff=1e-4)))
        pre_gru_r = fluid.layers.fc(
            input=input_feature,
            size=grnn_hidden_dim * 3,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound),
                regularizer=fluid.regularizer.L2DecayRegularizer(
                    regularization_coeff=1e-4)))
        gru_r = fluid.layers.dynamic_gru(
            input=pre_gru_r,
            size=grnn_hidden_dim,
            is_reverse=True,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound),
                regularizer=fluid.regularizer.L2DecayRegularizer(
                    regularization_coeff=1e-4)))
        bi_merge = fluid.layers.concat(input=[gru, gru_r], axis=1)
        return bi_merge
    def _net_conf(word, target=None):
        """
        Configure the network
        """
        word_embedding = fluid.embedding(
            input=word,
            size=[vocab_size, word_emb_dim],
            dtype='float32',
            is_sparse=IS_SPARSE,
            param_attr=fluid.ParamAttr(
                learning_rate=emb_lr,
                name="word_emb",
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound)))
        input_feature = word_embedding
        for i in range(bigru_num):
            bigru_output = _bigru_layer(input_feature)
            input_feature = bigru_output
        emission = fluid.layers.fc(
            size=num_labels,
            input=bigru_output,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound),
                regularizer=fluid.regularizer.L2DecayRegularizer(
                    regularization_coeff=1e-4)))
        size = emission.shape[1]
        fluid.layers.create_parameter(
            shape=[size + 2, size], dtype=emission.dtype, name='crfw')
        crf_decode = fluid.layers.crf_decoding(
            input=emission, param_attr=fluid.ParamAttr(name='crfw'))
        return crf_decode
    return _net_conf(word)
--- a/jieba/lac_small/predict.py
+++ b/jieba/lac_small/predict.py
@ -0,0 +1,82 @@
 # -*- coding: UTF-8 -*-
 #   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import argparse
 import os
 import time
 import sys
 import paddle.fluid as fluid
 import paddle
 import jieba.lac_small.utils as utils
 import jieba.lac_small.creator as creator
 import jieba.lac_small.reader_small as reader_small
 import numpy
 word_emb_dim=128
 grnn_hidden_dim=128
 bigru_num=2
 use_cuda=False
 basepath = os.path.abspath(__file__)
 folder = os.path.dirname(basepath)
 init_checkpoint = os.path.join(folder, "model_baseline")
 batch_size=1
 dataset = reader_small.Dataset()
 infer_program = fluid.Program()
 with fluid.program_guard(infer_program, fluid.default_startup_program()):
    with fluid.unique_name.guard():
        infer_ret = creator.create_model(dataset.vocab_size, dataset.num_labels, mode='infer')
 infer_program = infer_program.clone(for_test=True)
 place = fluid.CPUPlace()
 exe = fluid.Executor(place)
 exe.run(fluid.default_startup_program())
 utils.init_checkpoint(exe, init_checkpoint, infer_program)
 results = []
 def get_sent(str1):
    feed_data=dataset.get_vars(str1)
    a = numpy.array(feed_data).astype(numpy.int64)
    a=a.reshape(-1,1)
    c = fluid.create_lod_tensor(a, [[a.shape[0]]], place)
    words, crf_decode = exe.run(
            infer_program,
            fetch_list=[infer_ret['words'], infer_ret['crf_decode']],
            feed={"words":c, },
            return_numpy=False,
            use_program_cache=True)
    sents=[]
    sent,tag = utils.parse_result(words, crf_decode, dataset)
    sents = sents + sent
    return sents
 def get_result(str1):
    feed_data=dataset.get_vars(str1)
    a = numpy.array(feed_data).astype(numpy.int64)
    a=a.reshape(-1,1)
    c = fluid.create_lod_tensor(a, [[a.shape[0]]], place)
    words, crf_decode = exe.run(
            infer_program,
            fetch_list=[infer_ret['words'], infer_ret['crf_decode']],
            feed={"words":c, },
            return_numpy=False,
            use_program_cache=True)
    results=[]
    results += utils.parse_result(words, crf_decode, dataset)
    return results
--- a/jieba/lac_small/reader_small.py
+++ b/jieba/lac_small/reader_small.py
@ -0,0 +1,100 @@
 #   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
 The file_reader converts raw corpus to input.
 """
 import os
 import __future__
 import io
 import paddle
 import paddle.fluid as fluid
 def load_kv_dict(dict_path,
                 reverse=False,
                 delimiter="\t",
                 key_func=None,
                 value_func=None):
    """
    Load key-value dict from file
    """
    result_dict = {}
    for line in io.open(dict_path, "r", encoding='utf8'):
        terms = line.strip("\n").split(delimiter)
        if len(terms) != 2:
            continue
        if reverse:
            value, key = terms
        else:
            key, value = terms
        if key in result_dict:
            raise KeyError("key duplicated with [%s]" % (key))
        if key_func:
            key = key_func(key)
        if value_func:
            value = value_func(value)
        result_dict[key] = value
    return result_dict
 class Dataset(object):
    """data reader"""
    def __init__(self):
        # read dict
        basepath = os.path.abspath(__file__)
        folder = os.path.dirname(basepath)
        word_dict_path = os.path.join(folder, "word.dic")
        label_dict_path = os.path.join(folder, "tag.dic")
        self.word2id_dict = load_kv_dict(
            word_dict_path, reverse=True, value_func=int)
        self.id2word_dict = load_kv_dict(word_dict_path)
        self.label2id_dict = load_kv_dict(
            label_dict_path, reverse=True, value_func=int)
        self.id2label_dict = load_kv_dict(label_dict_path)
    @property
    def vocab_size(self):
        """vocabulary size"""
        return max(self.word2id_dict.values()) + 1
    @property
    def num_labels(self):
        """num_labels"""
        return max(self.label2id_dict.values()) + 1
    def word_to_ids(self, words):
        """convert word to word index"""
        word_ids = []
        for word in words:
            if word not in self.word2id_dict:
                word = "OOV"
            word_id = self.word2id_dict[word]
            word_ids.append(word_id)
        return word_ids
    def label_to_ids(self, labels):
        """convert label to label index"""
        label_ids = []
        for label in labels:
            if label not in self.label2id_dict:
                label = "O"
            label_id = self.label2id_dict[label]
            label_ids.append(label_id)
        return label_ids
    def get_vars(self,str1):
        words = str1.strip()
        word_ids = self.word_to_ids(words)
        return word_ids
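The dictionary loading and the OOV fallback in `reader_small.py` can be sketched on in-memory lines. `parse_kv_lines` is a hypothetical stand-in for `load_kv_dict` that takes a list of strings instead of a file path, and the sample entries are made up for the example.

```python
def parse_kv_lines(lines, reverse=False, delimiter="\t", value_func=None):
    """Build a dict from 'key<delimiter>value' lines (hypothetical helper)."""
    result = {}
    for line in lines:
        terms = line.strip("\n").split(delimiter)
        if len(terms) != 2:
            # Malformed lines are skipped silently, as in load_kv_dict.
            continue
        key, value = (terms[1], terms[0]) if reverse else terms
        if key in result:
            raise KeyError("key duplicated with [%s]" % key)
        result[key] = value_func(value) if value_func else value
    return result


def word_to_ids(chars, word2id, oov_token="OOV"):
    """Map each character to its id; unknown characters get the OOV id."""
    oov_id = word2id[oov_token]
    return [word2id.get(ch, oov_id) for ch in chars]


# Made-up dictionary lines in the 'id<TAB>word' layout of word.dic.
sample_lines = ["0\t的\n", "1\tOOV\n", "not a valid entry\n"]
word2id = parse_kv_lines(sample_lines, reverse=True, value_func=int)
ids = word_to_ids("的x", word2id)
```

With `reverse=True` the second column becomes the key, which is how the same file yields both the word-to-id and the id-to-word mapping.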
--- a/jieba/lac_small/tag.dic
+++ b/jieba/lac_small/tag.dic
@ -0,0 +1,57 @@
 0	a-B
 1	a-I
 2	ad-B
 3	ad-I
 4	an-B
 5	an-I
 6	c-B
 7	c-I
 8	d-B
 9	d-I
 10	f-B
 11	f-I
 12	m-B
 13	m-I
 14	n-B
 15	n-I
 16	nr-B
 17	nr-I
 18	ns-B
 19	ns-I
 20	nt-B
 21	nt-I
 22	nw-B
 23	nw-I
 24	nz-B
 25	nz-I
 26	p-B
 27	p-I
 28	q-B
 29	q-I
 30	r-B
 31	r-I
 32	s-B
 33	s-I
 34	t-B
 35	t-I
 36	u-B
 37	u-I
 38	v-B
 39	v-I
 40	vd-B
 41	vd-I
 42	vn-B
 43	vn-I
 44	w-B
 45	w-I
 46	xc-B
 47	xc-I
 48	PER-B
 49	PER-I
 50	LOC-B
 51	LOC-I
 52	ORG-B
 53	ORG-I
 54	TIME-B
 55	TIME-I
 56	O
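The labels in `tag.dic` are applied per character; `-B` marks the first character of a word and `-I` a continuation. A minimal sketch of how such a tag sequence is merged back into words, following the loop in `parse_result`, is given below. `merge_tagged_chars` is a hypothetical name and the sample sentence is made up.

```python
def merge_tagged_chars(chars, tags):
    """Merge per-character B/I tags into words and word-level tags."""
    words, word_tags = [], []
    partial = ""
    for i, tag in enumerate(tags):
        if partial == "":
            # First character: start a word and record its tag.
            partial = chars[i]
            word_tags.append(tag.split("-")[0])
            continue
        if tag.endswith("-B") or (tag == "O" and tags[i - 1] != "O"):
            # A new word begins: flush the previous one.
            words.append(partial)
            word_tags.append(tag.split("-")[0])
            partial = chars[i]
            continue
        partial += chars[i]
    # Flush the last word, unless the input was empty.
    if len(words) < len(word_tags):
        words.append(partial)
    return words, word_tags


words, word_tags = merge_tagged_chars(
    list("我爱北京"), ["r-B", "v-B", "LOC-B", "LOC-I"])
```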
--- a/jieba/lac_small/utils.py
+++ b/jieba/lac_small/utils.py
@ -0,0 +1,142 @@
 #   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
 util tools
 """
 from __future__ import print_function
 import os
 import sys
 import numpy as np
 import paddle.fluid as fluid
 import io
 def str2bool(v):
    """
    argparse does not convert "True"/"False" strings to booleans, so parse them here
    """
    return v.lower() in ("true", "t", "1")
 def parse_result(words, crf_decode, dataset):
    """ parse result """
    offset_list = (crf_decode.lod())[0]
    words = np.array(words)
    crf_decode = np.array(crf_decode)
    batch_size = len(offset_list) - 1
    for sent_index in range(batch_size):
        begin, end = offset_list[sent_index], offset_list[sent_index + 1]
        sent=[]
        for id in words[begin:end]:
            if dataset.id2word_dict[str(id[0])]=='OOV':
                sent.append(' ')
            else:
                sent.append(dataset.id2word_dict[str(id[0])])
        tags = [
            dataset.id2label_dict[str(id[0])] for id in crf_decode[begin:end]
        ]
        sent_out = []
        tags_out = []
        parital_word = ""
        for ind, tag in enumerate(tags):
            # for the first word
            if parital_word == "":
                parital_word = sent[ind]
                tags_out.append(tag.split('-')[0])
                continue
            # for the beginning of word
            if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
                sent_out.append(parital_word)
                tags_out.append(tag.split('-')[0])
                parital_word = sent[ind]
                continue
            parital_word += sent[ind]
        # append the last word, except for len(tags)=0
        if len(sent_out) < len(tags_out):
            sent_out.append(parital_word)
    return sent_out,tags_out
 def parse_padding_result(words, crf_decode, seq_lens, dataset):
    """ parse padding result """
    words = np.squeeze(words)
    batch_size = len(seq_lens)
    batch_out = []
    for sent_index in range(batch_size):
        sent=[]
        for id in words[sent_index][1:seq_lens[sent_index] - 1]:
            if dataset.id2word_dict[str(id)]=='OOV':
                sent.append(' ')
            else:
                sent.append(dataset.id2word_dict[str(id)])
        tags = [
            dataset.id2label_dict[str(id)]
            for id in crf_decode[sent_index][1:seq_lens[sent_index] - 1]
        ]
        sent_out = []
        tags_out = []
        parital_word = ""
        for ind, tag in enumerate(tags):
            # for the first word
            if parital_word == "":
                parital_word = sent[ind]
                tags_out.append(tag.split('-')[0])
                continue
            # for the beginning of word
            if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
                sent_out.append(parital_word)
                tags_out.append(tag.split('-')[0])
                parital_word = sent[ind]
                continue
            parital_word += sent[ind]
        # append the last word, except for len(tags)=0
        if len(sent_out) < len(tags_out):
            sent_out.append(parital_word)
        batch_out.append([sent_out, tags_out])
    return batch_out
 def init_checkpoint(exe, init_checkpoint_path, main_program):
    """
    Init CheckPoint
    """
    assert os.path.exists(
        init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path
    def existed_persitables(var):
        """
        Return True if var is persistable and a file named after it exists in the checkpoint directory
        """
        if not fluid.io.is_persistable(var):
            return False
        return os.path.exists(os.path.join(init_checkpoint_path, var.name))
    fluid.io.load_vars(
        exe,
        init_checkpoint_path,
        main_program=main_program,
        predicate=existed_persitables)
--- a/jieba/lac_small/word.dic
+++ b/jieba/lac_small/word.dic
--- a/jieba/posseg/init.py
+++ b/jieba/posseg/init.py
@ -1,21 +1,20 @@
 from __future__ import absolute_import, unicode_literals
 import pickle
 import re
-import os
+
 import jieba
 import sys
 import marshal
 from functools import wraps
 from .._compat import *
 from .viterbi import viterbi
 from .._compat import *
 PROB_START_P = "prob_start.p"
 PROB_TRANS_P = "prob_trans.p"
 PROB_EMIT_P = "prob_emit.p"
 CHAR_STATE_TAB_P = "char_state_tab.p"
-re_han_detail = re.compile("([\u4E00-\u9FA5]+)")
+re_han_detail = re.compile("([\u4E00-\u9FD5]+)")
 re_skip_detail = re.compile("([\.0-9]+|[a-zA-Z0-9]+)")
-re_han_internal = re.compile("([\u4E00-\u9FA5a-zA-Z0-9+#&\._]+)")
+re_han_internal = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._]+)")
 re_skip_internal = re.compile("(\r\n|\s)")
 re_eng = re.compile("[a-zA-Z0-9]+")
@ -24,69 +23,23 @@ re_num = re.compile("[\.0-9]+")
 re_eng1 = re.compile('^[a-zA-Z0-9]$', re.U)
-def load_model(f_name, isJython=True):
+def load_model():
-    _curpath = os.path.normpath(
+    # For Jython
-        os.path.join(os.getcwd(), os.path.dirname(__file__)))
+    start_p = pickle.load(get_module_res("posseg", PROB_START_P))
    trans_p = pickle.load(get_module_res("posseg", PROB_TRANS_P))
    emit_p = pickle.load(get_module_res("posseg", PROB_EMIT_P))
    state = pickle.load(get_module_res("posseg", CHAR_STATE_TAB_P))
    return state, start_p, trans_p, emit_p
    result = {}
    with open(f_name, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            line = line.decode("utf-8")
            word, _, tag = line.split(" ")
            result[word] = tag
    if not isJython:
        return result
    start_p = {}
    abs_path = os.path.join(_curpath, PROB_START_P)
    with open(abs_path, 'rb') as f:
        start_p = marshal.load(f)
    trans_p = {}
    abs_path = os.path.join(_curpath, PROB_TRANS_P)
    with open(abs_path, 'rb') as f:
        trans_p = marshal.load(f)
    emit_p = {}
    abs_path = os.path.join(_curpath, PROB_EMIT_P)
    with open(abs_path, 'rb') as f:
        emit_p = marshal.load(f)
    state = {}
    abs_path = os.path.join(_curpath, CHAR_STATE_TAB_P)
    with open(abs_path, 'rb') as f:
        state = marshal.load(f)
    f.closed
    return state, start_p, trans_p, emit_p, result
 if sys.platform.startswith("java"):
-    char_state_tab_P, start_P, trans_P, emit_P, word_tag_tab = load_model(
+    char_state_tab_P, start_P, trans_P, emit_P = load_model()
        jieba.get_abs_path_dict())
 else:
    from .char_state_tab import P as char_state_tab_P
    from .prob_start import P as start_P
    from .prob_trans import P as trans_P
    from .prob_emit import P as emit_P
    word_tag_tab = load_model(jieba.get_abs_path_dict(), isJython=False)
 def makesure_userdict_loaded(fn):
    @wraps(fn)
    def wrapped(*args, **kwargs):
        if jieba.user_word_tag_tab:
            word_tag_tab.update(jieba.user_word_tag_tab)
            jieba.user_word_tag_tab = {}
        return fn(*args, **kwargs)
    return wrapped
 class pair(object):
@ -98,7 +51,7 @@ class pair(object):
        return '%s/%s' % (self.word, self.flag)
    def __repr__(self):
-        return self.__str__()
+        return 'pair(%r, %r)' % (self.word, self.flag)
    def __str__(self):
        if PY2:
@ -106,11 +59,62 @@ class pair(object):
        else:
            return self.__unicode__()
    def __iter__(self):
        return iter((self.word, self.flag))
    def __lt__(self, other):
        return self.word < other.word
    def __eq__(self, other):
        return isinstance(other, pair) and self.word == other.word and self.flag == other.flag
    def __hash__(self):
        return hash(self.word)
    def encode(self, arg):
        return self.__unicode__().encode(arg)
-def __cut(sentence):
+class POSTokenizer(object):
    def __init__(self, tokenizer=None):
        self.tokenizer = tokenizer or jieba.Tokenizer()
        self.load_word_tag(self.tokenizer.get_dict_file())
    def __repr__(self):
        return '<POSTokenizer tokenizer=%r>' % self.tokenizer
    def __getattr__(self, name):
        if name in ('cut_for_search', 'lcut_for_search', 'tokenize'):
            # may be possible?
            raise NotImplementedError
        return getattr(self.tokenizer, name)
    def initialize(self, dictionary=None):
        self.tokenizer.initialize(dictionary)
        self.load_word_tag(self.tokenizer.get_dict_file())
    def load_word_tag(self, f):
        self.word_tag_tab = {}
        f_name = resolve_filename(f)
        for lineno, line in enumerate(f, 1):
            try:
                line = line.strip().decode("utf-8")
                if not line:
                    continue
                word, _, tag = line.split(" ")
                self.word_tag_tab[word] = tag
            except Exception:
                raise ValueError(
                    'invalid POS dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
        f.close()
    def makesure_userdict_loaded(self):
        if self.tokenizer.user_word_tag_tab:
            self.word_tag_tab.update(self.tokenizer.user_word_tag_tab)
            self.tokenizer.user_word_tag_tab = {}
    def __cut(self, sentence):
        prob, pos_list = viterbi(
            sentence, char_state_tab_P, start_P, trans_P, emit_P)
        begin, nexti = 0, 0
@ -128,12 +132,11 @@ def __cut(sentence):
        if nexti < len(sentence):
            yield pair(sentence[nexti:], pos_list[nexti][1])
-
+    def __cut_detail(self, sentence):
 def __cut_detail(sentence):
        blocks = re_han_detail.split(sentence)
        for blk in blocks:
            if re_han_detail.match(blk):
-            for word in __cut(blk):
+                for word in self.__cut(blk):
                    yield word
            else:
                tmp = re_skip_detail.split(blk)
@ -146,11 +149,10 @@ def __cut_detail(sentence):
                        else:
                            yield pair(x, 'x')
-
+    def __cut_DAG_NO_HMM(self, sentence):
-def __cut_DAG_NO_HMM(sentence):
+        DAG = self.tokenizer.get_DAG(sentence)
    DAG = jieba.get_DAG(sentence)
        route = {}
-    jieba.calc(sentence, DAG, route)
+        self.tokenizer.calc(sentence, DAG, route)
        x = 0
        N = len(sentence)
        buf = ''
@ -164,18 +166,17 @@ def __cut_DAG_NO_HMM(sentence):
                if buf:
                    yield pair(buf, 'eng')
                    buf = ''
-            yield pair(l_word, word_tag_tab.get(l_word, 'x'))
+                yield pair(l_word, self.word_tag_tab.get(l_word, 'x'))
                x = y
        if buf:
            yield pair(buf, 'eng')
            buf = ''
-
+    def __cut_DAG(self, sentence):
-def __cut_DAG(sentence):
+        DAG = self.tokenizer.get_DAG(sentence)
    DAG = jieba.get_DAG(sentence)
        route = {}
-    jieba.calc(sentence, DAG, route)
+        self.tokenizer.calc(sentence, DAG, route)
        x = 0
        buf = ''
@ -188,41 +189,41 @@ def __cut_DAG(sentence):
            else:
                if buf:
                    if len(buf) == 1:
-                    yield pair(buf, word_tag_tab.get(buf, 'x'))
+                        yield pair(buf, self.word_tag_tab.get(buf, 'x'))
-                elif buf not in jieba.FREQ:
+                    elif not self.tokenizer.FREQ.get(buf):
-                    recognized = __cut_detail(buf)
+                        recognized = self.__cut_detail(buf)
                        for t in recognized:
                            yield t
                    else:
                        for elem in buf:
-                        yield pair(elem, word_tag_tab.get(elem, 'x'))
+                            yield pair(elem, self.word_tag_tab.get(elem, 'x'))
                    buf = ''
-            yield pair(l_word, word_tag_tab.get(l_word, 'x'))
+                yield pair(l_word, self.word_tag_tab.get(l_word, 'x'))
            x = y
        if buf:
            if len(buf) == 1:
-            yield pair(buf, word_tag_tab.get(buf, 'x'))
+                yield pair(buf, self.word_tag_tab.get(buf, 'x'))
-        elif (buf not in jieba.FREQ):
+            elif not self.tokenizer.FREQ.get(buf):
-            recognized = __cut_detail(buf)
+                recognized = self.__cut_detail(buf)
                for t in recognized:
                    yield t
            else:
                for elem in buf:
-                yield pair(elem, word_tag_tab.get(elem, 'x'))
+                    yield pair(elem, self.word_tag_tab.get(elem, 'x'))
-
+    def __cut_internal(self, sentence, HMM=True):
-def __cut_internal(sentence, HMM=True):
+        self.makesure_userdict_loaded()
        sentence = strdecode(sentence)
        blocks = re_han_internal.split(sentence)
        if HMM:
-        __cut_blk = __cut_DAG
+            cut_blk = self.__cut_DAG
        else:
-        __cut_blk = __cut_DAG_NO_HMM
+            cut_blk = self.__cut_DAG_NO_HMM
        for blk in blocks:
            if re_han_internal.match(blk):
-            for word in __cut_blk(blk):
+                for word in cut_blk(blk):
                    yield word
            else:
                tmp = re_skip_internal.split(blk)
@ -238,26 +239,72 @@ def __cut_internal(sentence, HMM=True):
                            else:
                                yield pair(xx, 'x')
    def _lcut_internal(self, sentence):
        return list(self.__cut_internal(sentence))
-def __lcut_internal(sentence):
+    def _lcut_internal_no_hmm(self, sentence):
-    return list(__cut_internal(sentence))
+        return list(self.__cut_internal(sentence, False))
    def cut(self, sentence, HMM=True):
        for w in self.__cut_internal(sentence, HMM=HMM):
            yield w
    def lcut(self, *args, **kwargs):
        return list(self.cut(*args, **kwargs))
-def __lcut_internal_no_hmm(sentence):
+# default Tokenizer instance
-    return list(__cut_internal(sentence, False))
+
 dt = POSTokenizer(jieba.dt)
 # global functions
 initialize = dt.initialize
-@makesure_userdict_loaded
+def _lcut_internal(s):
-def cut(sentence, HMM=True):
+    return dt._lcut_internal(s)
 def _lcut_internal_no_hmm(s):
    return dt._lcut_internal_no_hmm(s)
 def cut(sentence, HMM=True, use_paddle=False):
    """
    Global `cut` function that supports parallel processing.
    Note that this only works with the default dt; custom POSTokenizer
    instances are not supported.
    """
    is_paddle_installed = check_paddle_install['is_paddle_installed']
    if use_paddle and is_paddle_installed:
        # an empty sentence causes a core dump in paddle, so return early.
        if sentence is None or sentence == "" or sentence == u"":
            return
        import jieba.lac_small.predict as predict
        sents, tags = predict.get_result(strdecode(sentence))
        for i, sent in enumerate(sents):
            if sent is None or tags[i] is None:
                continue
            yield pair(sent, tags[i])
        return
    global dt
    if jieba.pool is None:
-        for w in __cut_internal(sentence, HMM=HMM):
+        for w in dt.cut(sentence, HMM=HMM):
            yield w
    else:
        parts = strdecode(sentence).splitlines(True)
        if HMM:
-            result = jieba.pool.map(__lcut_internal, parts)
+            result = jieba.pool.map(_lcut_internal, parts)
        else:
-            result = jieba.pool.map(__lcut_internal_no_hmm, parts)
+            result = jieba.pool.map(_lcut_internal_no_hmm, parts)
        for r in result:
            for w in r:
                yield w
 def lcut(sentence, HMM=True, use_paddle=False):
    if use_paddle:
        return list(cut(sentence, use_paddle=True))
    return list(cut(sentence, HMM))
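
Note on the `cut` hunk above: the paddle branch relies on the fact that a bare `return` inside a generator simply ends iteration, so empty input yields nothing and never reaches the model. The sketch below illustrates that control flow only. `cut_sketch` and `fake_model` are hypothetical stand-ins, not jieba or paddle APIs, and the whitespace fallback is an assumption made to keep the example self-contained.

```python
def cut_sketch(sentence, use_model=False, model=None):
    """Yield (word, tag) pairs; hypothetical stand-in for the dispatch in cut()."""
    if use_model and model is not None:
        # Guard: never hand None or empty input to the model backend.
        if not sentence:
            return  # in a generator, a bare return just ends iteration
        words, tags = model(sentence)
        for word, tag in zip(words, tags):
            if word is None or tag is None:
                continue  # skip incomplete predictions
            yield (word, tag)
        return
    # Fallback path (assumption): a trivial whitespace splitter tagged 'x'.
    for word in sentence.split():
        yield (word, "x")


def fake_model(text):
    # Hypothetical backend: one token per character, all tagged 'n'.
    return list(text), ["n"] * len(text)


empty_result = list(cut_sketch("", use_model=True, model=fake_model))
model_result = list(cut_sketch("ab", use_model=True, model=fake_model))
fallback_result = list(cut_sketch("a b"))
```

Because the function is a generator, callers that want a list wrap it in `list(...)`, which is what the `lcut` wrapper in the hunk does.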
--- a/jieba/posseg/char_state_tab.p
+++ b/jieba/posseg/char_state_tab.p
--- a/jieba/posseg/prob_emit.p
+++ b/jieba/posseg/prob_emit.p
--- a/jieba/posseg/prob_start.p
+++ b/jieba/posseg/prob_start.p
--- a/jieba/posseg/prob_trans.p
+++ b/jieba/posseg/prob_trans.p
--- a/setup.py
+++ b/setup.py
@ -43,8 +43,8 @@ GitHub: https://github.com/fxsjy/jieba
 """
 setup(name='jieba',
-      version='0.36',
+      version='0.42.1',
-      description='Chinese Words Segementation Utilities',
+      description='Chinese Words Segmentation Utilities',
      long_description=LONGDOC,
      author='Sun, Junyi',
      author_email='ccnusjy@gmail.com',
@ -71,5 +71,5 @@ setup(name='jieba',
      keywords='NLP,tokenizing,Chinese word segementation',
      packages=['jieba'],
      package_dir={'jieba':'jieba'},
-      package_data={'jieba':['*.*','finalseg/*','analyse/*','posseg/*']}
+      package_data={'jieba':['*.*','finalseg/*','analyse/*','posseg/*', 'lac_small/*.py','lac_small/*.dic', 'lac_small/model_baseline/*']}
 )
--- a/test/demo.py
+++ b/test/demo.py
@ -4,6 +4,12 @@ import sys
 sys.path.append("../")
 import jieba
 import jieba.posseg
 import jieba.analyse
 print('='*40)
 print('1. 分词')
 print('-'*40)
 seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
 print("Full Mode: " + "/ ".join(seg_list))  # full mode
@ -16,3 +22,63 @@ print(", ".join(seg_list))
 seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
 print(", ".join(seg_list))
 print('='*40)
 print('2. 添加自定义词典/调整词典')
 print('-'*40)
 print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
 #如果/放到/post/中将/出错/。
 print(jieba.suggest_freq(('中', '将'), True))
 #494
 print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
 #如果/放到/post/中/将/出错/。
 print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
 #「/台/中/」/正确/应该/不会/被/切开
 print(jieba.suggest_freq('台中', True))
 #69
 print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
 #「/台中/」/正确/应该/不会/被/切开
 print('='*40)
 print('3. 关键词提取')
 print('-'*40)
 print(' TF-IDF')
 print('-'*40)
 s = "此外，公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元，增资后，吉林欧亚置业注册资本由7000万元增加到5亿元。吉林欧亚置业主要经营范围为房地产开发及百货零售等业务。目前在建吉林欧亚城市商业综合体项目。2013年，实现营业收入0万元，实现净利润-139.13万元。"
 for x, w in jieba.analyse.extract_tags(s, withWeight=True):
    print('%s %s' % (x, w))
 print('-'*40)
 print(' TextRank')
 print('-'*40)
 for x, w in jieba.analyse.textrank(s, withWeight=True):
    print('%s %s' % (x, w))
 print('='*40)
 print('4. 词性标注')
 print('-'*40)
 words = jieba.posseg.cut("我爱北京天安门")
 for word, flag in words:
    print('%s %s' % (word, flag))
 print('='*40)
 print('6. Tokenize: 返回词语在原文的起止位置')
 print('-'*40)
 print(' 默认模式')
 print('-'*40)
 result = jieba.tokenize('永和服装饰品有限公司')
 for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
 print('-'*40)
 print(' 搜索模式')
 print('-'*40)
 result = jieba.tokenize('永和服装饰品有限公司', mode='search')
 for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
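
Note on the Tokenize section of the demo: each result is a `(word, start, end)` triple. For the default mode, the offsets can be thought of as running character counts over the segmented words. The sketch below shows that idea with a hypothetical helper, `tokenize_sketch`, that takes an already segmented word list. It assumes the words are contiguous and cover the text in order, and it does not model the extra sub-word tokens that search mode emits.

```python
def tokenize_sketch(words):
    """Yield (word, start, end) for consecutive words of one text."""
    start = 0
    for word in words:
        end = start + len(word)  # len() counts characters in Python 3
        yield (word, start, end)
        start = end


# Hand-written segmentation used only as sample input (not jieba output).
offsets = list(tokenize_sketch(["ab", "c", "def"]))
for word, start, end in offsets:
    print("word %s\t\t start: %d \t\t end:%d" % (word, start, end))
```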
--- a/test/parallel/test_disable_hmm.py
+++ b/test/parallel/test_disable_hmm.py
@ -0,0 +1,95 @@
 #encoding=utf-8
 from __future__ import print_function
 import sys
 sys.path.append("../../")
 import jieba
 jieba.enable_parallel(4)
 def cuttest(test_sent):
    result = jieba.cut(test_sent, HMM=False)
    for word in result:
        print(word, "/", end=' ')
    print("")
 if __name__ == "__main__":
    cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空，我爱北京，我爱Python和C++。")
    cuttest("我不喜欢日本和服。")
    cuttest("雷猴回归人间。")
    cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
    cuttest("我需要廉租房")
    cuttest("永和服装饰品有限公司")
    cuttest("我爱北京天安门")
    cuttest("abc")
    cuttest("隐马尔可夫")
    cuttest("雷猴是个好网站")
    cuttest("“Microsoft”一词由“MICROcomputer（微型计算机）”和“SOFTware（软件）”两部分组成")
    cuttest("草泥马和欺实马是今年的流行词汇")
    cuttest("伊藤洋华堂总府店")
    cuttest("中国科学院计算技术研究所")
    cuttest("罗密欧与朱丽叶")
    cuttest("我购买了道具和服装")
    cuttest("PS: 我觉得开源有一个好处，就是能够敦促自己不断改进，避免敞帚自珍")
    cuttest("湖北省石首市")
    cuttest("湖北省十堰市")
    cuttest("总经理完成了这件事情")
    cuttest("电脑修好了")
    cuttest("做好了这件事情就一了百了了")
    cuttest("人们审美的观点是不同的")
    cuttest("我们买了一个美的空调")
    cuttest("线程初始化时我们要注意")
    cuttest("一个分子是由好多原子组织成的")
    cuttest("祝你马到功成")
    cuttest("他掉进了无底洞里")
    cuttest("中国的首都是北京")
    cuttest("孙君意")
    cuttest("外交部发言人马朝旭")
    cuttest("领导人会议和第四届东亚峰会")
    cuttest("在过去的这五年")
    cuttest("还需要很长的路要走")
    cuttest("60周年首都阅兵")
    cuttest("你好人们审美的观点是不同的")
    cuttest("买水果然后来世博园")
    cuttest("买水果然后去世博园")
    cuttest("但是后来我才知道你是对的")
    cuttest("存在即合理")
    cuttest("的的的的的在的的的的就以和和和")
    cuttest("I love你，不以为耻，反以为rong")
    cuttest("因")
    cuttest("")
    cuttest("hello你好人们审美的观点是不同的")
    cuttest("很好但主要是基于网页形式")
    cuttest("hello你好人们审美的观点是不同的")
    cuttest("为什么我不能拥有想要的生活")
    cuttest("后来我才")
    cuttest("此次来中国是为了")
    cuttest("使用了它就可以解决一些问题")
    cuttest(",使用了它就可以解决一些问题")
    cuttest("其实使用了它就可以解决一些问题")
    cuttest("好人使用了它就可以解决一些问题")
    cuttest("是因为和国家")
    cuttest("老年搜索还支持")
    cuttest("干脆就把那部蒙人的闲法给废了拉倒！RT @laoshipukong : 27日，全国人大常委会第三次审议侵权责任法草案，删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
    cuttest("大")
    cuttest("")
    cuttest("他说的确实在理")
    cuttest("长春市长春节讲话")
    cuttest("结婚的和尚未结婚的")
    cuttest("结合成分子时")
    cuttest("旅游和服务是最好的")
    cuttest("这件事情的确是我的错")
    cuttest("供大家参考指正")
    cuttest("哈尔滨政府公布塌桥原因")
    cuttest("我在机场入口处")
    cuttest("邢永臣摄影报道")
    cuttest("BP神经网络如何训练才能在分类时增加区分度？")
    cuttest("南京市长江大桥")
    cuttest("应一些使用者的建议，也为了便于利用NiuTrans用于SMT研究")
    cuttest('长春市长春药店')
    cuttest('邓颖超生前最喜欢的衣服')
    cuttest('胡锦涛是热爱世界和平的政治局常委')
    cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
    cuttest('一次性交多少钱')
    cuttest('两块五一套，三块八一斤，四块七一本，五块六一条')
    cuttest('小和尚留了一个像大和尚一样的和尚头')
    cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
--- a/test/test.py
+++ b/test/test.py
@ -98,3 +98,5 @@ if __name__ == "__main__":
    cuttest('张三风同学走上了不归路')
    cuttest('阿Q腰间挂着BB机手里拿着大哥大，说：我一般吃饭不AA制的。')
    cuttest('在1号店能买到小S和大S八卦的书，还有3D电视。')
    jieba.del_word('很赞')
    cuttest('看上去iphone8手机样式很赞,售价699美元,销量涨了5%么？')
--- a/test/test_cutall.py
+++ b/test/test_cutall.py
@ -96,3 +96,6 @@ if __name__ == "__main__":
    cuttest('AT&T是一件不错的公司，给你发offer了吗？')
    cuttest('C++和c#是什么关系？11+122=133，是吗？PI=3.14159')
    cuttest('你认识那个和主席握手的的哥吗？他开一辆黑色的士。')
    jieba.add_word('超敏C反应蛋白')
    cuttest('超敏C反应蛋白是什么, java好学吗?,小潘老板都学Python')
    cuttest('steel健身爆发力运动兴奋补充剂')
--- a/test/test_lock.py
+++ b/test/test_lock.py
@ -0,0 +1,42 @@
 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
 import jieba
 import threading
 def inittokenizer(tokenizer, group):
 	print('===> Thread %s:%s started' % (group, threading.current_thread().ident))
 	tokenizer.initialize()
 	print('<=== Thread %s:%s finished' % (group, threading.current_thread().ident))
 tokrs1 = [jieba.Tokenizer() for n in range(5)]
 tokrs2 = [jieba.Tokenizer('../extra_dict/dict.txt.small') for n in range(5)]
 thr1 = [threading.Thread(target=inittokenizer, args=(tokr, 1)) for tokr in tokrs1]
 thr2 = [threading.Thread(target=inittokenizer, args=(tokr, 2)) for tokr in tokrs2]
 for thr in thr1:
 	thr.start()
 for thr in thr2:
 	thr.start()
 for thr in thr1:
 	thr.join()
 for thr in thr2:
 	thr.join()
 del tokrs1, tokrs2
 print('='*40)
 tokr1 = jieba.Tokenizer()
 tokr2 = jieba.Tokenizer('../extra_dict/dict.txt.small')
 thr1 = [threading.Thread(target=inittokenizer, args=(tokr1, 1)) for n in range(5)]
 thr2 = [threading.Thread(target=inittokenizer, args=(tokr2, 2)) for n in range(5)]
 for thr in thr1:
 	thr.start()
 for thr in thr2:
 	thr.start()
 for thr in thr1:
 	thr.join()
 for thr in thr2:
 	thr.join()
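
Note on test_lock.py: the second half of the script calls `initialize()` on one shared tokenizer from five threads, so initialization has to be safe to race. A common way to get that property is a lock with a re-check inside it. The sketch below shows the pattern on a hypothetical `LazyResource` class; it is not jieba's `Tokenizer` code, and `load_count` exists only so the effect can be observed.

```python
import threading


class LazyResource:
    """Hypothetical object that loads its data once, on first use."""

    def __init__(self):
        self._lock = threading.RLock()
        self._initialized = False
        self.load_count = 0

    def initialize(self):
        if self._initialized:        # fast path: already loaded
            return
        with self._lock:
            if self._initialized:    # re-check: another thread may have won
                return
            self.load_count += 1     # stands in for an expensive dictionary load
            self._initialized = True


resource = LazyResource()
threads = [threading.Thread(target=resource.initialize) for _ in range(5)]
for thr in threads:
    thr.start()
for thr in threads:
    thr.join()
```

However the five threads interleave, the load step runs once.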
--- a/test/test_paddle.py
+++ b/test/test_paddle.py
@ -0,0 +1,102 @@
 #encoding=utf-8
 import sys
 sys.path.append("../")
 import jieba
 jieba.enable_paddle()
 def cuttest(test_sent):
    result = jieba.cut(test_sent, use_paddle=True)
    print(" / ".join(result))
 if __name__ == "__main__":
    cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空，我爱北京，我爱Python和C++。")
    cuttest("我不喜欢日本和服。")
    cuttest("雷猴回归人间。")
    cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
    cuttest("我需要廉租房")
    cuttest("永和服装饰品有限公司")
    cuttest("我爱北京天安门")
    cuttest("abc")
    cuttest("隐马尔可夫")
    cuttest("雷猴是个好网站")
    cuttest("“Microsoft”一词由“MICROcomputer（微型计算机）”和“SOFTware（软件）”两部分组成")
    cuttest("草泥马和欺实马是今年的流行词汇")
    cuttest("伊藤洋华堂总府店")
    cuttest("中国科学院计算技术研究所")
    cuttest("罗密欧与朱丽叶")
    cuttest("我购买了道具和服装")
    cuttest("PS: 我觉得开源有一个好处，就是能够敦促自己不断改进，避免敞帚自珍")
    cuttest("湖北省石首市")
    cuttest("湖北省十堰市")
    cuttest("总经理完成了这件事情")
    cuttest("电脑修好了")
    cuttest("做好了这件事情就一了百了了")
    cuttest("人们审美的观点是不同的")
    cuttest("我们买了一个美的空调")
    cuttest("线程初始化时我们要注意")
    cuttest("一个分子是由好多原子组织成的")
    cuttest("祝你马到功成")
    cuttest("他掉进了无底洞里")
    cuttest("中国的首都是北京")
    cuttest("孙君意")
    cuttest("外交部发言人马朝旭")
    cuttest("领导人会议和第四届东亚峰会")
    cuttest("在过去的这五年")
    cuttest("还需要很长的路要走")
    cuttest("60周年首都阅兵")
    cuttest("你好人们审美的观点是不同的")
    cuttest("买水果然后来世博园")
    cuttest("买水果然后去世博园")
    cuttest("但是后来我才知道你是对的")
    cuttest("存在即合理")
    cuttest("的的的的的在的的的的就以和和和")
    cuttest("I love你，不以为耻，反以为rong")
    cuttest("因")
    cuttest("")
    cuttest("hello你好人们审美的观点是不同的")
    cuttest("很好但主要是基于网页形式")
    cuttest("hello你好人们审美的观点是不同的")
    cuttest("为什么我不能拥有想要的生活")
    cuttest("后来我才")
    cuttest("此次来中国是为了")
    cuttest("使用了它就可以解决一些问题")
    cuttest(",使用了它就可以解决一些问题")
    cuttest("其实使用了它就可以解决一些问题")
    cuttest("好人使用了它就可以解决一些问题")
    cuttest("是因为和国家")
    cuttest("老年搜索还支持")
    cuttest("干脆就把那部蒙人的闲法给废了拉倒！RT @laoshipukong : 27日，全国人大常委会第三次审议侵权责任法草案，删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
    cuttest("大")
    cuttest("")
    cuttest("他说的确实在理")
    cuttest("长春市长春节讲话")
    cuttest("结婚的和尚未结婚的")
    cuttest("结合成分子时")
    cuttest("旅游和服务是最好的")
    cuttest("这件事情的确是我的错")
    cuttest("供大家参考指正")
    cuttest("哈尔滨政府公布塌桥原因")
    cuttest("我在机场入口处")
    cuttest("邢永臣摄影报道")
    cuttest("BP神经网络如何训练才能在分类时增加区分度？")
    cuttest("南京市长江大桥")
    cuttest("应一些使用者的建议，也为了便于利用NiuTrans用于SMT研究")
    cuttest('长春市长春药店')
    cuttest('邓颖超生前最喜欢的衣服')
    cuttest('胡锦涛是热爱世界和平的政治局常委')
    cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
    cuttest('一次性交多少钱')
    cuttest('两块五一套，三块八一斤，四块七一本，五块六一条')
    cuttest('小和尚留了一个像大和尚一样的和尚头')
    cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
    cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
    cuttest('AT&T是一件不错的公司，给你发offer了吗？')
    cuttest('C++和c#是什么关系？11+122=133，是吗？PI=3.14159')
    cuttest('你认识那个和主席握手的的哥吗？他开一辆黑色的士。')
    cuttest('枪杆子中出政权')
    cuttest('张三风同学走上了不归路')
    cuttest('阿Q腰间挂着BB机手里拿着大哥大，说：我一般吃饭不AA制的。')
    cuttest('在1号店能买到小S和大S八卦的书，还有3D电视。')
    jieba.del_word('很赞')
    cuttest('看上去iphone8手机样式很赞,售价699美元,销量涨了5%么？')
--- a/test/test_paddle_postag.py
+++ b/test/test_paddle_postag.py
@ -0,0 +1,102 @@
 #encoding=utf-8
 import sys
 sys.path.append("../")
 import jieba.posseg as pseg
 import jieba
 jieba.enable_paddle()
 def cuttest(test_sent):
    result = pseg.cut(test_sent, use_paddle=True)
    for word, flag in result:
        print('%s %s' % (word, flag))
 if __name__ == "__main__":
    cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空，我爱北京，我爱Python和C++。")
    cuttest("我不喜欢日本和服。")
    cuttest("雷猴回归人间。")
    cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
    cuttest("我需要廉租房")
    cuttest("永和服装饰品有限公司")
    cuttest("我爱北京天安门")
    cuttest("abc")
    cuttest("隐马尔可夫")
    cuttest("雷猴是个好网站")
    cuttest("“Microsoft”一词由“MICROcomputer（微型计算机）”和“SOFTware（软件）”两部分组成")
    cuttest("草泥马和欺实马是今年的流行词汇")
    cuttest("伊藤洋华堂总府店")
    cuttest("中国科学院计算技术研究所")
    cuttest("罗密欧与朱丽叶")
    cuttest("我购买了道具和服装")
    cuttest("PS: 我觉得开源有一个好处，就是能够敦促自己不断改进，避免敞帚自珍")
    cuttest("湖北省石首市")
    cuttest("湖北省十堰市")
    cuttest("总经理完成了这件事情")
    cuttest("电脑修好了")
    cuttest("做好了这件事情就一了百了了")
    cuttest("人们审美的观点是不同的")
    cuttest("我们买了一个美的空调")
    cuttest("线程初始化时我们要注意")
    cuttest("一个分子是由好多原子组织成的")
    cuttest("祝你马到功成")
    cuttest("他掉进了无底洞里")
    cuttest("中国的首都是北京")
    cuttest("孙君意")
    cuttest("外交部发言人马朝旭")
    cuttest("领导人会议和第四届东亚峰会")
    cuttest("在过去的这五年")
    cuttest("还需要很长的路要走")
    cuttest("60周年首都阅兵")
    cuttest("你好人们审美的观点是不同的")
    cuttest("买水果然后来世博园")
    cuttest("买水果然后去世博园")
    cuttest("但是后来我才知道你是对的")
    cuttest("存在即合理")
    cuttest("的的的的的在的的的的就以和和和")
    cuttest("I love你，不以为耻，反以为rong")
    cuttest("因")
    cuttest("")
    cuttest("hello你好人们审美的观点是不同的")
    cuttest("很好但主要是基于网页形式")
    cuttest("hello你好人们审美的观点是不同的")
    cuttest("为什么我不能拥有想要的生活")
    cuttest("后来我才")
    cuttest("此次来中国是为了")
    cuttest("使用了它就可以解决一些问题")
    cuttest(",使用了它就可以解决一些问题")
    cuttest("其实使用了它就可以解决一些问题")
    cuttest("好人使用了它就可以解决一些问题")
    cuttest("是因为和国家")
    cuttest("老年搜索还支持")
    cuttest("干脆就把那部蒙人的闲法给废了拉倒！RT @laoshipukong : 27日，全国人大常委会第三次审议侵权责任法草案，删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
    cuttest("大")
    cuttest("")
    cuttest("他说的确实在理")
    cuttest("长春市长春节讲话")
    cuttest("结婚的和尚未结婚的")
    cuttest("结合成分子时")
    cuttest("旅游和服务是最好的")
    cuttest("这件事情的确是我的错")
    cuttest("供大家参考指正")
    cuttest("哈尔滨政府公布塌桥原因")
    cuttest("我在机场入口处")
    cuttest("邢永臣摄影报道")
    cuttest("BP神经网络如何训练才能在分类时增加区分度？")
    cuttest("南京市长江大桥")
    cuttest("应一些使用者的建议，也为了便于利用NiuTrans用于SMT研究")
    cuttest('长春市长春药店')
    cuttest('邓颖超生前最喜欢的衣服')
    cuttest('胡锦涛是热爱世界和平的政治局常委')
    cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
    cuttest('一次性交多少钱')
    cuttest('两块五一套，三块八一斤，四块七一本，五块六一条')
    cuttest('小和尚留了一个像大和尚一样的和尚头')
    cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
    cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
    cuttest('AT&T是一件不错的公司，给你发offer了吗？')
    cuttest('C++和c#是什么关系？11+122=133，是吗？PI=3.14159')
    cuttest('你认识那个和主席握手的的哥吗？他开一辆黑色的士。')
    cuttest('枪杆子中出政权')
    cuttest('张三风同学走上了不归路')
    cuttest('阿Q腰间挂着BB机手里拿着大哥大，说：我一般吃饭不AA制的。')
    cuttest('在1号店能买到小S和大S八卦的书，还有3D电视。')
--- a/test/test_pos.py
+++ b/test/test_pos.py
@ -6,8 +6,8 @@ import jieba.posseg as pseg
 def cuttest(test_sent):
    result = pseg.cut(test_sent)
-    for w in result:
+    for word, flag in result:
-        print(w.word, "/", w.flag, ", ", end=' ')
+        print(word, "/", flag, ", ", end=' ')
    print("")
--- a/test/test_pos_no_hmm.py
+++ b/test/test_pos_no_hmm.py
@ -6,8 +6,8 @@ import jieba.posseg as pseg
 def cuttest(test_sent):
    result = pseg.cut(test_sent, HMM=False)
-    for w in result:
+    for word, flag in result:
-        print(w.word, "/", w.flag, ", ", end=' ')  
+        print(word, "/", flag, ", ", end=' ')
    print("")
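
Note on the `for word, flag in result` change in the two POS tests: tuple unpacking in a `for` loop works on any object that is iterable and yields exactly two items. The commit list mentions making the pair object iterable. The sketch below shows a minimal class with that behavior. `Pair` here is a hypothetical illustration, not the `jieba.posseg.pair` implementation.

```python
class Pair:
    """Hypothetical word/flag container that also unpacks like a 2-tuple."""

    def __init__(self, word, flag):
        self.word = word
        self.flag = flag

    def __iter__(self):
        # Yielding exactly two items is what allows `word, flag = pair`.
        return iter((self.word, self.flag))


results = [Pair("北京", "ns"), Pair("abc", "eng")]

# Attribute access and unpacking both work on the same objects.
by_attribute = [(p.word, p.flag) for p in results]
by_unpacking = [(word, flag) for word, flag in results]
```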
--- a/test/test_userdict.py
+++ b/test/test_userdict.py
@ -43,6 +43,6 @@ testlist = [
 for sent, seg in testlist:
    print('/'.join(jieba.cut(sent, HMM=False)))
    word = ''.join(seg)
-    print('%s Before: %s, After: %s' % (word, jieba.FREQ[word], jieba.suggest_freq(seg, True)))
+    print('%s Before: %s, After: %s' % (word, jieba.get_FREQ(word), jieba.suggest_freq(seg, True)))
    print('/'.join(jieba.cut(sent, HMM=False)))
    print("-"*40)
--- a/test/test_whoosh.py
+++ b/test/test_whoosh.py
@ -6,7 +6,7 @@ from whoosh.index import create_in,open_dir
 from whoosh.fields import *
 from whoosh.qparser import QueryParser
-from jieba.analyse import ChineseAnalyzer
+from jieba.analyse.analyzer import ChineseAnalyzer
 analyzer = ChineseAnalyzer()
--- a/test/userdict.txt
+++ b/test/userdict.txt
@ -6,3 +6,5 @@ easy_install 3 eng
 韩玉赏鉴 3 nz
 八一双鹿 3 nz
 台中
 凱特琳 nz
 Edu Trust认证 2000
Author	SHA1	Message	Date
Neutrino	67fa2e36e7	Update README.md update paddle link. (#817 )	2020-02-15 16:33:35 +08:00
fxsjy	1e20c89b66	fix setup.py in python2.7	2020-01-20 22:22:34 +08:00
fxsjy	5704e23bbf	update version: 0.42	2020-01-13 21:24:45 +08:00
fxsjy	aa65031788	fix file mode	2020-01-13 21:03:38 +08:00
fxsjy	2eb11c8028	fix issue #810	2020-01-13 20:53:43 +08:00
JesseyXujin	d703bce302	paddle coredump exception fix (#807 ) * paddle_null_point_fix * add core expception note * delete yield * modify test paddle for supporting enable_paddle()	2020-01-10 16:30:46 +08:00
vissssa	dc2b788eb3	refactor: improvement check_paddle_installed (#806 )	2020-01-09 19:23:11 +08:00
fxsjy	0868c323d9	update version in __init__.py	2020-01-08 16:21:07 +08:00
fxsjy	eb37e048da	update version to 0.41	2020-01-08 16:04:30 +08:00
JesseyXujin	381b0691ac	Add enable_paddle interface to install paddle and import packages (#802 ) * enable_paddle_interface * Add enable_paddle interface to install paddle and import packages * Add enable_paddle interface to install paddle and import packages * add posseg lcut for paddle mode * fix vocabulary	2020-01-08 15:26:12 +08:00
fxsjy	97c32464e1	fix issue #798	2020-01-03 14:10:48 +08:00
Tim Gates	0489a6979e	Fix simple typo: vocabuary -> vocabulary (#797 ) Closes #796	2020-01-02 10:26:00 +08:00
JesseyXujin	30ea8f929e	Simplify Paddle import check (#795 )	2019-12-31 15:03:14 +08:00
JesseyXujin	0b74b6c2de	add jieba upgrade not in README.md and change import imp to import importlib in _compat.py (#794 )	2019-12-31 14:14:50 +08:00
Sun Junyi	2fdee89883	Update README.md	2019-12-30 17:11:22 +08:00
JesseyXujin	17bab6a2d1	Change the error reporting of the paddle version check (#790 )	2019-12-25 19:46:49 +08:00
Sun Junyi	80947ff843	Update Changelog	2019-12-25 10:49:02 +08:00
fxsjy	68ce6955b7	update version to 0.40	2019-12-25 10:35:22 +08:00
fxsjy	d47e14e5b3	update version	2019-12-25 10:34:18 +08:00
pkpk	27910094ac	Fix bugs in Paddle seg and Paddle postag (#789 ) * fix bugs in paddle seg and paddle postag * fix compat in checking paddle	2019-12-24 21:02:55 +08:00
Sun Junyi	9dc8e6d992	Update README.md	2019-12-24 19:29:17 +08:00
fxsjy	478c3b9bb4	lazy import paddle	2019-12-24 19:19:51 +08:00
JesseyXujin	5b3bb4b7f2	Add paddle-based segmentation and POS tagging (#788 ) * paddle cut release * update README.md to prompt users to install paddlepaddle.tiny * remove the utf header from the two init.py files * revise readme details	2019-12-24 17:27:41 +08:00
Hongxiang Lin	38134ee20f	Fix the add_word reference bug in suggest_freq (#723 )	2019-07-01 19:43:45 +08:00
Paul Meng	3645a5bb5d	Update README.md (#745 )	2019-07-01 19:41:47 +08:00
Sun Junyi	8212b6c572	Update README.md	2018-12-03 16:29:32 +08:00
Sun Junyi	843cdc2b7c	Merge pull request #582 from hosiet/pr-fix-typo-codespell Fix typos found by codespell	2018-09-20 10:44:47 +08:00
Sun Junyi	68f2a64f7e	Merge pull request #663 from JimCurryWang/patch-1 Fix __init__ "-" symbol issue	2018-09-20 10:40:35 +08:00
Sun Junyi	4c8479cfa6	Merge pull request #667 from ZhengZixiang/patch-1 fix the error about importing ChineseAnalyzer	2018-09-20 10:39:29 +08:00
imzhengzx	ca444fb4da	fix the error about imoprting ChineseAnalyzer Because of the interface change about ChineseAnlayzer , the code 'from jieba.analyse import Chinese Analyzer' in this test file would report an ImportError like 'cannot import name 'ChineseAnalyzer'. Just change import code to 'from jieba.analyse.analyzer import ChineseAnalyzer' can fix it.	2018-09-15 11:59:01 +08:00
CY Wang	36a27302ce	Fix __init__ "-" symbol issue Solving "-" symbol can't be analyze issue . For example, In keyword , chap-EX喬沛詩 , SK-II ...etc the present version will show "chap", "-", "EX喬沛詩" , "SK", "-", "II" After the modify, The new version will show "chap-EX","喬沛詩" , "SK-II" ps: I have used the jieba.load_userdict() , and added "chap-EX" , "喬沛詩", "SK-II" in the userdict.txt.	2018-08-27 17:05:46 +08:00
Sun Junyi	7653db2e33	Update README.md	2018-07-04 17:18:02 +08:00
Boyuan Yang	17ef8abba3	Fix typos found by codespell	2018-01-21 19:15:48 +08:00
fxsjy	cb0de2973b	version change 0.39	2017-08-28 21:40:18 +08:00
sunjunyi01	b4dd5b58f3	bug fix, issue: #511 , #512	2017-08-28 21:10:50 +08:00
Sun Junyi	4eef868338	Merge pull request #455 from OOCZC/master Update README.md	2017-04-06 15:22:01 +08:00
OOC	b485ae916c	Update README.md	2017-04-04 11:45:53 +08:00
OOC	ee0ce32bbd	Update	2017-04-04 11:17:44 +08:00
Sun Junyi	8ba26cf97e	Merge pull request #382 from huntzhan/master Bugfix for HMM=False in parallelism.	2016-08-05 10:02:41 +08:00
huntzhan	60acefd9b1	Bugfix for HMM=False in parallelism.	2016-08-04 17:43:35 +08:00
Sun Junyi	03cd4b5fb6	Merge pull request #367 from yanyiwu/patch-1 Update README.md	2016-06-12 09:37:16 +08:00
Yanyi Wu	76ae798137	Update README.md	2016-06-10 22:48:01 +08:00
Sun Junyi	0243d568e9	Merge pull request #351 from gumblex/master fix del_word	2016-03-16 10:22:34 +08:00
Dingyuan Wang	12b2b17741	fix del_word	2016-03-15 18:58:12 +08:00
fxsjy	1d5ea9f061	version change 0.38	2015-12-16 16:12:49 +08:00
Sun Junyi	e5c9af78e2	Merge pull request #315 from gumblex/master Support POS tagging in command-line segmentation	2015-11-17 19:13:36 +08:00
Dingyuan Wang	87734d3785	support POS tagging in __main__	2015-11-17 19:06:44 +08:00
Sun Junyi	3d29b0c8e8	Merge pull request #310 from gumblex/master Fix compatibility problem with `with` statememt	2015-11-13 14:22:50 +08:00
Dingyuan Wang	1fcd3a417c	fix compatibility problem with `with` statememt	2015-11-13 13:16:19 +08:00
Sun Junyi	093980647b	Merge pull request #303 from jerryday/master add a withFlag param to extract_tags	2015-11-13 10:19:53 +08:00
Sun Junyi	f73a2183a5	Merge pull request #309 from gumblex/master Load the default dictionary with pkg_resources	2015-11-13 10:18:50 +08:00
Dingyuan Wang	8814e08f9b	load default dictionary from pkg_resources and improve the loading method; change the serialized models from marshal to pickle	2015-11-12 20:18:09 +08:00
Sun Junyi	70f019b669	Merge pull request #307 from gumblex/master Extend the Chinese character range; fix load_userdict	2015-11-09 22:12:23 +08:00
Dingyuan Wang	5270ed66ff	fix typo in type detection in load_userdict	2015-11-09 21:37:29 +08:00
Dingyuan Wang	99d0fb1a8a	use regex and fix encoding related issues in load_userdict	2015-11-09 20:54:50 +08:00
Dingyuan Wang	1c33252fce	change the recognized Chinese character range to [\u4E00-\u9FD5]	2015-11-09 20:23:43 +08:00
jerryday	e5e41a4aad	fix pair object in dict problem	2015-10-30 16:38:50 +08:00
jerryday	4f8ca83661	add a withFlag param in textrank	2015-10-30 15:40:41 +08:00
jerryday	26e339f8f7	add a withFlag param to extract_tags	2015-10-30 11:09:24 +08:00
Sun Junyi	b6f1ce773e	Merge pull request #298 from anderscui/master Add introduction to jieba.NET port.	2015-09-23 06:54:56 +08:00
andersc	343bfe9783	Add introduction to jieba.NET port.	2015-09-22 23:16:02 +08:00
fxsjy	cb414cb861	version update	2015-06-27 16:49:44 +08:00
Sun Junyi	8e99a13aa9	Merge pull request #275 from gumblex/master Prevent creating the cache across filesystems	2015-06-26 23:22:42 +08:00
Dingyuan Wang	d0e68974bf	improved doc for tmp_dir and cache_file	2015-06-26 22:24:20 +08:00
Dingyuan Wang	66fe17517d	prevent moving across different filesystems at tempfile.mkstemp	2015-06-26 22:12:39 +08:00
Dingyuan Wang	be46ddef9a	use shutil.move for all platforms in case of different filesystems	2015-06-26 21:52:53 +08:00
Sun Junyi	17652e764f	Merge pull request #271 from gumblex/master Fix cut_for_search; improve the pair object	2015-06-01 18:40:31 +08:00
Dingyuan Wang	ceb5c26be4	fix self.FREQ in cut_for_search; make pair object iterable	2015-06-01 14:36:38 +08:00
Sun Junyi	9f4d9376b0	Merge pull request #269 from gumblex/master Allow custom dictionaries to specify a POS tag while omitting the word frequency	2015-05-24 19:56:51 +08:00
Dingyuan Wang	3b76328f2a	allow ignoring word frequency while providing pos tag	2015-05-23 21:51:00 +08:00
Sun Junyi	3ec4c43788	Merge pull request #260 from gumblex/master Wrap global functions in classes	2015-05-11 10:24:49 +08:00
Dingyuan Wang	94840a734c	wraps most globals in classes API changes: * class jieba.Tokenizer, jieba.posseg.POSTokenizer * class jieba.analyse.TFIDF, jieba.analyse.TextRank * global functions are mapped to jieba.(posseg.)dt, the default (POS)Tokenizer * multiprocessing only works with jieba.(posseg.)dt * new lcut, lcut_for_search functions that returns a list * jieba.analyse.textrank now returns 20 items by default Tests: * added test_lock.py to test multithread locking * demo.py now contains most of the examples in README	2015-05-09 21:29:05 +08:00
Sun Junyi	e359d08964	Merge pull request #257 from gip0/gip0-patch-1 fixed an error in load_userdict()	2015-05-02 17:27:16 +08:00
Gilbert Liu	f6e57ab2ae	fixed an error in load_userdict()	2015-05-01 12:52:28 -07:00
Sun Junyi	60f0028175	Merge pull request #252 from fukuball/master Update README	2015-04-28 22:42:40 +08:00
Fukuball Lin	e712a4de61	Update README: add information about the PHP version of jieba	2015-04-28 22:05:26 +08:00
fxsjy	29d2b838dc	a minor version on pypi, which removes *.pyc	2015-04-17 19:35:12 +08:00
fxsjy	c07b7fef54	hot-fix version for pull request #248	2015-04-10 18:54:51 +08:00
Sun Junyi	753c1be49c	Merge pull request #248 from wangbin/master exlucde word fragments from FREQ in posseg.cut	2015-04-02 15:32:41 +08:00
Wang Bin	84ffa0d4bf	exlucde word fragments from FREQ	2015-04-02 11:06:55 +08:00
+	a-B
+	a-I
+	ad-B
+	ad-I
+	an-B
+	an-I
+	c-B
+	c-I
+	d-B
+	d-I
+	f-B
+	f-I
+	m-B
+	m-I
+	n-B
+	n-I
+	nr-B
+	nr-I
+	ns-B
+	ns-I
+	nt-B
+	nt-I
+	nw-B
+	nw-I
+	nz-B
+	nz-I
+	p-B
+	p-I
+	q-B
+	q-I
+	r-B
+	r-I
+	s-B
+	s-I
+	t-B
+	t-I
+	u-B
+	u-I
+	v-B
+	v-I
+	vd-B
+	vd-I
+	vn-B
+	vn-I
+	w-B
+	w-I
+	xc-B
+	xc-I
+	PER-B
+	PER-I
+	LOC-B
+	LOC-I
+	ORG-B
+	ORG-I
+	TIME-B
+	TIME-I
+	O
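
Note on the label list above: each label pairs a tag (`n`, `ns`, `PER`, ...) with a position marker, where `B` conventionally begins a word and `I` continues it, and `O` marks a character outside any tagged span. A sequence-labeling model predicts one such label per character, and words are recovered by merging each `B` with the `I` labels that follow it. The sketch below shows that merge step under those conventional assumptions. `merge_tags` is a hypothetical helper, the sample labels are hand-written, and the decoding code in `lac_small` is not shown in this diff and may differ.

```python
def merge_tags(chars, labels):
    """Group characters into (word, tag) pairs from per-character labels."""
    words = []
    for ch, label in zip(chars, labels):
        # "ns-I" -> tag "ns", position "I"; a bare "O" -> tag "", position "O".
        tag, _, position = label.rpartition("-")
        if position == "I" and words and words[-1][1] == tag:
            # Continue the word that is currently open with the same tag.
            words[-1] = (words[-1][0] + ch, tag)
        else:
            # "B", "O", or a stray "I" all start a new unit.
            words.append((ch, tag if tag else label))
    return words


# Hand-written labels used only as sample input (not model output).
merged = merge_tags("我爱北京", ["r-B", "v-B", "ns-B", "ns-I"])
with_outside = merge_tags("a!", ["n-B", "O"])
```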