Compare commits

...

33 Commits

Author SHA1 Message Date
Neutrino
67fa2e36e7
Update README.md update paddle link. (#817) 2020-02-15 16:33:35 +08:00
fxsjy
1e20c89b66 fix setup.py in python2.7 2020-01-20 22:22:34 +08:00
fxsjy
5704e23bbf update version: 0.42 2020-01-13 21:24:45 +08:00
fxsjy
aa65031788 fix file mode 2020-01-13 21:03:38 +08:00
fxsjy
2eb11c8028 fix issue #810 2020-01-13 20:53:43 +08:00
JesseyXujin
d703bce302 paddle coredump exception fix (#807)
* paddle_null_point_fix

* add core exception note

* delete yield

* modify test paddle for supporting enable_paddle()
2020-01-10 16:30:46 +08:00
vissssa
dc2b788eb3 refactor: improvement check_paddle_installed (#806) 2020-01-09 19:23:11 +08:00
fxsjy
0868c323d9 update version in __init__.py 2020-01-08 16:21:07 +08:00
fxsjy
eb37e048da update version to 0.41 2020-01-08 16:04:30 +08:00
JesseyXujin
381b0691ac Add enable_paddle interface to install paddle and import packages (#802)
* enable_paddle_interface

* Add enable_paddle interface to install paddle and import packages

* Add enable_paddle interface to install paddle and import packages

* add posseg lcut for paddle mode

* fix vocabulary
2020-01-08 15:26:12 +08:00
fxsjy
97c32464e1 fix issue #798 2020-01-03 14:10:48 +08:00
Tim Gates
0489a6979e Fix simple typo: vocabuary -> vocabulary (#797)
Closes #796
2020-01-02 10:26:00 +08:00
JesseyXujin
30ea8f929e Simplify Paddle import check (#795) 2019-12-31 15:03:14 +08:00
JesseyXujin
0b74b6c2de add jieba upgrade note in README.md and change import imp to import importlib in _compat.py (#794) 2019-12-31 14:14:50 +08:00
Sun Junyi
2fdee89883
Update README.md 2019-12-30 17:11:22 +08:00
JesseyXujin
17bab6a2d1 Improve the error reporting of the paddle version check (#790) 2019-12-25 19:46:49 +08:00
Sun Junyi
80947ff843
Update Changelog 2019-12-25 10:49:02 +08:00
fxsjy
68ce6955b7 update version to 0.40 2019-12-25 10:35:22 +08:00
fxsjy
d47e14e5b3 update version 2019-12-25 10:34:18 +08:00
pkpk
27910094ac Fix bugs in Paddle seg and Paddle postag (#789)
* fix bugs in paddle seg and paddle postag

* fix compat in checking paddle
2019-12-24 21:02:55 +08:00
Sun Junyi
9dc8e6d992
Update README.md 2019-12-24 19:29:17 +08:00
fxsjy
478c3b9bb4 lazy import paddle 2019-12-24 19:19:51 +08:00
JesseyXujin
5b3bb4b7f2 Add paddle-based segmentation and POS tagging (#788)
* paddle cut release

* Update README.md to tell users to install paddlepaddle-tiny

* Remove the UTF coding header from the two __init__.py files

* Polish README details
2019-12-24 17:27:41 +08:00
Hongxiang Lin
38134ee20f Fix the bug in suggest_freq where add_word pointed to the global tokenizer (#723) 2019-07-01 19:43:45 +08:00
Paul Meng
3645a5bb5d Update README.md (#745) 2019-07-01 19:41:47 +08:00
Sun Junyi
8212b6c572
Update README.md 2018-12-03 16:29:32 +08:00
Sun Junyi
843cdc2b7c
Merge pull request #582 from hosiet/pr-fix-typo-codespell
Fix typos found by codespell
2018-09-20 10:44:47 +08:00
Sun Junyi
68f2a64f7e
Merge pull request #663 from JimCurryWang/patch-1
Fix  __init__ "-" symbol issue
2018-09-20 10:40:35 +08:00
Sun Junyi
4c8479cfa6
Merge pull request #667 from ZhengZixiang/patch-1
fix the error about importing ChineseAnalyzer
2018-09-20 10:39:29 +08:00
imzhengzx
ca444fb4da
fix the error about importing ChineseAnalyzer
Because of the interface change to ChineseAnalyzer, the line 'from jieba.analyse import ChineseAnalyzer' in this test file raises an ImportError like "cannot import name 'ChineseAnalyzer'". Changing the import to 'from jieba.analyse.analyzer import ChineseAnalyzer' fixes it.
2018-09-15 11:59:01 +08:00
CY Wang
36a27302ce
Fix __init__ "-" symbol issue
Words containing the "-" symbol could not be analyzed.

For example, for keywords such as chap-EX喬沛詩 and SK-II,
the current version yields "chap", "-", "EX喬沛詩", "SK", "-", "II".

After this change, the new version yields "chap-EX", "喬沛詩", "SK-II".

PS: I used jieba.load_userdict() and added "chap-EX", "喬沛詩", and "SK-II" to userdict.txt.
2018-08-27 17:05:46 +08:00
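A minimal sketch of the workflow described above (the dictionary file name and its entries are illustrative, taken from the commit message):

```python
import jieba

# userdict.txt (illustrative) lists one entry per line:
#   chap-EX
#   喬沛詩
#   SK-II
jieba.load_userdict("userdict.txt")

print(jieba.lcut("chap-EX喬沛詩"))  # expected: ['chap-EX', '喬沛詩']
print(jieba.lcut("SK-II"))          # expected: ['SK-II']
```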
Sun Junyi
7653db2e33
Update README.md 2018-07-04 17:18:02 +08:00
Boyuan Yang
17ef8abba3
Fix typos found by codespell 2018-01-21 19:15:48 +08:00
38 changed files with 21882 additions and 53 deletions


@ -1,3 +1,21 @@
2020-1-20: version 0.42.1
1. Fixed setup.py not working under Python 2.7 (issue #809)
2020-1-13: version 0.42
1. Fixed a paddle-mode coredump on empty strings @JesseyXujin
2. Fixed dropped characters in cut_all mode @fxsjy
3. Improved the paddle installation check @vissssa
2020-1-8: version 0.41
1. Made enabling paddle mode friendlier
2. Fixed cut_all mode not supporting mixed Chinese-English words
2019-12-25: version 0.40
1. Added a paddle-based deep-learning segmentation mode (use_paddle=True); by @JesseyXujin, @xyzhou-puck
2. Fixed the add_word method of custom Tokenizer instances pointing to the global tokenizer; by @linhx13
3. Fixed the broken import in the whoosh test case; by @ZhengZixiang
4. Fixed custom dictionaries not supporting words containing the "-" symbol; by @JimCurryWang
2017-08-28: version 0.39
1. del_word can now force a word to be split apart; by @gumblex, @fxsjy
2. Fixed segmentation of percentages; by @fxsjy


@ -9,24 +9,15 @@ jieba
Features
========
* The following segmentation modes are supported:
    * Accurate mode: tries to cut the sentence into the most precise segmentation; suitable for text analysis;
    * Full mode: scans out every word in the sentence that can form a word; very fast, but cannot resolve ambiguity;
    * Search-engine mode: on top of accurate mode, long words are cut again to improve recall; suitable for search-engine tokenization.
    * Paddle mode: segments with a bidirectional GRU sequence-labeling model trained on the PaddlePaddle deep-learning framework, and also supports POS tagging. Paddle mode requires paddlepaddle-tiny: `pip install paddlepaddle-tiny==1.6.1`. It is available in jieba v0.40 and above; earlier versions should upgrade with `pip install jieba --upgrade`. [PaddlePaddle website](https://www.paddlepaddle.org.cn/)
* Traditional Chinese segmentation is supported
* Custom dictionaries are supported
* MIT license
Online demo
=========
http://jiebademo.ap01.aws.af.cm/
(Powered by Appfog)
Site code: https://github.com/fxsjy/jiebademo
Installation
=======
@ -36,6 +27,7 @@ http://jiebademo.ap01.aws.af.cm/
* Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ , extract it, and run `python setup.py install`
* Manual installation: place the jieba directory in the current directory or in site-packages
* Use `import jieba` to load it
* To use paddle-mode segmentation and POS tagging, first install paddlepaddle-tiny: `pip install paddlepaddle-tiny==1.6.1`
Algorithm
========
@ -47,7 +39,7 @@ http://jiebademo.ap01.aws.af.cm/
=======
1. Word segmentation
--------
* The `jieba.cut` method takes four arguments: the string to be segmented; `cut_all`, controlling whether full mode is used; `HMM`, controlling whether the HMM model is used; and `use_paddle`, controlling whether paddle-mode segmentation is used. Paddle mode is loaded lazily: `enable_paddle()` installs paddlepaddle-tiny and imports the related code
* The `jieba.cut_for_search` method takes two arguments: the string to be segmented and whether to use the HMM model. It is suited to building inverted indexes for search engines, with finer granularity
* The string to be segmented can be a unicode, UTF-8, or GBK string. Note: feeding GBK strings directly is not recommended, as they may be unpredictably mis-decoded as UTF-8
* `jieba.cut` and `jieba.cut_for_search` both return an iterable generator; use a for loop to obtain each word (unicode), or use
@ -60,6 +52,12 @@ http://jiebademo.ap01.aws.af.cm/
# encoding=utf-8
import jieba

jieba.enable_paddle()  # enable paddle mode (supported since v0.40; not available in earlier versions)
strs=["我来到北京清华大学","乒乓球拍卖完了","中国科学技术大学"]
for str in strs:
    seg_list = jieba.cut(str, use_paddle=True)  # use paddle mode
    print("Paddle Mode: " + '/'.join(list(seg_list)))

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode
@ -195,11 +193,15 @@ https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
-----------
* `jieba.posseg.POSTokenizer(tokenizer=None)` creates a custom POS tokenizer; the `tokenizer` argument specifies the internally used `jieba.Tokenizer`. `jieba.posseg.dt` is the default POS tokenizer.
* Tags the part of speech of every word after segmentation, using labels compatible with ictclas.
* Besides jieba's default mode, POS tagging is also available in paddle mode. Paddle mode is loaded lazily: `enable_paddle()` installs paddlepaddle-tiny and imports the related code.
* Usage example
```pycon
>>> import jieba
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")  # jieba default mode
>>> jieba.enable_paddle()  # enable paddle mode (supported since v0.40; not available in earlier versions)
>>> words = pseg.cut("我爱北京天安门", use_paddle=True)  # paddle mode
>>> for word, flag in words:
...    print('%s %s' % (word, flag))
...
@ -209,6 +211,21 @@ https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
天安门 ns
```
The paddle-mode POS tag table is as follows:
The set of POS and named-entity labels used in paddle mode is listed below; there are 24 POS tags (lowercase) and 4 named-entity tags (uppercase).
| Tag | Meaning | Tag | Meaning | Tag | Meaning | Tag | Meaning |
| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- |
| n | common noun | f | locality noun | s | place noun | t | time |
| nr | person name | ns | place name | nt | organization name | nw | work title |
| nz | other proper noun | v | common verb | vd | verb-adverb | vn | verbal noun |
| a | adjective | ad | adverbial adjective | an | nominal adjective | d | adverb |
| m | numeral | q | measure word | r | pronoun | p | preposition |
| c | conjunction | u | particle | xc | other function word | w | punctuation |
| PER | person | LOC | location | ORG | organization | TIME | time |
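Because the uppercase labels mark named entities, paddle-mode POS output can double as a lightweight entity filter. A minimal sketch, assuming paddlepaddle-tiny is installed and `enable_paddle()` succeeds:

```python
import jieba
import jieba.posseg as pseg

jieba.enable_paddle()
entities = [(word, flag) for word, flag in pseg.cut("我爱北京天安门", use_paddle=True)
            if flag in ("PER", "LOC", "ORG", "TIME")]
print(entities)  # e.g. [('北京', 'LOC'), ('天安门', 'LOC')]
```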
5. Parallel segmentation
-----------
* How it works: split the target text by line, hand the lines to multiple Python processes to segment in parallel, then merge the results, giving a considerable speed-up
@ -362,6 +379,11 @@ https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Author: yanyiwu
Repo: https://github.com/yanyiwu/cppjieba
Jieba: Rust version
----------------
Authors: messense, MnO2
Repo: https://github.com/messense/jieba-rs
Jieba: Node.js version
----------------
Author: yanyiwu
@ -398,6 +420,17 @@ https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
+ Author: wangbin  Repo: https://github.com/wangbin/jiebago
+ Author: yanyiwu  Repo: https://github.com/yanyiwu/gojieba
Jieba: Android version
------------------
+ Author: Dongliang.W  Repo: https://github.com/452896915/jieba-android
Friendly links
=========
* https://github.com/baidu/lac — Baidu Chinese lexical analysis (segmentation + POS tagging + NER)
* https://github.com/baidu/AnyQ — Baidu FAQ automatic question answering system
* https://github.com/baidu/Senta — Baidu sentiment analysis system
System integration
========
1. Solr: https://github.com/sing1ee/jieba-solr


@ -1,19 +1,18 @@
from __future__ import absolute_import, unicode_literals
__version__ = '0.39'
__version__ = '0.42.1'
__license__ = 'MIT'
import re
import os
import sys
import time
import logging
import marshal
import re
import tempfile
import threading
import time
from hashlib import md5
from math import log
from . import finalseg
from ._compat import *

if os.name == 'nt':
    from shutil import move as _replace_file
@ -40,15 +39,17 @@ re_eng = re.compile('[a-zA-Z0-9]', re.U)
# \u4E00-\u9FD5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
# \r\n|\s : whitespace characters. Will not be handled.
# re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
# Adding "-" symbol in re_han_default
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)
re_skip_default = re.compile("(\r\n|\s)", re.U)
re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)
re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)
def setLogLevel(log_level):
    global logger
    default_logger.setLevel(log_level)


class Tokenizer(object):
    def __init__(self, dictionary=DEFAULT_DICT):
@ -67,7 +68,8 @@ class Tokenizer(object):
    def __repr__(self):
        return '<Tokenizer dictionary=%r>' % self.dictionary

    @staticmethod
    def gen_pfdict(f):
        lfreq = {}
        ltotal = 0
        f_name = resolve_filename(f)
@ -161,7 +163,7 @@ class Tokenizer(object):
            self.initialized = True
            default_logger.debug(
                "Loading model cost %.3f seconds." % (time.time() - t1))
            default_logger.debug("Prefix dict has been built successfully.")

    def check_initialized(self):
        if not self.initialized:
@ -196,15 +198,30 @@ class Tokenizer(object):
    def __cut_all(self, sentence):
        dag = self.get_DAG(sentence)
        old_j = -1
        eng_scan = 0
        eng_buf = u''
        for k, L in iteritems(dag):
            if eng_scan == 1 and not re_eng.match(sentence[k]):
                eng_scan = 0
                yield eng_buf
            if len(L) == 1 and k > old_j:
                word = sentence[k:L[0] + 1]
                if re_eng.match(word):
                    if eng_scan == 0:
                        eng_scan = 1
                        eng_buf = word
                    else:
                        eng_buf += word
                if eng_scan == 0:
                    yield word
                old_j = L[0]
            else:
                for j in L:
                    if j > k:
                        yield sentence[k:j + 1]
                        old_j = j
        if eng_scan == 1:
            yield eng_buf

    def __cut_DAG_NO_HMM(self, sentence):
        DAG = self.get_DAG(sentence)
@ -269,22 +286,29 @@ class Tokenizer(object):
                for elem in buf:
                    yield elem

    def cut(self, sentence, cut_all=False, HMM=True, use_paddle=False):
        """
        The main function that segments an entire sentence that contains
        Chinese characters into separated words.

        Parameter:
            - sentence: The str(unicode) to be segmented.
            - cut_all: Model type. True for full pattern, False for accurate pattern.
            - HMM: Whether to use the Hidden Markov Model.
        """
        is_paddle_installed = check_paddle_install['is_paddle_installed']
        sentence = strdecode(sentence)
        if use_paddle and is_paddle_installed:
            # if sentence is null, it will raise core exception in paddle.
            if sentence is None or len(sentence) == 0:
                return
            import jieba.lac_small.predict as predict
            results = predict.get_sent(sentence)
            for sent in results:
                if sent is None:
                    continue
                yield sent
            return
        re_han = re_han_default
        re_skip = re_skip_default
        if cut_all:
@ -446,7 +470,7 @@ class Tokenizer(object):
                freq *= self.FREQ.get(seg, 1) / ftotal
        freq = min(int(freq * self.total), self.FREQ.get(word, 0))
        if tune:
            self.add_word(word, freq)
        return freq

    def tokenize(self, unicode_sentence, mode="default", HMM=True):
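The `__cut_all` rewrite above buffers consecutive ASCII letters and digits (`eng_scan`/`eng_buf`) so that full mode keeps an embedded English run such as "abc" together instead of dropping or splitting it. A quick check, as a sketch assuming jieba >= 0.42 is installed:

```python
import jieba

# The English run embedded in the Chinese text should come back as one token in full mode.
print(jieba.lcut("我来到北京abc清华大学", cut_all=True))
```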


@ -1,15 +1,56 @@
# -*- coding: utf-8 -*-
import logging
import os
import sys

log_console = logging.StreamHandler(sys.stderr)
default_logger = logging.getLogger(__name__)
default_logger.setLevel(logging.DEBUG)


def setLogLevel(log_level):
    default_logger.setLevel(log_level)

check_paddle_install = {'is_paddle_installed': False}
try:
    import pkg_resources
    get_module_res = lambda *res: pkg_resources.resource_stream(__name__,
                                                                os.path.join(*res))
except ImportError:
    get_module_res = lambda *res: open(os.path.normpath(os.path.join(
        os.getcwd(), os.path.dirname(__file__), *res)), 'rb')
def enable_paddle():
    try:
        import paddle
    except ImportError:
        default_logger.debug("Installing paddle-tiny, please wait a minute......")
        os.system("pip install paddlepaddle-tiny")
        try:
            import paddle
        except ImportError:
            default_logger.debug(
                "Import paddle error, please use command to install: pip install paddlepaddle-tiny==1.6.1."
                "Now, back to jieba basic cut......")
    if paddle.__version__ < '1.6.1':
        default_logger.debug("Find your own paddle version doesn't satisfy the minimum requirement (1.6.1), "
                             "please install paddle tiny by 'pip install --upgrade paddlepaddle-tiny', "
                             "or upgrade paddle full version by "
                             "'pip install --upgrade paddlepaddle (-gpu for GPU version)' ")
    else:
        try:
            import jieba.lac_small.predict as predict
            default_logger.debug("Paddle enabled successfully......")
            check_paddle_install['is_paddle_installed'] = True
        except ImportError:
            default_logger.debug("Import error, cannot find paddle.fluid and jieba.lac_small.predict module. "
                                 "Now, back to jieba basic cut......")
PY2 = sys.version_info[0] == 2
default_encoding = sys.getfilesystemencoding()
@ -31,6 +72,7 @@ else:
    itervalues = lambda d: iter(d.values())
    iteritems = lambda d: iter(d.items())


def strdecode(sentence):
    if not isinstance(sentence, text_type):
        try:
@ -39,6 +81,7 @@ def strdecode(sentence):
            sentence = sentence.decode('gbk', 'ignore')
    return sentence


def resolve_filename(f):
    try:
        return f.name
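`enable_paddle()` installs paddlepaddle-tiny on demand and flips `check_paddle_install['is_paddle_installed']`, which the cut functions consult. Typical use, as a minimal sketch mirroring the README example:

```python
# encoding=utf-8
import jieba

jieba.enable_paddle()  # installs/imports paddlepaddle-tiny on first use
for sent in ["我来到北京清华大学", "乒乓球拍卖完了"]:
    print("Paddle Mode: " + "/".join(jieba.cut(sent, use_paddle=True)))
```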


@ -0,0 +1,46 @@
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Define the function to create lexical analysis model and model's data reader
"""
import sys
import os
import math
import paddle
import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
import jieba.lac_small.nets as nets
def create_model(vocab_size, num_labels, mode='train'):
"""create lac model"""
# model's input data
words = fluid.data(name='words', shape=[-1, 1], dtype='int64', lod_level=1)
targets = fluid.data(
name='targets', shape=[-1, 1], dtype='int64', lod_level=1)
# for inference process
if mode == 'infer':
crf_decode = nets.lex_net(
words, vocab_size, num_labels, for_infer=True, target=None)
return {
"feed_list": [words],
"words": words,
"crf_decode": crf_decode,
}
return ret

20 binary files not shown.

jieba/lac_small/nets.py Normal file

@ -0,0 +1,122 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The function lex_net(args) define the lexical analysis network structure
"""
import sys
import os
import math
import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
def lex_net(word, vocab_size, num_labels, for_infer=True, target=None):
"""
define the lexical analysis network structure
word: stores the input of the model
for_infer: a boolean value, indicating if the model to be created is for training or predicting.
return:
for infer: return the prediction
otherwise: return the prediction
"""
word_emb_dim=128
grnn_hidden_dim=128
bigru_num=2
emb_lr = 1.0
crf_lr = 1.0
init_bound = 0.1
IS_SPARSE = True
def _bigru_layer(input_feature):
"""
define the bidirectional gru layer
"""
pre_gru = fluid.layers.fc(
input=input_feature,
size=grnn_hidden_dim * 3,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
gru = fluid.layers.dynamic_gru(
input=pre_gru,
size=grnn_hidden_dim,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
pre_gru_r = fluid.layers.fc(
input=input_feature,
size=grnn_hidden_dim * 3,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
gru_r = fluid.layers.dynamic_gru(
input=pre_gru_r,
size=grnn_hidden_dim,
is_reverse=True,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
bi_merge = fluid.layers.concat(input=[gru, gru_r], axis=1)
return bi_merge
def _net_conf(word, target=None):
"""
Configure the network
"""
word_embedding = fluid.embedding(
input=word,
size=[vocab_size, word_emb_dim],
dtype='float32',
is_sparse=IS_SPARSE,
param_attr=fluid.ParamAttr(
learning_rate=emb_lr,
name="word_emb",
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound)))
input_feature = word_embedding
for i in range(bigru_num):
bigru_output = _bigru_layer(input_feature)
input_feature = bigru_output
emission = fluid.layers.fc(
size=num_labels,
input=bigru_output,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
size = emission.shape[1]
fluid.layers.create_parameter(
shape=[size + 2, size], dtype=emission.dtype, name='crfw')
crf_decode = fluid.layers.crf_decoding(
input=emission, param_attr=fluid.ParamAttr(name='crfw'))
return crf_decode
return _net_conf(word)


@ -0,0 +1,82 @@
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import time
import sys
import paddle.fluid as fluid
import paddle
import jieba.lac_small.utils as utils
import jieba.lac_small.creator as creator
import jieba.lac_small.reader_small as reader_small
import numpy
word_emb_dim=128
grnn_hidden_dim=128
bigru_num=2
use_cuda=False
basepath = os.path.abspath(__file__)
folder = os.path.dirname(basepath)
init_checkpoint = os.path.join(folder, "model_baseline")
batch_size=1
dataset = reader_small.Dataset()
infer_program = fluid.Program()
with fluid.program_guard(infer_program, fluid.default_startup_program()):
with fluid.unique_name.guard():
infer_ret = creator.create_model(dataset.vocab_size, dataset.num_labels, mode='infer')
infer_program = infer_program.clone(for_test=True)
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
utils.init_checkpoint(exe, init_checkpoint, infer_program)
results = []
def get_sent(str1):
feed_data=dataset.get_vars(str1)
a = numpy.array(feed_data).astype(numpy.int64)
a=a.reshape(-1,1)
c = fluid.create_lod_tensor(a, [[a.shape[0]]], place)
words, crf_decode = exe.run(
infer_program,
fetch_list=[infer_ret['words'], infer_ret['crf_decode']],
feed={"words":c, },
return_numpy=False,
use_program_cache=True)
sents=[]
sent,tag = utils.parse_result(words, crf_decode, dataset)
sents = sents + sent
return sents
def get_result(str1):
feed_data=dataset.get_vars(str1)
a = numpy.array(feed_data).astype(numpy.int64)
a=a.reshape(-1,1)
c = fluid.create_lod_tensor(a, [[a.shape[0]]], place)
words, crf_decode = exe.run(
infer_program,
fetch_list=[infer_ret['words'], infer_ret['crf_decode']],
feed={"words":c, },
return_numpy=False,
use_program_cache=True)
results=[]
results += utils.parse_result(words, crf_decode, dataset)
return results
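`get_sent` and `get_result` are what `jieba.cut` and `jieba.posseg.cut` call in paddle mode; they can also be used directly. A sketch, assuming paddlepaddle-tiny is installed so the module imports (the model is loaded at import time):

```python
import jieba.lac_small.predict as predict  # triggers model loading on import

print(predict.get_sent("我爱北京天安门"))          # list of segmented words
words, tags = predict.get_result("我爱北京天安门")  # parallel lists of words and labels
print(list(zip(words, tags)))
```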


@ -0,0 +1,100 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The file_reader converts raw corpus to input.
"""
import os
import __future__
import io
import paddle
import paddle.fluid as fluid
def load_kv_dict(dict_path,
reverse=False,
delimiter="\t",
key_func=None,
value_func=None):
"""
Load key-value dict from file
"""
result_dict = {}
for line in io.open(dict_path, "r", encoding='utf8'):
terms = line.strip("\n").split(delimiter)
if len(terms) != 2:
continue
if reverse:
value, key = terms
else:
key, value = terms
if key in result_dict:
raise KeyError("key duplicated with [%s]" % (key))
if key_func:
key = key_func(key)
if value_func:
value = value_func(value)
result_dict[key] = value
return result_dict
class Dataset(object):
"""data reader"""
def __init__(self):
# read dict
basepath = os.path.abspath(__file__)
folder = os.path.dirname(basepath)
word_dict_path = os.path.join(folder, "word.dic")
label_dict_path = os.path.join(folder, "tag.dic")
self.word2id_dict = load_kv_dict(
word_dict_path, reverse=True, value_func=int)
self.id2word_dict = load_kv_dict(word_dict_path)
self.label2id_dict = load_kv_dict(
label_dict_path, reverse=True, value_func=int)
self.id2label_dict = load_kv_dict(label_dict_path)
@property
def vocab_size(self):
"""vocabulary size"""
return max(self.word2id_dict.values()) + 1
@property
def num_labels(self):
"""num_labels"""
return max(self.label2id_dict.values()) + 1
def word_to_ids(self, words):
"""convert word to word index"""
word_ids = []
for word in words:
if word not in self.word2id_dict:
word = "OOV"
word_id = self.word2id_dict[word]
word_ids.append(word_id)
return word_ids
def label_to_ids(self, labels):
"""convert label to label index"""
label_ids = []
for label in labels:
if label not in self.label2id_dict:
label = "O"
label_id = self.label2id_dict[label]
label_ids.append(label_id)
return label_ids
def get_vars(self,str1):
words = str1.strip()
word_ids = self.word_to_ids(words)
return word_ids
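`Dataset.word_to_ids` maps each character to an id and falls back to the "OOV" entry for unknown characters. The same lookup pattern in isolation, as a toy sketch with a made-up dictionary (the real ids come from jieba/lac_small/word.dic):

```python
# Hypothetical miniature vocabulary; only the fallback logic matters here.
word2id = {"OOV": 0, "我": 1, "爱": 2, "北": 3, "京": 4}

def word_to_ids(text):
    # unknown characters map to the "OOV" id, as in Dataset.word_to_ids
    return [word2id.get(ch, word2id["OOV"]) for ch in text]

print(word_to_ids("我爱北京天安门"))  # -> [1, 2, 3, 4, 0, 0, 0]
```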

jieba/lac_small/tag.dic Normal file

@ -0,0 +1,57 @@
0 a-B
1 a-I
2 ad-B
3 ad-I
4 an-B
5 an-I
6 c-B
7 c-I
8 d-B
9 d-I
10 f-B
11 f-I
12 m-B
13 m-I
14 n-B
15 n-I
16 nr-B
17 nr-I
18 ns-B
19 ns-I
20 nt-B
21 nt-I
22 nw-B
23 nw-I
24 nz-B
25 nz-I
26 p-B
27 p-I
28 q-B
29 q-I
30 r-B
31 r-I
32 s-B
33 s-I
34 t-B
35 t-I
36 u-B
37 u-I
38 v-B
39 v-I
40 vd-B
41 vd-I
42 vn-B
43 vn-I
44 w-B
45 w-I
46 xc-B
47 xc-I
48 PER-B
49 PER-I
50 LOC-B
51 LOC-I
52 ORG-B
53 ORG-I
54 TIME-B
55 TIME-I
56 O

jieba/lac_small/utils.py Normal file

@ -0,0 +1,142 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
util tools
"""
from __future__ import print_function
import os
import sys
import numpy as np
import paddle.fluid as fluid
import io
def str2bool(v):
"""
argparse does not support True or False in python
"""
return v.lower() in ("true", "t", "1")
def parse_result(words, crf_decode, dataset):
""" parse result """
offset_list = (crf_decode.lod())[0]
words = np.array(words)
crf_decode = np.array(crf_decode)
batch_size = len(offset_list) - 1
for sent_index in range(batch_size):
begin, end = offset_list[sent_index], offset_list[sent_index + 1]
sent=[]
for id in words[begin:end]:
if dataset.id2word_dict[str(id[0])]=='OOV':
sent.append(' ')
else:
sent.append(dataset.id2word_dict[str(id[0])])
tags = [
dataset.id2label_dict[str(id[0])] for id in crf_decode[begin:end]
]
sent_out = []
tags_out = []
parital_word = ""
for ind, tag in enumerate(tags):
# for the first word
if parital_word == "":
parital_word = sent[ind]
tags_out.append(tag.split('-')[0])
continue
# for the beginning of word
if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
sent_out.append(parital_word)
tags_out.append(tag.split('-')[0])
parital_word = sent[ind]
continue
parital_word += sent[ind]
# append the last word, except for len(tags)=0
if len(sent_out) < len(tags_out):
sent_out.append(parital_word)
return sent_out,tags_out
def parse_padding_result(words, crf_decode, seq_lens, dataset):
""" parse padding result """
words = np.squeeze(words)
batch_size = len(seq_lens)
batch_out = []
for sent_index in range(batch_size):
sent=[]
for id in words[begin:end]:
if dataset.id2word_dict[str(id[0])]=='OOV':
sent.append(' ')
else:
sent.append(dataset.id2word_dict[str(id[0])])
tags = [
dataset.id2label_dict[str(id)]
for id in crf_decode[sent_index][1:seq_lens[sent_index] - 1]
]
sent_out = []
tags_out = []
parital_word = ""
for ind, tag in enumerate(tags):
# for the first word
if parital_word == "":
parital_word = sent[ind]
tags_out.append(tag.split('-')[0])
continue
# for the beginning of word
if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
sent_out.append(parital_word)
tags_out.append(tag.split('-')[0])
parital_word = sent[ind]
continue
parital_word += sent[ind]
# append the last word, except for len(tags)=0
if len(sent_out) < len(tags_out):
sent_out.append(parital_word)
batch_out.append([sent_out, tags_out])
return batch_out
def init_checkpoint(exe, init_checkpoint_path, main_program):
"""
Init CheckPoint
"""
assert os.path.exists(
init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path
def existed_persitables(var):
"""
If existed presitabels
"""
if not fluid.io.is_persistable(var):
return False
return os.path.exists(os.path.join(init_checkpoint_path, var.name))
fluid.io.load_vars(
exe,
init_checkpoint_path,
main_program=main_program,
predicate=existed_persitables)
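`parse_result` rebuilds words from per-character labels: a tag ending in "-B" starts a new word, "-I" continues the current one, and the prefix before the dash becomes the word's POS or entity flag. The merging idea on its own, as a simplified toy sketch with hand-written tags (no paddle required):

```python
chars = ["我", "爱", "北", "京", "天", "安", "门"]
tags = ["r-B", "v-B", "LOC-B", "LOC-I", "LOC-B", "LOC-I", "LOC-I"]

words, flags = [], []
for ch, tag in zip(chars, tags):
    if tag.endswith("-B") or not words:
        words.append(ch)                  # start a new word
        flags.append(tag.split("-")[0])   # keep the label prefix as the flag
    else:
        words[-1] += ch                   # "-I": extend the current word

print(list(zip(words, flags)))
# -> [('我', 'r'), ('爱', 'v'), ('北京', 'LOC'), ('天安门', 'LOC')]
```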

jieba/lac_small/word.dic Normal file

File diff suppressed because it is too large.

jieba/posseg/__init__.py Normal file → Executable file

@ -1,11 +1,11 @@
from __future__ import absolute_import, unicode_literals
import os
import re
import sys
import jieba
import pickle
import re
import jieba
from .viterbi import viterbi
from .._compat import *
PROB_START_P = "prob_start.p"
PROB_TRANS_P = "prob_trans.p"
@ -252,6 +252,7 @@ class POSTokenizer(object):
    def lcut(self, *args, **kwargs):
        return list(self.cut(*args, **kwargs))


# default Tokenizer instance
dt = POSTokenizer(jieba.dt)
@ -269,13 +270,25 @@ def _lcut_internal_no_hmm(s):
    return dt._lcut_internal_no_hmm(s)


def cut(sentence, HMM=True, use_paddle=False):
    """
    Global `cut` function that supports parallel processing.

    Note that this only works using dt, custom POSTokenizer
    instances are not supported.
    """
    is_paddle_installed = check_paddle_install['is_paddle_installed']
    if use_paddle and is_paddle_installed:
        # if sentence is null, it will raise core exception in paddle.
        if sentence is None or sentence == "" or sentence == u"":
            return
        import jieba.lac_small.predict as predict
        sents, tags = predict.get_result(strdecode(sentence))
        for i, sent in enumerate(sents):
            if sent is None or tags[i] is None:
                continue
            yield pair(sent, tags[i])
        return
    global dt
    if jieba.pool is None:
        for w in dt.cut(sentence, HMM=HMM):
@ -291,5 +304,7 @@ def cut(sentence, HMM=True):
                yield w


def lcut(sentence, HMM=True, use_paddle=False):
    if use_paddle:
        return list(cut(sentence, use_paddle=True))
    return list(cut(sentence, HMM))
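With the new keyword, the list-returning helper takes the same paddle path (note that `HMM` is not forwarded on that branch). A minimal usage sketch, assuming `enable_paddle()` has already succeeded:

```python
import jieba
import jieba.posseg as pseg

jieba.enable_paddle()
print(pseg.lcut("我爱北京天安门", use_paddle=True))  # list of pair(word, flag) objects
```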


@ -43,8 +43,8 @@ GitHub: https://github.com/fxsjy/jieba
""" """
setup(name='jieba', setup(name='jieba',
version='0.39', version='0.42.1',
description='Chinese Words Segementation Utilities', description='Chinese Words Segmentation Utilities',
long_description=LONGDOC, long_description=LONGDOC,
author='Sun, Junyi', author='Sun, Junyi',
author_email='ccnusjy@gmail.com', author_email='ccnusjy@gmail.com',
@ -71,5 +71,5 @@ setup(name='jieba',
      keywords='NLP,tokenizing,Chinese word segementation',
      packages=['jieba'],
      package_dir={'jieba':'jieba'},
      package_data={'jieba':['*.*','finalseg/*','analyse/*','posseg/*', 'lac_small/*.py','lac_small/*.dic', 'lac_small/model_baseline/*']}
)


@ -96,3 +96,6 @@ if __name__ == "__main__":
    cuttest('AT&T是一件不错的公司给你发offer了吗')
    cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
    cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
    jieba.add_word('超敏C反应蛋白')
    cuttest('超敏C反应蛋白是什么, java好学吗?,小潘老板都学Python')
    cuttest('steel健身爆发力运动兴奋补充剂')

test/test_paddle.py Normal file

@ -0,0 +1,102 @@
#encoding=utf-8
import sys
sys.path.append("../")
import jieba
jieba.enable_paddle()
def cuttest(test_sent):
result = jieba.cut(test_sent, use_paddle=True)
print(" / ".join(result))
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('枪杆子中出政权')
cuttest('张三风同学走上了不归路')
cuttest('阿Q腰间挂着BB机手里拿着大哥大我一般吃饭不AA制的。')
cuttest('在1号店能买到小S和大S八卦的书还有3D电视。')
jieba.del_word('很赞')
cuttest('看上去iphone8手机样式很赞,售价699美元,销量涨了5%么?')

test/test_paddle_postag.py Normal file

@ -0,0 +1,102 @@
#encoding=utf-8
import sys
sys.path.append("../")
import jieba.posseg as pseg
import jieba
jieba.enable_paddle()
def cuttest(test_sent):
result = pseg.cut(test_sent, use_paddle=True)
for word, flag in result:
print('%s %s' % (word, flag))
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('枪杆子中出政权')
cuttest('张三风同学走上了不归路')
cuttest('阿Q腰间挂着BB机手里拿着大哥大我一般吃饭不AA制的。')
cuttest('在1号店能买到小S和大S八卦的书还有3D电视。')


@ -6,7 +6,7 @@ from whoosh.index import create_in,open_dir
from whoosh.fields import *
from whoosh.qparser import QueryParser
from jieba.analyse.analyzer import ChineseAnalyzer
analyzer = ChineseAnalyzer()