Mirror of https://github.com/fxsjy/jieba.git (synced 2025-07-10 00:01:33 +08:00)

Compare commits: 33 commits
Commits (SHA1):
67fa2e36e7, 1e20c89b66, 5704e23bbf, aa65031788, 2eb11c8028, d703bce302, dc2b788eb3,
0868c323d9, eb37e048da, 381b0691ac, 97c32464e1, 0489a6979e, 30ea8f929e, 0b74b6c2de,
2fdee89883, 17bab6a2d1, 80947ff843, 68ce6955b7, d47e14e5b3, 27910094ac, 9dc8e6d992,
478c3b9bb4, 5b3bb4b7f2, 38134ee20f, 3645a5bb5d, 8212b6c572, 843cdc2b7c, 68f2a64f7e,
4c8479cfa6, ca444fb4da, 36a27302ce, 7653db2e33, 17ef8abba3
Changelog (18 lines changed)

@@ -1,3 +1,21 @@
2019-1-20: version 0.42.1
1. Fixed setup.py not working under Python 2.7 (issue #809)

2019-1-13: version 0.42
1. Fixed a core dump in paddle mode on empty strings @JesseyXujin
2. Fixed dropped characters in cut_all mode @fxsjy
3. Improved the paddle installation check @vissssa

2019-1-8: version 0.41
1. Made enabling paddle mode friendlier
2. Fixed cut_all mode not supporting mixed Chinese and English words

2019-12-25: version 0.40
1. Added a deep-learning segmentation mode based on paddle (use_paddle=True); by @JesseyXujin, @xyzhou-puck
2. Fixed the add_word method of custom Tokenizer instances pointing to the global tokenizer; by @linhx13
3. Fixed a broken import in the whoosh test case; by @ZhengZixiang
4. Fixed custom dictionaries not supporting words containing the "-" symbol; by @JimCurryWang

2017-08-28: version 0.39
1. del_word can now force a word to be split apart; by @gumblex, @fxsjy
2. Fixed segmentation of percentages; by @fxsjy
README.md (59 lines changed)

@@ -9,24 +9,15 @@ jieba

Features
========
* Supports three segmentation modes:
* Supports four segmentation modes:
    * Accurate mode: tries to cut the sentence into the most accurate segmentation; suitable for text analysis;
    * Full mode: scans out every word-forming fragment in the sentence; very fast, but it cannot resolve ambiguity;
    * Search-engine mode: on top of accurate mode, long words are cut again to improve recall; suitable for search-engine tokenization.

    * Paddle mode: uses the PaddlePaddle deep-learning framework and a trained sequence-labeling (bidirectional GRU) network model for segmentation, with POS tagging also supported. Paddle mode requires paddlepaddle-tiny: `pip install paddlepaddle-tiny==1.6.1`. Paddle mode currently requires jieba v0.40 or above; for earlier versions, upgrade jieba with `pip install jieba --upgrade`. [PaddlePaddle website](https://www.paddlepaddle.org.cn/)
* Supports Traditional Chinese segmentation
* Supports custom dictionaries
* MIT license

Online demo
=========
http://jiebademo.ap01.aws.af.cm/

(Powered by Appfog)

Site code: https://github.com/fxsjy/jiebademo


Installation
=======

@@ -36,6 +27,7 @@ http://jiebademo.ap01.aws.af.cm/
* Semi-automatic install: download http://pypi.python.org/pypi/jieba/, unpack it, and run `python setup.py install`
* Manual install: place the jieba directory in the current directory or in site-packages
* Use it via `import jieba`
* To use segmentation and POS tagging in paddle mode, first install paddlepaddle-tiny: `pip install paddlepaddle-tiny==1.6.1`.

Algorithm
========

@@ -47,7 +39,7 @@ http://jiebademo.ap01.aws.af.cm/
=======
1. Segmentation
--------
* The `jieba.cut` method takes three input parameters: the string to segment; `cut_all`, which controls whether full mode is used; and `HMM`, which controls whether the HMM model is used
* The `jieba.cut` method takes four input parameters: the string to segment; `cut_all`, which controls whether full mode is used; `HMM`, which controls whether the HMM model is used; and `use_paddle`, which controls whether paddle-mode segmentation is used. Paddle mode is lazily loaded: the `enable_paddle` interface installs paddlepaddle-tiny and imports the related code;
* The `jieba.cut_for_search` method takes two parameters: the string to segment, and whether to use the HMM model. It is suited to segmentation for a search engine's inverted index, with finer granularity
* The string to segment may be a unicode, UTF-8, or GBK string. Note: passing GBK strings directly is not recommended, as they may be unpredictably mis-decoded as UTF-8
* `jieba.cut` and `jieba.cut_for_search` both return an iterable generator; use a for loop to obtain each word (unicode) produced by the segmentation, or use

@@ -60,6 +52,12 @@ http://jiebademo.ap01.aws.af.cm/
# encoding=utf-8
import jieba

jieba.enable_paddle()  # enable paddle mode; supported from v0.40 on, not in earlier versions
strs=["我来到北京清华大学","乒乓球拍卖完了","中国科学技术大学"]
for str in strs:
    seg_list = jieba.cut(str,use_paddle=True)  # use paddle mode
    print("Paddle Mode: " + '/'.join(list(seg_list)))

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

@@ -195,11 +193,15 @@ https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
-----------
* `jieba.posseg.POSTokenizer(tokenizer=None)` creates a custom POS tokenizer; the `tokenizer` parameter specifies the internal `jieba.Tokenizer` to use. `jieba.posseg.dt` is the default POS tokenizer.
* Tags the part of speech of every word after segmentation, using labels compatible with ictclas.
* In addition to jieba's default segmentation mode, POS tagging is also provided in paddle mode. Paddle mode is lazily loaded: enable_paddle() installs paddlepaddle-tiny and imports the related code;
* Usage example

```pycon
>>> import jieba
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> words = pseg.cut("我爱北京天安门") # jieba default mode
>>> jieba.enable_paddle() # enable paddle mode; supported from v0.40 on, not in earlier versions
>>> words = pseg.cut("我爱北京天安门",use_paddle=True) # paddle mode
>>> for word, flag in words:
...    print('%s %s' % (word, flag))
...

@@ -209,6 +211,21 @@ https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
天安门 ns
```

The POS tag mapping for paddle mode is as follows:

The paddle-mode POS and named-entity tag sets are listed below: 24 POS tags (lowercase letters) and 4 named-entity tags (uppercase letters).

| Tag | Meaning | Tag | Meaning | Tag | Meaning | Tag | Meaning |
| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- |
| n | common noun | f | locative noun | s | place noun | t | time |
| nr | person name | ns | place name | nt | organization name | nw | work title |
| nz | other proper noun | v | common verb | vd | verb-adverb | vn | verbal noun |
| a | adjective | ad | adverbial adjective | an | nominal adjective | d | adverb |
| m | numeral | q | measure word | r | pronoun | p | preposition |
| c | conjunction | u | particle | xc | other function word | w | punctuation |
| PER | person name | LOC | location | ORG | organization | TIME | time |


5. Parallel segmentation
-----------
* Principle: split the target text by line, distribute the lines to multiple Python processes to segment in parallel, then merge the results, which gives a considerable speedup
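A minimal usage sketch of the parallel mode described in the bullet above, using the `jieba.enable_parallel` / `jieba.disable_parallel` interfaces documented elsewhere in this README; the corpus file name is only an example, and per the project docs parallel mode does not support Windows:

```python
import jieba

jieba.enable_parallel(4)              # segment with 4 worker processes
with open("big_corpus.txt", "rb") as f:   # illustrative file name
    text = f.read().decode("utf-8")
words = list(jieba.cut(text))         # lines are segmented in parallel and merged in order
jieba.disable_parallel()              # back to single-process mode
print(len(words))
```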
@@ -362,6 +379,11 @@ https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Author: yanyiwu
Repository: https://github.com/yanyiwu/cppjieba

Jieba, Rust version
----------------
Authors: messense, MnO2
Repository: https://github.com/messense/jieba-rs

Jieba, Node.js version
----------------
Author: yanyiwu

@@ -398,6 +420,17 @@ https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
+ Author: wangbin  Repository: https://github.com/wangbin/jiebago
+ Author: yanyiwu  Repository: https://github.com/yanyiwu/gojieba

Jieba, Android version
------------------
+ Author: Dongliang.W  Repository: https://github.com/452896915/jieba-android


Related links
=========
* https://github.com/baidu/lac  Baidu Chinese lexical analysis (segmentation + POS tagging + named entities)
* https://github.com/baidu/AnyQ  Baidu FAQ automatic question-answering system
* https://github.com/baidu/Senta  Baidu sentiment analysis system

System integration
========
1. Solr: https://github.com/sing1ee/jieba-solr
jieba/__init__.py

@@ -1,19 +1,18 @@
from __future__ import absolute_import, unicode_literals
__version__ = '0.39'

__version__ = '0.42.1'
__license__ = 'MIT'

import re
import os
import sys
import time
import logging
import marshal
import re
import tempfile
import threading
from math import log
import time
from hashlib import md5
from ._compat import *
from math import log

from . import finalseg
from ._compat import *

if os.name == 'nt':
    from shutil import move as _replace_file

@@ -40,15 +39,17 @@ re_eng = re.compile('[a-zA-Z0-9]', re.U)

# \u4E00-\u9FD5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
# \r\n|\s : whitespace characters. Will not be handled.
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
# re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
# Adding "-" symbol in re_han_default
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)

re_skip_default = re.compile("(\r\n|\s)", re.U)
re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)
re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)


def setLogLevel(log_level):
    global logger
    default_logger.setLevel(log_level)


class Tokenizer(object):

    def __init__(self, dictionary=DEFAULT_DICT):

@@ -67,7 +68,8 @@ class Tokenizer(object):
    def __repr__(self):
        return '<Tokenizer dictionary=%r>' % self.dictionary

    def gen_pfdict(self, f):
    @staticmethod
    def gen_pfdict(f):
        lfreq = {}
        ltotal = 0
        f_name = resolve_filename(f)

@@ -161,7 +163,7 @@ class Tokenizer(object):
        self.initialized = True
        default_logger.debug(
            "Loading model cost %.3f seconds." % (time.time() - t1))
        default_logger.debug("Prefix dict has been built succesfully.")
        default_logger.debug("Prefix dict has been built successfully.")

    def check_initialized(self):
        if not self.initialized:

@@ -196,15 +198,30 @@ class Tokenizer(object):
    def __cut_all(self, sentence):
        dag = self.get_DAG(sentence)
        old_j = -1
        eng_scan = 0
        eng_buf = u''
        for k, L in iteritems(dag):
            if eng_scan == 1 and not re_eng.match(sentence[k]):
                eng_scan = 0
                yield eng_buf
            if len(L) == 1 and k > old_j:
                yield sentence[k:L[0] + 1]
                word = sentence[k:L[0] + 1]
                if re_eng.match(word):
                    if eng_scan == 0:
                        eng_scan = 1
                        eng_buf = word
                    else:
                        eng_buf += word
                if eng_scan == 0:
                    yield word
                old_j = L[0]
            else:
                for j in L:
                    if j > k:
                        yield sentence[k:j + 1]
                        old_j = j
        if eng_scan == 1:
            yield eng_buf

    def __cut_DAG_NO_HMM(self, sentence):
        DAG = self.get_DAG(sentence)
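The `__cut_all` change above buffers consecutive ASCII letters and digits so that full mode emits them as one token instead of losing them character by character (the cut_all fixes noted in the 0.41 and 0.42 changelog entries). A small hedged check; the exact Chinese token boundaries depend on the dictionary:

```python
import jieba

# Mixed Chinese/English input in full mode: the Latin run "Python3" is
# collected into eng_buf and yielded as a single token.
print("/ ".join(jieba.cut("我在学习Python3自然语言处理", cut_all=True)))
```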
@@ -269,22 +286,29 @@
            for elem in buf:
                yield elem

    def cut(self, sentence, cut_all=False, HMM=True):
        '''
    def cut(self, sentence, cut_all=False, HMM=True, use_paddle=False):
        """
        The main function that segments an entire sentence that contains
        Chinese characters into seperated words.
        Chinese characters into separated words.

        Parameter:
            - sentence: The str(unicode) to be segmented.
            - cut_all: Model type. True for full pattern, False for accurate pattern.
            - HMM: Whether to use the Hidden Markov Model.
        '''
        """
        is_paddle_installed = check_paddle_install['is_paddle_installed']
        sentence = strdecode(sentence)

        if cut_all:
            re_han = re_han_cut_all
            re_skip = re_skip_cut_all
        else:
        if use_paddle and is_paddle_installed:
            # if sentence is null, it will raise core exception in paddle.
            if sentence is None or len(sentence) == 0:
                return
            import jieba.lac_small.predict as predict
            results = predict.get_sent(sentence)
            for sent in results:
                if sent is None:
                    continue
                yield sent
            return
        re_han = re_han_default
        re_skip = re_skip_default
        if cut_all:

@@ -446,7 +470,7 @@
            freq *= self.FREQ.get(seg, 1) / ftotal
        freq = min(int(freq * self.total), self.FREQ.get(word, 0))
        if tune:
            add_word(word, freq)
            self.add_word(word, freq)
        return freq

    def tokenize(self, unicode_sentence, mode="default", HMM=True):
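The `suggest_freq` hunk above replaces the bare `add_word(word, freq)` call with `self.add_word(word, freq)`, so `tune=True` writes back to the tokenizer it was called on; this appears to be the custom-Tokenizer fix credited to @linhx13 in the 0.40 changelog. A short hedged sketch of the documented usage:

```python
import jieba

# Module-level call tunes the default tokenizer (jieba.dt), as before.
jieba.suggest_freq('台中', tune=True)

# On a custom Tokenizer, tune=True now updates that instance's own dictionary
# instead of silently writing to the global one.
tk = jieba.Tokenizer()
tk.suggest_freq(('中', '将'), tune=True)   # encourage splitting "中将" into "中" / "将"
print(tk.lcut('如果放到post中将出错'))
```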
jieba/_compat.py

@@ -1,15 +1,56 @@
# -*- coding: utf-8 -*-
import logging
import os
import sys

log_console = logging.StreamHandler(sys.stderr)
default_logger = logging.getLogger(__name__)
default_logger.setLevel(logging.DEBUG)


def setLogLevel(log_level):
    default_logger.setLevel(log_level)


check_paddle_install = {'is_paddle_installed': False}

try:
    import pkg_resources

    get_module_res = lambda *res: pkg_resources.resource_stream(__name__,
                                                                os.path.join(*res))
except ImportError:
    get_module_res = lambda *res: open(os.path.normpath(os.path.join(
        os.getcwd(), os.path.dirname(__file__), *res)), 'rb')


def enable_paddle():
    try:
        import paddle
    except ImportError:
        default_logger.debug("Installing paddle-tiny, please wait a minute......")
        os.system("pip install paddlepaddle-tiny")
        try:
            import paddle
        except ImportError:
            default_logger.debug(
                "Import paddle error, please use command to install: pip install paddlepaddle-tiny==1.6.1."
                "Now, back to jieba basic cut......")
    if paddle.__version__ < '1.6.1':
        default_logger.debug("Find your own paddle version doesn't satisfy the minimum requirement (1.6.1), "
                             "please install paddle tiny by 'pip install --upgrade paddlepaddle-tiny', "
                             "or upgrade paddle full version by "
                             "'pip install --upgrade paddlepaddle (-gpu for GPU version)' ")
    else:
        try:
            import jieba.lac_small.predict as predict
            default_logger.debug("Paddle enabled successfully......")
            check_paddle_install['is_paddle_installed'] = True
        except ImportError:
            default_logger.debug("Import error, cannot find paddle.fluid and jieba.lac_small.predict module. "
                                 "Now, back to jieba basic cut......")


PY2 = sys.version_info[0] == 2

default_encoding = sys.getfilesystemencoding()

@@ -31,6 +72,7 @@ else:
    itervalues = lambda d: iter(d.values())
    iteritems = lambda d: iter(d.items())


def strdecode(sentence):
    if not isinstance(sentence, text_type):
        try:

@@ -39,6 +81,7 @@ def strdecode(sentence):
            sentence = sentence.decode('gbk', 'ignore')
    return sentence


def resolve_filename(f):
    try:
        return f.name
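For context, a hedged sketch of how `enable_paddle` and the `check_paddle_install` flag defined above work together: callers enable paddle once, and `use_paddle=True` quietly falls back to the default cut when the flag was never set.

```python
import jieba
from jieba._compat import check_paddle_install

jieba.enable_paddle()   # tries to import paddle, pip-installing paddlepaddle-tiny if needed

sentence = "我来到北京清华大学"
if check_paddle_install['is_paddle_installed']:
    print(list(jieba.cut(sentence, use_paddle=True)))   # paddle-based segmentation
else:
    print(list(jieba.cut(sentence)))                    # flag still False: default mode
```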
jieba/lac_small/__init__.py (new file, 0 lines)

jieba/lac_small/creator.py (new file, 46 lines)
@@ -0,0 +1,46 @@
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Define the function to create lexical analysis model and model's data reader
"""
import sys
import os
import math

import paddle
import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
import jieba.lac_small.nets as nets


def create_model(vocab_size, num_labels, mode='train'):
    """create lac model"""

    # model's input data
    words = fluid.data(name='words', shape=[-1, 1], dtype='int64', lod_level=1)
    targets = fluid.data(
        name='targets', shape=[-1, 1], dtype='int64', lod_level=1)

    # for inference process
    if mode == 'infer':
        crf_decode = nets.lex_net(
            words, vocab_size, num_labels, for_infer=True, target=None)
        return {
            "feed_list": [words],
            "words": words,
            "crf_decode": crf_decode,
        }
    return ret
New binary files (contents not shown) under jieba/lac_small/model_baseline/:
crfw, fc_0.b_0, fc_0.w_0, fc_1.b_0, fc_1.w_0, fc_2.b_0, fc_2.w_0, fc_3.b_0, fc_3.w_0,
fc_4.b_0, fc_4.w_0, gru_0.b_0, gru_0.w_0, gru_1.b_0, gru_1.w_0, gru_2.b_0, gru_2.w_0,
gru_3.b_0, gru_3.w_0, word_emb
jieba/lac_small/nets.py (new file, 122 lines)

@@ -0,0 +1,122 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The function lex_net(args) define the lexical analysis network structure
"""
import sys
import os
import math

import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer


def lex_net(word, vocab_size, num_labels, for_infer=True, target=None):
    """
    define the lexical analysis network structure
    word: stores the input of the model
    for_infer: a boolean value, indicating if the model to be created is for training or predicting.

    return:
        for infer: return the prediction
        otherwise: return the prediction
    """

    word_emb_dim=128
    grnn_hidden_dim=128
    bigru_num=2
    emb_lr = 1.0
    crf_lr = 1.0
    init_bound = 0.1
    IS_SPARSE = True

    def _bigru_layer(input_feature):
        """
        define the bidirectional gru layer
        """
        pre_gru = fluid.layers.fc(
            input=input_feature,
            size=grnn_hidden_dim * 3,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound),
                regularizer=fluid.regularizer.L2DecayRegularizer(
                    regularization_coeff=1e-4)))
        gru = fluid.layers.dynamic_gru(
            input=pre_gru,
            size=grnn_hidden_dim,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound),
                regularizer=fluid.regularizer.L2DecayRegularizer(
                    regularization_coeff=1e-4)))

        pre_gru_r = fluid.layers.fc(
            input=input_feature,
            size=grnn_hidden_dim * 3,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound),
                regularizer=fluid.regularizer.L2DecayRegularizer(
                    regularization_coeff=1e-4)))
        gru_r = fluid.layers.dynamic_gru(
            input=pre_gru_r,
            size=grnn_hidden_dim,
            is_reverse=True,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound),
                regularizer=fluid.regularizer.L2DecayRegularizer(
                    regularization_coeff=1e-4)))

        bi_merge = fluid.layers.concat(input=[gru, gru_r], axis=1)
        return bi_merge

    def _net_conf(word, target=None):
        """
        Configure the network
        """
        word_embedding = fluid.embedding(
            input=word,
            size=[vocab_size, word_emb_dim],
            dtype='float32',
            is_sparse=IS_SPARSE,
            param_attr=fluid.ParamAttr(
                learning_rate=emb_lr,
                name="word_emb",
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound)))

        input_feature = word_embedding
        for i in range(bigru_num):
            bigru_output = _bigru_layer(input_feature)
            input_feature = bigru_output

        emission = fluid.layers.fc(
            size=num_labels,
            input=bigru_output,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound),
                regularizer=fluid.regularizer.L2DecayRegularizer(
                    regularization_coeff=1e-4)))

        size = emission.shape[1]
        fluid.layers.create_parameter(
            shape=[size + 2, size], dtype=emission.dtype, name='crfw')
        crf_decode = fluid.layers.crf_decoding(
            input=emission, param_attr=fluid.ParamAttr(name='crfw'))

        return crf_decode
    return _net_conf(word)
jieba/lac_small/predict.py (new file, 82 lines)

@@ -0,0 +1,82 @@
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import argparse
import os
import time
import sys

import paddle.fluid as fluid
import paddle

import jieba.lac_small.utils as utils
import jieba.lac_small.creator as creator
import jieba.lac_small.reader_small as reader_small
import numpy

word_emb_dim=128
grnn_hidden_dim=128
bigru_num=2
use_cuda=False
basepath = os.path.abspath(__file__)
folder = os.path.dirname(basepath)
init_checkpoint = os.path.join(folder, "model_baseline")
batch_size=1

dataset = reader_small.Dataset()
infer_program = fluid.Program()
with fluid.program_guard(infer_program, fluid.default_startup_program()):
    with fluid.unique_name.guard():
        infer_ret = creator.create_model(dataset.vocab_size, dataset.num_labels, mode='infer')
        infer_program = infer_program.clone(for_test=True)
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
utils.init_checkpoint(exe, init_checkpoint, infer_program)
results = []

def get_sent(str1):
    feed_data=dataset.get_vars(str1)
    a = numpy.array(feed_data).astype(numpy.int64)
    a=a.reshape(-1,1)
    c = fluid.create_lod_tensor(a, [[a.shape[0]]], place)

    words, crf_decode = exe.run(
            infer_program,
            fetch_list=[infer_ret['words'], infer_ret['crf_decode']],
            feed={"words":c, },
            return_numpy=False,
            use_program_cache=True)
    sents=[]
    sent,tag = utils.parse_result(words, crf_decode, dataset)
    sents = sents + sent
    return sents

def get_result(str1):
    feed_data=dataset.get_vars(str1)
    a = numpy.array(feed_data).astype(numpy.int64)
    a=a.reshape(-1,1)
    c = fluid.create_lod_tensor(a, [[a.shape[0]]], place)

    words, crf_decode = exe.run(
            infer_program,
            fetch_list=[infer_ret['words'], infer_ret['crf_decode']],
            feed={"words":c, },
            return_numpy=False,
            use_program_cache=True)
    results=[]
    results += utils.parse_result(words, crf_decode, dataset)
    return results
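A hedged note on how this module is consumed: per the diffs above, `jieba.cut(..., use_paddle=True)` calls `predict.get_sent` and `jieba.posseg.cut(..., use_paddle=True)` calls `predict.get_result`. A minimal direct-use sketch, assuming paddlepaddle-tiny and the bundled model_baseline files are installed:

```python
import jieba.lac_small.predict as predict

# get_sent returns only the segmented words; get_result returns (words, tags).
print(predict.get_sent("我爱北京天安门"))
words, tags = predict.get_result("我爱北京天安门")
print(list(zip(words, tags)))
```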
jieba/lac_small/reader_small.py (new file, 100 lines)

@@ -0,0 +1,100 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The file_reader converts raw corpus to input.
"""

import os
import __future__
import io
import paddle
import paddle.fluid as fluid

def load_kv_dict(dict_path,
                 reverse=False,
                 delimiter="\t",
                 key_func=None,
                 value_func=None):
    """
    Load key-value dict from file
    """
    result_dict = {}
    for line in io.open(dict_path, "r", encoding='utf8'):
        terms = line.strip("\n").split(delimiter)
        if len(terms) != 2:
            continue
        if reverse:
            value, key = terms
        else:
            key, value = terms
        if key in result_dict:
            raise KeyError("key duplicated with [%s]" % (key))
        if key_func:
            key = key_func(key)
        if value_func:
            value = value_func(value)
        result_dict[key] = value
    return result_dict

class Dataset(object):
    """data reader"""
    def __init__(self):
        # read dict
        basepath = os.path.abspath(__file__)
        folder = os.path.dirname(basepath)
        word_dict_path = os.path.join(folder, "word.dic")
        label_dict_path = os.path.join(folder, "tag.dic")
        self.word2id_dict = load_kv_dict(
            word_dict_path, reverse=True, value_func=int)
        self.id2word_dict = load_kv_dict(word_dict_path)
        self.label2id_dict = load_kv_dict(
            label_dict_path, reverse=True, value_func=int)
        self.id2label_dict = load_kv_dict(label_dict_path)

    @property
    def vocab_size(self):
        """vocabulary size"""
        return max(self.word2id_dict.values()) + 1

    @property
    def num_labels(self):
        """num_labels"""
        return max(self.label2id_dict.values()) + 1

    def word_to_ids(self, words):
        """convert word to word index"""
        word_ids = []
        for word in words:
            if word not in self.word2id_dict:
                word = "OOV"
            word_id = self.word2id_dict[word]
            word_ids.append(word_id)
        return word_ids

    def label_to_ids(self, labels):
        """convert label to label index"""
        label_ids = []
        for label in labels:
            if label not in self.label2id_dict:
                label = "O"
            label_id = self.label2id_dict[label]
            label_ids.append(label_id)
        return label_ids

    def get_vars(self,str1):
        words = str1.strip()
        word_ids = self.word_to_ids(words)
        return word_ids
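A small hedged sketch of the `Dataset` helper above, which is how `predict.get_sent` turns a sentence into the id sequence it feeds to the network; it assumes paddlepaddle-tiny is importable (the module imports paddle) and that word.dic and tag.dic ship with the package:

```python
from jieba.lac_small.reader_small import Dataset

ds = Dataset()
print(ds.vocab_size, ds.num_labels)        # sizes derived from word.dic / tag.dic
print(ds.word_to_ids("北京欢迎你"))          # one id per character; unknown characters map to "OOV"
print(ds.label_to_ids(["ns-B", "ns-I"]))   # label ids from tag.dic
```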
jieba/lac_small/tag.dic (new file, 57 lines)

@@ -0,0 +1,57 @@
0 a-B
1 a-I
2 ad-B
3 ad-I
4 an-B
5 an-I
6 c-B
7 c-I
8 d-B
9 d-I
10 f-B
11 f-I
12 m-B
13 m-I
14 n-B
15 n-I
16 nr-B
17 nr-I
18 ns-B
19 ns-I
20 nt-B
21 nt-I
22 nw-B
23 nw-I
24 nz-B
25 nz-I
26 p-B
27 p-I
28 q-B
29 q-I
30 r-B
31 r-I
32 s-B
33 s-I
34 t-B
35 t-I
36 u-B
37 u-I
38 v-B
39 v-I
40 vd-B
41 vd-I
42 vn-B
43 vn-I
44 w-B
45 w-I
46 xc-B
47 xc-I
48 PER-B
49 PER-I
50 LOC-B
51 LOC-I
52 ORG-B
53 ORG-I
54 TIME-B
55 TIME-I
56 O
jieba/lac_small/utils.py (new file, 142 lines)

@@ -0,0 +1,142 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
util tools
"""
from __future__ import print_function
import os
import sys
import numpy as np
import paddle.fluid as fluid
import io


def str2bool(v):
    """
    argparse does not support True or False in python
    """
    return v.lower() in ("true", "t", "1")


def parse_result(words, crf_decode, dataset):
    """ parse result """
    offset_list = (crf_decode.lod())[0]
    words = np.array(words)
    crf_decode = np.array(crf_decode)
    batch_size = len(offset_list) - 1

    for sent_index in range(batch_size):
        begin, end = offset_list[sent_index], offset_list[sent_index + 1]
        sent=[]
        for id in words[begin:end]:
            if dataset.id2word_dict[str(id[0])]=='OOV':
                sent.append(' ')
            else:
                sent.append(dataset.id2word_dict[str(id[0])])
        tags = [
            dataset.id2label_dict[str(id[0])] for id in crf_decode[begin:end]
        ]

        sent_out = []
        tags_out = []
        parital_word = ""
        for ind, tag in enumerate(tags):
            # for the first word
            if parital_word == "":
                parital_word = sent[ind]
                tags_out.append(tag.split('-')[0])
                continue

            # for the beginning of word
            if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
                sent_out.append(parital_word)
                tags_out.append(tag.split('-')[0])
                parital_word = sent[ind]
                continue

            parital_word += sent[ind]

        # append the last word, except for len(tags)=0
        if len(sent_out) < len(tags_out):
            sent_out.append(parital_word)
        return sent_out,tags_out

def parse_padding_result(words, crf_decode, seq_lens, dataset):
    """ parse padding result """
    words = np.squeeze(words)
    batch_size = len(seq_lens)

    batch_out = []
    for sent_index in range(batch_size):

        sent=[]
        for id in words[begin:end]:
            if dataset.id2word_dict[str(id[0])]=='OOV':
                sent.append(' ')
            else:
                sent.append(dataset.id2word_dict[str(id[0])])
        tags = [
            dataset.id2label_dict[str(id)]
            for id in crf_decode[sent_index][1:seq_lens[sent_index] - 1]
        ]

        sent_out = []
        tags_out = []
        parital_word = ""
        for ind, tag in enumerate(tags):
            # for the first word
            if parital_word == "":
                parital_word = sent[ind]
                tags_out.append(tag.split('-')[0])
                continue

            # for the beginning of word
            if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
                sent_out.append(parital_word)
                tags_out.append(tag.split('-')[0])
                parital_word = sent[ind]
                continue

            parital_word += sent[ind]

        # append the last word, except for len(tags)=0
        if len(sent_out) < len(tags_out):
            sent_out.append(parital_word)

        batch_out.append([sent_out, tags_out])
    return batch_out


def init_checkpoint(exe, init_checkpoint_path, main_program):
    """
    Init CheckPoint
    """
    assert os.path.exists(
        init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path

    def existed_persitables(var):
        """
        If existed presitabels
        """
        if not fluid.io.is_persistable(var):
            return False
        return os.path.exists(os.path.join(init_checkpoint_path, var.name))

    fluid.io.load_vars(
        exe,
        init_checkpoint_path,
        main_program=main_program,
        predicate=existed_persitables)
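`parse_result` above reassembles characters into words from the `-B`/`-I` scheme listed in tag.dic: a `-B` tag starts a new word, `-I` continues it, and the prefix before the dash becomes the word's label. A self-contained, slightly simplified sketch of that merging rule in plain Python (unlike `parse_result`, it starts a new token on every `O` tag); the sentence and tags are made up for illustration:

```python
def merge_bi(chars, tags):
    """Merge per-character B/I tags into (word, label) pairs."""
    words, labels, partial = [], [], ""
    for ch, tag in zip(chars, tags):
        if partial and (tag.endswith("-B") or tag == "O"):
            words.append(partial)          # flush the finished word
            partial = ""
        partial += ch
        if tag.endswith("-B") or tag == "O":
            labels.append(tag.split("-")[0])   # label comes from the prefix
    if partial:
        words.append(partial)
    return list(zip(words, labels))

# "北京" tagged LOC-B/LOC-I merges into one LOC token; the other characters stay single.
print(merge_bi("我爱北京", ["r-B", "v-B", "LOC-B", "LOC-I"]))
# -> [('我', 'r'), ('爱', 'v'), ('北京', 'LOC')]
```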
jieba/lac_small/word.dic (new file, 20940 lines; diff not shown because it is too large)
jieba/posseg/__init__.py (29 lines changed; file mode changed from normal to executable)

@@ -1,11 +1,11 @@
from __future__ import absolute_import, unicode_literals
import os
import re
import sys
import jieba

import pickle
from .._compat import *
import re

import jieba
from .viterbi import viterbi
from .._compat import *

PROB_START_P = "prob_start.p"
PROB_TRANS_P = "prob_trans.p"

@@ -252,6 +252,7 @@ class POSTokenizer(object):
    def lcut(self, *args, **kwargs):
        return list(self.cut(*args, **kwargs))


# default Tokenizer instance

dt = POSTokenizer(jieba.dt)

@@ -269,13 +270,25 @@ def _lcut_internal_no_hmm(s):
    return dt._lcut_internal_no_hmm(s)


def cut(sentence, HMM=True):
def cut(sentence, HMM=True, use_paddle=False):
    """
    Global `cut` function that supports parallel processing.

    Note that this only works using dt, custom POSTokenizer
    instances are not supported.
    """
    is_paddle_installed = check_paddle_install['is_paddle_installed']
    if use_paddle and is_paddle_installed:
        # if sentence is null, it will raise core exception in paddle.
        if sentence is None or sentence == "" or sentence == u"":
            return
        import jieba.lac_small.predict as predict
        sents, tags = predict.get_result(strdecode(sentence))
        for i, sent in enumerate(sents):
            if sent is None or tags[i] is None:
                continue
            yield pair(sent, tags[i])
        return
    global dt
    if jieba.pool is None:
        for w in dt.cut(sentence, HMM=HMM):

@@ -291,5 +304,7 @@ def cut(sentence, HMM=True):
            yield w


def lcut(sentence, HMM=True):
def lcut(sentence, HMM=True, use_paddle=False):
    if use_paddle:
        return list(cut(sentence, use_paddle=True))
    return list(cut(sentence, HMM))
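A brief hedged sketch of the new `use_paddle` flag on the module-level posseg helpers shown above; paddle must have been enabled first, otherwise the call falls back to the default HMM-based tagger:

```python
import jieba
import jieba.posseg as pseg

jieba.enable_paddle()
# lcut(use_paddle=True) wraps cut(use_paddle=True), which yields pair(word, flag) objects.
for w in pseg.lcut("我爱北京天安门", use_paddle=True):
    print(w.word, w.flag)
```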
setup.py (6 lines changed)

@@ -43,8 +43,8 @@ GitHub: https://github.com/fxsjy/jieba
"""

setup(name='jieba',
      version='0.39',
      description='Chinese Words Segementation Utilities',
      version='0.42.1',
      description='Chinese Words Segmentation Utilities',
      long_description=LONGDOC,
      author='Sun, Junyi',
      author_email='ccnusjy@gmail.com',

@@ -71,5 +71,5 @@ setup(name='jieba',
      keywords='NLP,tokenizing,Chinese word segementation',
      packages=['jieba'],
      package_dir={'jieba':'jieba'},
      package_data={'jieba':['*.*','finalseg/*','analyse/*','posseg/*']}
      package_data={'jieba':['*.*','finalseg/*','analyse/*','posseg/*', 'lac_small/*.py','lac_small/*.dic', 'lac_small/model_baseline/*']}
)
@@ -96,3 +96,6 @@ if __name__ == "__main__":
    cuttest('AT&T是一件不错的公司,给你发offer了吗?')
    cuttest('C++和c#是什么关系?11+122=133,是吗?PI=3.14159')
    cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
    jieba.add_word('超敏C反应蛋白')
    cuttest('超敏C反应蛋白是什么, java好学吗?,小潘老板都学Python')
    cuttest('steel健身爆发力运动兴奋补充剂')
test/test_paddle.py (new file, 102 lines)

@@ -0,0 +1,102 @@
#encoding=utf-8
import sys
sys.path.append("../")
import jieba
jieba.enable_paddle()

def cuttest(test_sent):
    result = jieba.cut(test_sent, use_paddle=True)
    print(" / ".join(result))


if __name__ == "__main__":
    cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。")
    cuttest("我不喜欢日本和服。")
    cuttest("雷猴回归人间。")
    cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
    cuttest("我需要廉租房")
    cuttest("永和服装饰品有限公司")
    cuttest("我爱北京天安门")
    cuttest("abc")
    cuttest("隐马尔可夫")
    cuttest("雷猴是个好网站")
    cuttest("“Microsoft”一词由“MICROcomputer(微型计算机)”和“SOFTware(软件)”两部分组成")
    cuttest("草泥马和欺实马是今年的流行词汇")
    cuttest("伊藤洋华堂总府店")
    cuttest("中国科学院计算技术研究所")
    cuttest("罗密欧与朱丽叶")
    cuttest("我购买了道具和服装")
    cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
    cuttest("湖北省石首市")
    cuttest("湖北省十堰市")
    cuttest("总经理完成了这件事情")
    cuttest("电脑修好了")
    cuttest("做好了这件事情就一了百了了")
    cuttest("人们审美的观点是不同的")
    cuttest("我们买了一个美的空调")
    cuttest("线程初始化时我们要注意")
    cuttest("一个分子是由好多原子组织成的")
    cuttest("祝你马到功成")
    cuttest("他掉进了无底洞里")
    cuttest("中国的首都是北京")
    cuttest("孙君意")
    cuttest("外交部发言人马朝旭")
    cuttest("领导人会议和第四届东亚峰会")
    cuttest("在过去的这五年")
    cuttest("还需要很长的路要走")
    cuttest("60周年首都阅兵")
    cuttest("你好人们审美的观点是不同的")
    cuttest("买水果然后来世博园")
    cuttest("买水果然后去世博园")
    cuttest("但是后来我才知道你是对的")
    cuttest("存在即合理")
    cuttest("的的的的的在的的的的就以和和和")
    cuttest("I love你,不以为耻,反以为rong")
    cuttest("因")
    cuttest("")
    cuttest("hello你好人们审美的观点是不同的")
    cuttest("很好但主要是基于网页形式")
    cuttest("hello你好人们审美的观点是不同的")
    cuttest("为什么我不能拥有想要的生活")
    cuttest("后来我才")
    cuttest("此次来中国是为了")
    cuttest("使用了它就可以解决一些问题")
    cuttest(",使用了它就可以解决一些问题")
    cuttest("其实使用了它就可以解决一些问题")
    cuttest("好人使用了它就可以解决一些问题")
    cuttest("是因为和国家")
    cuttest("老年搜索还支持")
    cuttest("干脆就把那部蒙人的闲法给废了拉倒!RT @laoshipukong : 27日,全国人大常委会第三次审议侵权责任法草案,删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
    cuttest("大")
    cuttest("")
    cuttest("他说的确实在理")
    cuttest("长春市长春节讲话")
    cuttest("结婚的和尚未结婚的")
    cuttest("结合成分子时")
    cuttest("旅游和服务是最好的")
    cuttest("这件事情的确是我的错")
    cuttest("供大家参考指正")
    cuttest("哈尔滨政府公布塌桥原因")
    cuttest("我在机场入口处")
    cuttest("邢永臣摄影报道")
    cuttest("BP神经网络如何训练才能在分类时增加区分度?")
    cuttest("南京市长江大桥")
    cuttest("应一些使用者的建议,也为了便于利用NiuTrans用于SMT研究")
    cuttest('长春市长春药店')
    cuttest('邓颖超生前最喜欢的衣服')
    cuttest('胡锦涛是热爱世界和平的政治局常委')
    cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
    cuttest('一次性交多少钱')
    cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
    cuttest('小和尚留了一个像大和尚一样的和尚头')
    cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
    cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
    cuttest('AT&T是一件不错的公司,给你发offer了吗?')
    cuttest('C++和c#是什么关系?11+122=133,是吗?PI=3.14159')
    cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
    cuttest('枪杆子中出政权')
    cuttest('张三风同学走上了不归路')
    cuttest('阿Q腰间挂着BB机手里拿着大哥大,说:我一般吃饭不AA制的。')
    cuttest('在1号店能买到小S和大S八卦的书,还有3D电视。')
    jieba.del_word('很赞')
    cuttest('看上去iphone8手机样式很赞,售价699美元,销量涨了5%么?')
test/test_paddle_postag.py (new file, 102 lines)

@@ -0,0 +1,102 @@
#encoding=utf-8
import sys
sys.path.append("../")
import jieba.posseg as pseg
import jieba
jieba.enable_paddle()

def cuttest(test_sent):
    result = pseg.cut(test_sent, use_paddle=True)
    for word, flag in result:
        print('%s %s' % (word, flag))


if __name__ == "__main__":
    cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。")
    cuttest("我不喜欢日本和服。")
    cuttest("雷猴回归人间。")
    cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
    cuttest("我需要廉租房")
    cuttest("永和服装饰品有限公司")
    cuttest("我爱北京天安门")
    cuttest("abc")
    cuttest("隐马尔可夫")
    cuttest("雷猴是个好网站")
    cuttest("“Microsoft”一词由“MICROcomputer(微型计算机)”和“SOFTware(软件)”两部分组成")
    cuttest("草泥马和欺实马是今年的流行词汇")
    cuttest("伊藤洋华堂总府店")
    cuttest("中国科学院计算技术研究所")
    cuttest("罗密欧与朱丽叶")
    cuttest("我购买了道具和服装")
    cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
    cuttest("湖北省石首市")
    cuttest("湖北省十堰市")
    cuttest("总经理完成了这件事情")
    cuttest("电脑修好了")
    cuttest("做好了这件事情就一了百了了")
    cuttest("人们审美的观点是不同的")
    cuttest("我们买了一个美的空调")
    cuttest("线程初始化时我们要注意")
    cuttest("一个分子是由好多原子组织成的")
    cuttest("祝你马到功成")
    cuttest("他掉进了无底洞里")
    cuttest("中国的首都是北京")
    cuttest("孙君意")
    cuttest("外交部发言人马朝旭")
    cuttest("领导人会议和第四届东亚峰会")
    cuttest("在过去的这五年")
    cuttest("还需要很长的路要走")
    cuttest("60周年首都阅兵")
    cuttest("你好人们审美的观点是不同的")
    cuttest("买水果然后来世博园")
    cuttest("买水果然后去世博园")
    cuttest("但是后来我才知道你是对的")
    cuttest("存在即合理")
    cuttest("的的的的的在的的的的就以和和和")
    cuttest("I love你,不以为耻,反以为rong")
    cuttest("因")
    cuttest("")
    cuttest("hello你好人们审美的观点是不同的")
    cuttest("很好但主要是基于网页形式")
    cuttest("hello你好人们审美的观点是不同的")
    cuttest("为什么我不能拥有想要的生活")
    cuttest("后来我才")
    cuttest("此次来中国是为了")
    cuttest("使用了它就可以解决一些问题")
    cuttest(",使用了它就可以解决一些问题")
    cuttest("其实使用了它就可以解决一些问题")
    cuttest("好人使用了它就可以解决一些问题")
    cuttest("是因为和国家")
    cuttest("老年搜索还支持")
    cuttest("干脆就把那部蒙人的闲法给废了拉倒!RT @laoshipukong : 27日,全国人大常委会第三次审议侵权责任法草案,删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
    cuttest("大")
    cuttest("")
    cuttest("他说的确实在理")
    cuttest("长春市长春节讲话")
    cuttest("结婚的和尚未结婚的")
    cuttest("结合成分子时")
    cuttest("旅游和服务是最好的")
    cuttest("这件事情的确是我的错")
    cuttest("供大家参考指正")
    cuttest("哈尔滨政府公布塌桥原因")
    cuttest("我在机场入口处")
    cuttest("邢永臣摄影报道")
    cuttest("BP神经网络如何训练才能在分类时增加区分度?")
    cuttest("南京市长江大桥")
    cuttest("应一些使用者的建议,也为了便于利用NiuTrans用于SMT研究")
    cuttest('长春市长春药店')
    cuttest('邓颖超生前最喜欢的衣服')
    cuttest('胡锦涛是热爱世界和平的政治局常委')
    cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
    cuttest('一次性交多少钱')
    cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
    cuttest('小和尚留了一个像大和尚一样的和尚头')
    cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
    cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
    cuttest('AT&T是一件不错的公司,给你发offer了吗?')
    cuttest('C++和c#是什么关系?11+122=133,是吗?PI=3.14159')
    cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
    cuttest('枪杆子中出政权')
    cuttest('张三风同学走上了不归路')
    cuttest('阿Q腰间挂着BB机手里拿着大哥大,说:我一般吃饭不AA制的。')
    cuttest('在1号店能买到小S和大S八卦的书,还有3D电视。')
@@ -6,7 +6,7 @@ from whoosh.index import create_in,open_dir
from whoosh.fields import *
from whoosh.qparser import QueryParser

from jieba.analyse import ChineseAnalyzer
from jieba.analyse.analyzer import ChineseAnalyzer

analyzer = ChineseAnalyzer()