Compare commits

...

33 Commits

Author SHA1 Message Date
Neutrino
67fa2e36e7
Update README.md update paddle link. (#817) 2020-02-15 16:33:35 +08:00
fxsjy
1e20c89b66 fix setup.py in python2.7 2020-01-20 22:22:34 +08:00
fxsjy
5704e23bbf update version: 0.42 2020-01-13 21:24:45 +08:00
fxsjy
aa65031788 fix file mode 2020-01-13 21:03:38 +08:00
fxsjy
2eb11c8028 fix issue #810 2020-01-13 20:53:43 +08:00
JesseyXujin
d703bce302 paddle coredump exception fix (#807)
* paddle_null_point_fix

* add core exception note

* delete yield

* modify test paddle for supporting enable_paddle()
2020-01-10 16:30:46 +08:00
vissssa
dc2b788eb3 refactor: improvement check_paddle_installed (#806) 2020-01-09 19:23:11 +08:00
fxsjy
0868c323d9 update version in __init__.py 2020-01-08 16:21:07 +08:00
fxsjy
eb37e048da update version to 0.41 2020-01-08 16:04:30 +08:00
JesseyXujin
381b0691ac Add enable_paddle interface to install paddle and import packages (#802)
* enable_paddle_interface

* Add enable_paddle interface to install paddle and import packages

* Add enable_paddle interface to install paddle and import packages

* add posseg lcut for paddle mode

* fix vocabulary
2020-01-08 15:26:12 +08:00
fxsjy
97c32464e1 fix issue #798 2020-01-03 14:10:48 +08:00
Tim Gates
0489a6979e Fix simple typo: vocabuary -> vocabulary (#797)
Closes #796
2020-01-02 10:26:00 +08:00
JesseyXujin
30ea8f929e Simplify Paddle import check (#795) 2019-12-31 15:03:14 +08:00
JesseyXujin
0b74b6c2de add jieba upgrade note in README.md and change import imp to import importlib in _compat.py (#794) 2019-12-31 14:14:50 +08:00
Sun Junyi
2fdee89883
Update README.md 2019-12-30 17:11:22 +08:00
JesseyXujin
17bab6a2d1 Improve the error reporting of the paddle version check (#790) 2019-12-25 19:46:49 +08:00
Sun Junyi
80947ff843
Update Changelog 2019-12-25 10:49:02 +08:00
fxsjy
68ce6955b7 update version to 0.40 2019-12-25 10:35:22 +08:00
fxsjy
d47e14e5b3 update version 2019-12-25 10:34:18 +08:00
pkpk
27910094ac Fix bugs in Paddle seg and Paddle postag (#789)
* fix bugs in paddle seg and paddle postag

* fix compat in checking paddle
2019-12-24 21:02:55 +08:00
Sun Junyi
9dc8e6d992
Update README.md 2019-12-24 19:29:17 +08:00
fxsjy
478c3b9bb4 lazy import paddle 2019-12-24 19:19:51 +08:00
JesseyXujin
5b3bb4b7f2 Add paddle-based segmentation and POS tagging (#788)
* paddle cut release

* Update README.md to tell users to install paddlepaddle-tiny

* Remove the UTF coding header from the two __init__.py files

* Polish README details
2019-12-24 17:27:41 +08:00
Hongxiang Lin
38134ee20f Fix the bug in suggest_freq where add_word pointed to the global tokenizer (#723) 2019-07-01 19:43:45 +08:00
Paul Meng
3645a5bb5d Update README.md (#745) 2019-07-01 19:41:47 +08:00
Sun Junyi
8212b6c572
Update README.md 2018-12-03 16:29:32 +08:00
Sun Junyi
843cdc2b7c
Merge pull request #582 from hosiet/pr-fix-typo-codespell
Fix typos found by codespell
2018-09-20 10:44:47 +08:00
Sun Junyi
68f2a64f7e
Merge pull request #663 from JimCurryWang/patch-1
Fix  __init__ "-" symbol issue
2018-09-20 10:40:35 +08:00
Sun Junyi
4c8479cfa6
Merge pull request #667 from ZhengZixiang/patch-1
fix the error about importing ChineseAnalyzer
2018-09-20 10:39:29 +08:00
imzhengzx
ca444fb4da
fix the error about importing ChineseAnalyzer
Because of the interface change to ChineseAnalyzer, the line 'from jieba.analyse import ChineseAnalyzer' in this test file raises an ImportError like "cannot import name 'ChineseAnalyzer'". Changing the import to 'from jieba.analyse.analyzer import ChineseAnalyzer' fixes it.
2018-09-15 11:59:01 +08:00
CY Wang
36a27302ce
Fix __init__ "-" symbol issue
Words containing the "-" symbol could not be analyzed.

For example, for keywords such as chap-EX喬沛詩 and SK-II,
the current version yields "chap", "-", "EX喬沛詩", "SK", "-", "II".

After this change, the new version yields "chap-EX", "喬沛詩", "SK-II".

PS: I used jieba.load_userdict() and added "chap-EX", "喬沛詩", and "SK-II" to userdict.txt.
2018-08-27 17:05:46 +08:00
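A minimal sketch of the workflow described above (the dictionary file name and its entries are illustrative, taken from the commit message):

```python
import jieba

# userdict.txt (illustrative) lists one entry per line:
#   chap-EX
#   喬沛詩
#   SK-II
jieba.load_userdict("userdict.txt")

print(jieba.lcut("chap-EX喬沛詩"))  # expected: ['chap-EX', '喬沛詩']
print(jieba.lcut("SK-II"))          # expected: ['SK-II']
```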
Sun Junyi
7653db2e33
Update README.md 2018-07-04 17:18:02 +08:00
Boyuan Yang
17ef8abba3
Fix typos found by codespell 2018-01-21 19:15:48 +08:00
38 changed files with 21882 additions and 53 deletions


@ -1,3 +1,21 @@
2020-1-20: version 0.42.1
1. Fixed setup.py not working under Python 2.7 (issue #809)
2020-1-13: version 0.42
1. Fixed a paddle-mode coredump on empty strings @JesseyXujin
2. Fixed dropped characters in cut_all mode @fxsjy
3. Improved the paddle installation check @vissssa
2020-1-8: version 0.41
1. Made enabling paddle mode friendlier
2. Fixed cut_all mode not supporting mixed Chinese-English words
2019-12-25: version 0.40
1. Added a paddle-based deep-learning segmentation mode (use_paddle=True); by @JesseyXujin, @xyzhou-puck
2. Fixed the add_word method of custom Tokenizer instances pointing to the global tokenizer; by @linhx13
3. Fixed the broken import in the whoosh test case; by @ZhengZixiang
4. Fixed custom dictionaries not supporting words containing the "-" symbol; by @JimCurryWang
2017-08-28: version 0.39
1. del_word can now force a word to be split apart; by @gumblex, @fxsjy
2. Fixed segmentation of percentages; by @fxsjy


@ -9,24 +9,15 @@ jieba
Features
========
* The following segmentation modes are supported:
    * Accurate mode: tries to cut the sentence into the most precise segmentation; suitable for text analysis;
    * Full mode: scans out every word in the sentence that can form a word; very fast, but cannot resolve ambiguity;
    * Search-engine mode: on top of accurate mode, long words are cut again to improve recall; suitable for search-engine tokenization.
    * Paddle mode: segments with a bidirectional GRU sequence-labeling model trained on the PaddlePaddle deep-learning framework, and also supports POS tagging. Paddle mode requires paddlepaddle-tiny: `pip install paddlepaddle-tiny==1.6.1`. It is available in jieba v0.40 and above; earlier versions should upgrade with `pip install jieba --upgrade`. [PaddlePaddle website](https://www.paddlepaddle.org.cn/)
* Traditional Chinese segmentation is supported
* Custom dictionaries are supported
* MIT license
Online demo
=========
http://jiebademo.ap01.aws.af.cm/
(Powered by Appfog)
Site code: https://github.com/fxsjy/jiebademo
Installation
=======
@ -36,6 +27,7 @@ http://jiebademo.ap01.aws.af.cm/
* Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ , extract it, and run `python setup.py install`
* Manual installation: place the jieba directory in the current directory or in site-packages
* Use `import jieba` to load it
* To use paddle-mode segmentation and POS tagging, first install paddlepaddle-tiny: `pip install paddlepaddle-tiny==1.6.1`
Algorithm
========
@ -47,7 +39,7 @@ http://jiebademo.ap01.aws.af.cm/
=======
1. Word segmentation
--------
* The `jieba.cut` method takes four arguments: the string to be segmented; `cut_all`, controlling whether full mode is used; `HMM`, controlling whether the HMM model is used; and `use_paddle`, controlling whether paddle-mode segmentation is used. Paddle mode is loaded lazily: `enable_paddle()` installs paddlepaddle-tiny and imports the related code
* The `jieba.cut_for_search` method takes two arguments: the string to be segmented and whether to use the HMM model. It is suited to building inverted indexes for search engines, with finer granularity
* The string to be segmented can be a unicode, UTF-8, or GBK string. Note: feeding GBK strings directly is not recommended, as they may be unpredictably mis-decoded as UTF-8
* `jieba.cut` and `jieba.cut_for_search` both return an iterable generator; use a for loop to obtain each word (unicode), or use
@ -60,6 +52,12 @@ http://jiebademo.ap01.aws.af.cm/
# encoding=utf-8
import jieba

jieba.enable_paddle()  # enable paddle mode (supported since v0.40; not available in earlier versions)
strs=["我来到北京清华大学","乒乓球拍卖完了","中国科学技术大学"]
for str in strs:
    seg_list = jieba.cut(str, use_paddle=True)  # use paddle mode
    print("Paddle Mode: " + '/'.join(list(seg_list)))

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode
@ -195,11 +193,15 @@ https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
-----------
* `jieba.posseg.POSTokenizer(tokenizer=None)` creates a custom POS tokenizer; the `tokenizer` argument specifies the internally used `jieba.Tokenizer`. `jieba.posseg.dt` is the default POS tokenizer.
* Tags the part of speech of every word after segmentation, using labels compatible with ictclas.
* Besides jieba's default mode, POS tagging is also available in paddle mode. Paddle mode is loaded lazily: `enable_paddle()` installs paddlepaddle-tiny and imports the related code.
* Usage example
```pycon
>>> import jieba
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")  # jieba default mode
>>> jieba.enable_paddle()  # enable paddle mode (supported since v0.40; not available in earlier versions)
>>> words = pseg.cut("我爱北京天安门", use_paddle=True)  # paddle mode
>>> for word, flag in words:
...    print('%s %s' % (word, flag))
...
@ -209,6 +211,21 @@ https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
天安门 ns
```
The paddle-mode POS tag table is as follows:
The set of POS and named-entity labels used in paddle mode is listed below; there are 24 POS tags (lowercase) and 4 named-entity tags (uppercase).
| Tag | Meaning | Tag | Meaning | Tag | Meaning | Tag | Meaning |
| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- |
| n | common noun | f | locality noun | s | place noun | t | time |
| nr | person name | ns | place name | nt | organization name | nw | work title |
| nz | other proper noun | v | common verb | vd | verb-adverb | vn | verbal noun |
| a | adjective | ad | adverbial adjective | an | nominal adjective | d | adverb |
| m | numeral | q | measure word | r | pronoun | p | preposition |
| c | conjunction | u | particle | xc | other function word | w | punctuation |
| PER | person | LOC | location | ORG | organization | TIME | time |
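Because the uppercase labels mark named entities, paddle-mode POS output can double as a lightweight entity filter. A minimal sketch, assuming paddlepaddle-tiny is installed and `enable_paddle()` succeeds:

```python
import jieba
import jieba.posseg as pseg

jieba.enable_paddle()
entities = [(word, flag) for word, flag in pseg.cut("我爱北京天安门", use_paddle=True)
            if flag in ("PER", "LOC", "ORG", "TIME")]
print(entities)  # e.g. [('北京', 'LOC'), ('天安门', 'LOC')]
```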
5. Parallel segmentation
-----------
* How it works: split the target text by line, hand the lines to multiple Python processes to segment in parallel, then merge the results, giving a considerable speed-up
@ -362,6 +379,11 @@ https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Author: yanyiwu
Repo: https://github.com/yanyiwu/cppjieba
Jieba: Rust version
----------------
Authors: messense, MnO2
Repo: https://github.com/messense/jieba-rs
Jieba: Node.js version
----------------
Author: yanyiwu
@ -398,6 +420,17 @@ https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
+ Author: wangbin  Repo: https://github.com/wangbin/jiebago
+ Author: yanyiwu  Repo: https://github.com/yanyiwu/gojieba
Jieba: Android version
------------------
+ Author: Dongliang.W  Repo: https://github.com/452896915/jieba-android
Friendly links
=========
* https://github.com/baidu/lac — Baidu Chinese lexical analysis (segmentation + POS tagging + NER)
* https://github.com/baidu/AnyQ — Baidu FAQ automatic question answering system
* https://github.com/baidu/Senta — Baidu sentiment analysis system
System integration
========
1. Solr: https://github.com/sing1ee/jieba-solr


@ -1,19 +1,18 @@
from __future__ import absolute_import, unicode_literals
__version__ = '0.39'
__version__ = '0.42.1'
__license__ = 'MIT'
import re
import os
import sys
import time
import logging
import marshal
import re
import tempfile
import threading
import time
from hashlib import md5
from math import log
from . import finalseg
from ._compat import *

if os.name == 'nt':
    from shutil import move as _replace_file
@ -40,15 +39,17 @@ re_eng = re.compile('[a-zA-Z0-9]', re.U)
# \u4E00-\u9FD5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
# \r\n|\s : whitespace characters. Will not be handled.
# re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
# Adding "-" symbol in re_han_default
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)
re_skip_default = re.compile("(\r\n|\s)", re.U)
re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)
re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)
def setLogLevel(log_level):
    global logger
    default_logger.setLevel(log_level)


class Tokenizer(object):
    def __init__(self, dictionary=DEFAULT_DICT):
@ -67,7 +68,8 @@ class Tokenizer(object):
    def __repr__(self):
        return '<Tokenizer dictionary=%r>' % self.dictionary

    @staticmethod
    def gen_pfdict(f):
        lfreq = {}
        ltotal = 0
        f_name = resolve_filename(f)
@ -161,7 +163,7 @@ class Tokenizer(object):
            self.initialized = True
            default_logger.debug(
                "Loading model cost %.3f seconds." % (time.time() - t1))
            default_logger.debug("Prefix dict has been built successfully.")

    def check_initialized(self):
        if not self.initialized:
@ -196,15 +198,30 @@ class Tokenizer(object):
    def __cut_all(self, sentence):
        dag = self.get_DAG(sentence)
        old_j = -1
        eng_scan = 0
        eng_buf = u''
        for k, L in iteritems(dag):
            if eng_scan == 1 and not re_eng.match(sentence[k]):
                eng_scan = 0
                yield eng_buf
            if len(L) == 1 and k > old_j:
                word = sentence[k:L[0] + 1]
                if re_eng.match(word):
                    if eng_scan == 0:
                        eng_scan = 1
                        eng_buf = word
                    else:
                        eng_buf += word
                if eng_scan == 0:
                    yield word
                old_j = L[0]
            else:
                for j in L:
                    if j > k:
                        yield sentence[k:j + 1]
                        old_j = j
        if eng_scan == 1:
            yield eng_buf

    def __cut_DAG_NO_HMM(self, sentence):
        DAG = self.get_DAG(sentence)
@ -269,22 +286,29 @@ class Tokenizer(object):
                for elem in buf:
                    yield elem

    def cut(self, sentence, cut_all=False, HMM=True, use_paddle=False):
        """
        The main function that segments an entire sentence that contains
        Chinese characters into separated words.

        Parameter:
            - sentence: The str(unicode) to be segmented.
            - cut_all: Model type. True for full pattern, False for accurate pattern.
            - HMM: Whether to use the Hidden Markov Model.
        """
        is_paddle_installed = check_paddle_install['is_paddle_installed']
        sentence = strdecode(sentence)
        if use_paddle and is_paddle_installed:
            # if sentence is null, it will raise core exception in paddle.
            if sentence is None or len(sentence) == 0:
                return
            import jieba.lac_small.predict as predict
            results = predict.get_sent(sentence)
            for sent in results:
                if sent is None:
                    continue
                yield sent
            return
        re_han = re_han_default
        re_skip = re_skip_default
        if cut_all:
@ -446,7 +470,7 @@ class Tokenizer(object):
                freq *= self.FREQ.get(seg, 1) / ftotal
        freq = min(int(freq * self.total), self.FREQ.get(word, 0))
        if tune:
            self.add_word(word, freq)
        return freq

    def tokenize(self, unicode_sentence, mode="default", HMM=True):
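The `__cut_all` rewrite above buffers consecutive ASCII letters and digits (`eng_scan`/`eng_buf`) so that full mode keeps an embedded English run such as "abc" together instead of dropping or splitting it. A quick check, as a sketch assuming jieba >= 0.42 is installed:

```python
import jieba

# The English run embedded in the Chinese text should come back as one token in full mode.
print(jieba.lcut("我来到北京abc清华大学", cut_all=True))
```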


@ -1,15 +1,56 @@
# -*- coding: utf-8 -*-
import logging
import os
import sys

log_console = logging.StreamHandler(sys.stderr)
default_logger = logging.getLogger(__name__)
default_logger.setLevel(logging.DEBUG)


def setLogLevel(log_level):
    default_logger.setLevel(log_level)

check_paddle_install = {'is_paddle_installed': False}
try:
    import pkg_resources
    get_module_res = lambda *res: pkg_resources.resource_stream(__name__,
                                                                os.path.join(*res))
except ImportError:
    get_module_res = lambda *res: open(os.path.normpath(os.path.join(
        os.getcwd(), os.path.dirname(__file__), *res)), 'rb')
def enable_paddle():
    try:
        import paddle
    except ImportError:
        default_logger.debug("Installing paddle-tiny, please wait a minute......")
        os.system("pip install paddlepaddle-tiny")
        try:
            import paddle
        except ImportError:
            default_logger.debug(
                "Import paddle error, please use command to install: pip install paddlepaddle-tiny==1.6.1."
                "Now, back to jieba basic cut......")
    if paddle.__version__ < '1.6.1':
        default_logger.debug("Find your own paddle version doesn't satisfy the minimum requirement (1.6.1), "
                             "please install paddle tiny by 'pip install --upgrade paddlepaddle-tiny', "
                             "or upgrade paddle full version by "
                             "'pip install --upgrade paddlepaddle (-gpu for GPU version)' ")
    else:
        try:
            import jieba.lac_small.predict as predict
            default_logger.debug("Paddle enabled successfully......")
            check_paddle_install['is_paddle_installed'] = True
        except ImportError:
            default_logger.debug("Import error, cannot find paddle.fluid and jieba.lac_small.predict module. "
                                 "Now, back to jieba basic cut......")
PY2 = sys.version_info[0] == 2
default_encoding = sys.getfilesystemencoding()
@ -31,6 +72,7 @@ else:
    itervalues = lambda d: iter(d.values())
    iteritems = lambda d: iter(d.items())


def strdecode(sentence):
    if not isinstance(sentence, text_type):
        try:
@ -39,6 +81,7 @@ def strdecode(sentence):
            sentence = sentence.decode('gbk', 'ignore')
    return sentence


def resolve_filename(f):
    try:
        return f.name
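`enable_paddle()` installs paddlepaddle-tiny on demand and flips `check_paddle_install['is_paddle_installed']`, which the cut functions consult. Typical use, as a minimal sketch mirroring the README example:

```python
# encoding=utf-8
import jieba

jieba.enable_paddle()  # installs/imports paddlepaddle-tiny on first use
for sent in ["我来到北京清华大学", "乒乓球拍卖完了"]:
    print("Paddle Mode: " + "/".join(jieba.cut(sent, use_paddle=True)))
```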


@ -0,0 +1,46 @@
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Define the function to create lexical analysis model and model's data reader
"""
import sys
import os
import math
import paddle
import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
import jieba.lac_small.nets as nets
def create_model(vocab_size, num_labels, mode='train'):
"""create lac model"""
# model's input data
words = fluid.data(name='words', shape=[-1, 1], dtype='int64', lod_level=1)
targets = fluid.data(
name='targets', shape=[-1, 1], dtype='int64', lod_level=1)
# for inference process
if mode == 'infer':
crf_decode = nets.lex_net(
words, vocab_size, num_labels, for_infer=True, target=None)
return {
"feed_list": [words],
"words": words,
"crf_decode": crf_decode,
}
return ret

20 binary files not shown.

jieba/lac_small/nets.py Normal file

@ -0,0 +1,122 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The function lex_net(args) define the lexical analysis network structure
"""
import sys
import os
import math
import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
def lex_net(word, vocab_size, num_labels, for_infer=True, target=None):
"""
define the lexical analysis network structure
word: stores the input of the model
for_infer: a boolean value, indicating if the model to be created is for training or predicting.
return:
for infer: return the prediction
otherwise: return the prediction
"""
word_emb_dim=128
grnn_hidden_dim=128
bigru_num=2
emb_lr = 1.0
crf_lr = 1.0
init_bound = 0.1
IS_SPARSE = True
def _bigru_layer(input_feature):
"""
define the bidirectional gru layer
"""
pre_gru = fluid.layers.fc(
input=input_feature,
size=grnn_hidden_dim * 3,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
gru = fluid.layers.dynamic_gru(
input=pre_gru,
size=grnn_hidden_dim,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
pre_gru_r = fluid.layers.fc(
input=input_feature,
size=grnn_hidden_dim * 3,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
gru_r = fluid.layers.dynamic_gru(
input=pre_gru_r,
size=grnn_hidden_dim,
is_reverse=True,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
bi_merge = fluid.layers.concat(input=[gru, gru_r], axis=1)
return bi_merge
def _net_conf(word, target=None):
"""
Configure the network
"""
word_embedding = fluid.embedding(
input=word,
size=[vocab_size, word_emb_dim],
dtype='float32',
is_sparse=IS_SPARSE,
param_attr=fluid.ParamAttr(
learning_rate=emb_lr,
name="word_emb",
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound)))
input_feature = word_embedding
for i in range(bigru_num):
bigru_output = _bigru_layer(input_feature)
input_feature = bigru_output
emission = fluid.layers.fc(
size=num_labels,
input=bigru_output,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
size = emission.shape[1]
fluid.layers.create_parameter(
shape=[size + 2, size], dtype=emission.dtype, name='crfw')
crf_decode = fluid.layers.crf_decoding(
input=emission, param_attr=fluid.ParamAttr(name='crfw'))
return crf_decode
return _net_conf(word)


@ -0,0 +1,82 @@
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import time
import sys
import paddle.fluid as fluid
import paddle
import jieba.lac_small.utils as utils
import jieba.lac_small.creator as creator
import jieba.lac_small.reader_small as reader_small
import numpy
word_emb_dim=128
grnn_hidden_dim=128
bigru_num=2
use_cuda=False
basepath = os.path.abspath(__file__)
folder = os.path.dirname(basepath)
init_checkpoint = os.path.join(folder, "model_baseline")
batch_size=1
dataset = reader_small.Dataset()
infer_program = fluid.Program()
with fluid.program_guard(infer_program, fluid.default_startup_program()):
with fluid.unique_name.guard():
infer_ret = creator.create_model(dataset.vocab_size, dataset.num_labels, mode='infer')
infer_program = infer_program.clone(for_test=True)
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
utils.init_checkpoint(exe, init_checkpoint, infer_program)
results = []
def get_sent(str1):
feed_data=dataset.get_vars(str1)
a = numpy.array(feed_data).astype(numpy.int64)
a=a.reshape(-1,1)
c = fluid.create_lod_tensor(a, [[a.shape[0]]], place)
words, crf_decode = exe.run(
infer_program,
fetch_list=[infer_ret['words'], infer_ret['crf_decode']],
feed={"words":c, },
return_numpy=False,
use_program_cache=True)
sents=[]
sent,tag = utils.parse_result(words, crf_decode, dataset)
sents = sents + sent
return sents
def get_result(str1):
feed_data=dataset.get_vars(str1)
a = numpy.array(feed_data).astype(numpy.int64)
a=a.reshape(-1,1)
c = fluid.create_lod_tensor(a, [[a.shape[0]]], place)
words, crf_decode = exe.run(
infer_program,
fetch_list=[infer_ret['words'], infer_ret['crf_decode']],
feed={"words":c, },
return_numpy=False,
use_program_cache=True)
results=[]
results += utils.parse_result(words, crf_decode, dataset)
return results
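`get_sent` and `get_result` are what `jieba.cut` and `jieba.posseg.cut` call in paddle mode; they can also be used directly. A sketch, assuming paddlepaddle-tiny is installed so the module imports (the model is loaded at import time):

```python
import jieba.lac_small.predict as predict  # triggers model loading on import

print(predict.get_sent("我爱北京天安门"))          # list of segmented words
words, tags = predict.get_result("我爱北京天安门")  # parallel lists of words and labels
print(list(zip(words, tags)))
```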


@ -0,0 +1,100 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The file_reader converts raw corpus to input.
"""
import os
import __future__
import io
import paddle
import paddle.fluid as fluid
def load_kv_dict(dict_path,
reverse=False,
delimiter="\t",
key_func=None,
value_func=None):
"""
Load key-value dict from file
"""
result_dict = {}
for line in io.open(dict_path, "r", encoding='utf8'):
terms = line.strip("\n").split(delimiter)
if len(terms) != 2:
continue
if reverse:
value, key = terms
else:
key, value = terms
if key in result_dict:
raise KeyError("key duplicated with [%s]" % (key))
if key_func:
key = key_func(key)
if value_func:
value = value_func(value)
result_dict[key] = value
return result_dict
class Dataset(object):
"""data reader"""
def __init__(self):
# read dict
basepath = os.path.abspath(__file__)
folder = os.path.dirname(basepath)
word_dict_path = os.path.join(folder, "word.dic")
label_dict_path = os.path.join(folder, "tag.dic")
self.word2id_dict = load_kv_dict(
word_dict_path, reverse=True, value_func=int)
self.id2word_dict = load_kv_dict(word_dict_path)
self.label2id_dict = load_kv_dict(
label_dict_path, reverse=True, value_func=int)
self.id2label_dict = load_kv_dict(label_dict_path)
@property
def vocab_size(self):
"""vocabulary size"""
return max(self.word2id_dict.values()) + 1
@property
def num_labels(self):
"""num_labels"""
return max(self.label2id_dict.values()) + 1
def word_to_ids(self, words):
"""convert word to word index"""
word_ids = []
for word in words:
if word not in self.word2id_dict:
word = "OOV"
word_id = self.word2id_dict[word]
word_ids.append(word_id)
return word_ids
def label_to_ids(self, labels):
"""convert label to label index"""
label_ids = []
for label in labels:
if label not in self.label2id_dict:
label = "O"
label_id = self.label2id_dict[label]
label_ids.append(label_id)
return label_ids
def get_vars(self,str1):
words = str1.strip()
word_ids = self.word_to_ids(words)
return word_ids
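`Dataset.word_to_ids` maps each character to an id and falls back to the "OOV" entry for unknown characters. The same lookup pattern in isolation, as a toy sketch with a made-up dictionary (the real ids come from jieba/lac_small/word.dic):

```python
# Hypothetical miniature vocabulary; only the fallback logic matters here.
word2id = {"OOV": 0, "我": 1, "爱": 2, "北": 3, "京": 4}

def word_to_ids(text):
    # unknown characters map to the "OOV" id, as in Dataset.word_to_ids
    return [word2id.get(ch, word2id["OOV"]) for ch in text]

print(word_to_ids("我爱北京天安门"))  # -> [1, 2, 3, 4, 0, 0, 0]
```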

jieba/lac_small/tag.dic Normal file

@ -0,0 +1,57 @@
0 a-B
1 a-I
2 ad-B
3 ad-I
4 an-B
5 an-I
6 c-B
7 c-I
8 d-B
9 d-I
10 f-B
11 f-I
12 m-B
13 m-I
14 n-B
15 n-I
16 nr-B
17 nr-I
18 ns-B
19 ns-I
20 nt-B
21 nt-I
22 nw-B
23 nw-I
24 nz-B
25 nz-I
26 p-B
27 p-I
28 q-B
29 q-I
30 r-B
31 r-I
32 s-B
33 s-I
34 t-B
35 t-I
36 u-B
37 u-I
38 v-B
39 v-I
40 vd-B
41 vd-I
42 vn-B
43 vn-I
44 w-B
45 w-I
46 xc-B
47 xc-I
48 PER-B
49 PER-I
50 LOC-B
51 LOC-I
52 ORG-B
53 ORG-I
54 TIME-B
55 TIME-I
56 O

jieba/lac_small/utils.py Normal file

@ -0,0 +1,142 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
util tools
"""
from __future__ import print_function
import os
import sys
import numpy as np
import paddle.fluid as fluid
import io
def str2bool(v):
"""
argparse does not support True or False in python
"""
return v.lower() in ("true", "t", "1")
def parse_result(words, crf_decode, dataset):
""" parse result """
offset_list = (crf_decode.lod())[0]
words = np.array(words)
crf_decode = np.array(crf_decode)
batch_size = len(offset_list) - 1
for sent_index in range(batch_size):
begin, end = offset_list[sent_index], offset_list[sent_index + 1]
sent=[]
for id in words[begin:end]:
if dataset.id2word_dict[str(id[0])]=='OOV':
sent.append(' ')
else:
sent.append(dataset.id2word_dict[str(id[0])])
tags = [
dataset.id2label_dict[str(id[0])] for id in crf_decode[begin:end]
]
sent_out = []
tags_out = []
parital_word = ""
for ind, tag in enumerate(tags):
# for the first word
if parital_word == "":
parital_word = sent[ind]
tags_out.append(tag.split('-')[0])
continue
# for the beginning of word
if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
sent_out.append(parital_word)
tags_out.append(tag.split('-')[0])
parital_word = sent[ind]
continue
parital_word += sent[ind]
# append the last word, except for len(tags)=0
if len(sent_out) < len(tags_out):
sent_out.append(parital_word)
return sent_out,tags_out
def parse_padding_result(words, crf_decode, seq_lens, dataset):
""" parse padding result """
words = np.squeeze(words)
batch_size = len(seq_lens)
batch_out = []
for sent_index in range(batch_size):
sent=[]
for id in words[begin:end]:
if dataset.id2word_dict[str(id[0])]=='OOV':
sent.append(' ')
else:
sent.append(dataset.id2word_dict[str(id[0])])
tags = [
dataset.id2label_dict[str(id)]
for id in crf_decode[sent_index][1:seq_lens[sent_index] - 1]
]
sent_out = []
tags_out = []
parital_word = ""
for ind, tag in enumerate(tags):
# for the first word
if parital_word == "":
parital_word = sent[ind]
tags_out.append(tag.split('-')[0])
continue
# for the beginning of word
if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
sent_out.append(parital_word)
tags_out.append(tag.split('-')[0])
parital_word = sent[ind]
continue
parital_word += sent[ind]
# append the last word, except for len(tags)=0
if len(sent_out) < len(tags_out):
sent_out.append(parital_word)
batch_out.append([sent_out, tags_out])
return batch_out
def init_checkpoint(exe, init_checkpoint_path, main_program):
"""
Init CheckPoint
"""
assert os.path.exists(
init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path
def existed_persitables(var):
"""
If existed presitabels
"""
if not fluid.io.is_persistable(var):
return False
return os.path.exists(os.path.join(init_checkpoint_path, var.name))
fluid.io.load_vars(
exe,
init_checkpoint_path,
main_program=main_program,
predicate=existed_persitables)
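`parse_result` rebuilds words from per-character labels: a tag ending in "-B" starts a new word, "-I" continues the current one, and the prefix before the dash becomes the word's POS or entity flag. The merging idea on its own, as a simplified toy sketch with hand-written tags (no paddle required):

```python
chars = ["我", "爱", "北", "京", "天", "安", "门"]
tags = ["r-B", "v-B", "LOC-B", "LOC-I", "LOC-B", "LOC-I", "LOC-I"]

words, flags = [], []
for ch, tag in zip(chars, tags):
    if tag.endswith("-B") or not words:
        words.append(ch)                  # start a new word
        flags.append(tag.split("-")[0])   # keep the label prefix as the flag
    else:
        words[-1] += ch                   # "-I": extend the current word

print(list(zip(words, flags)))
# -> [('我', 'r'), ('爱', 'v'), ('北京', 'LOC'), ('天安门', 'LOC')]
```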

jieba/lac_small/word.dic Normal file

File diff suppressed because it is too large.

jieba/posseg/__init__.py Normal file → Executable file

@ -1,11 +1,11 @@
from __future__ import absolute_import, unicode_literals
import os
import re
import sys
import jieba
import pickle
import re
import jieba
from .viterbi import viterbi
from .._compat import *
PROB_START_P = "prob_start.p"
PROB_TRANS_P = "prob_trans.p"
@ -252,6 +252,7 @@ class POSTokenizer(object):
    def lcut(self, *args, **kwargs):
        return list(self.cut(*args, **kwargs))


# default Tokenizer instance
dt = POSTokenizer(jieba.dt)
@ -269,13 +270,25 @@ def _lcut_internal_no_hmm(s):
    return dt._lcut_internal_no_hmm(s)


def cut(sentence, HMM=True, use_paddle=False):
    """
    Global `cut` function that supports parallel processing.

    Note that this only works using dt, custom POSTokenizer
    instances are not supported.
    """
    is_paddle_installed = check_paddle_install['is_paddle_installed']
    if use_paddle and is_paddle_installed:
        # if sentence is null, it will raise core exception in paddle.
        if sentence is None or sentence == "" or sentence == u"":
            return
        import jieba.lac_small.predict as predict
        sents, tags = predict.get_result(strdecode(sentence))
        for i, sent in enumerate(sents):
            if sent is None or tags[i] is None:
                continue
            yield pair(sent, tags[i])
        return
    global dt
    if jieba.pool is None:
        for w in dt.cut(sentence, HMM=HMM):
@ -291,5 +304,7 @@ def cut(sentence, HMM=True):
                yield w


def lcut(sentence, HMM=True, use_paddle=False):
    if use_paddle:
        return list(cut(sentence, use_paddle=True))
    return list(cut(sentence, HMM))
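With the new keyword, the list-returning helper takes the same paddle path (note that `HMM` is not forwarded on that branch). A minimal usage sketch, assuming `enable_paddle()` has already succeeded:

```python
import jieba
import jieba.posseg as pseg

jieba.enable_paddle()
print(pseg.lcut("我爱北京天安门", use_paddle=True))  # list of pair(word, flag) objects
```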


@ -43,8 +43,8 @@ GitHub: https://github.com/fxsjy/jieba
""" """
setup(name='jieba', setup(name='jieba',
version='0.39', version='0.42.1',
description='Chinese Words Segementation Utilities', description='Chinese Words Segmentation Utilities',
long_description=LONGDOC, long_description=LONGDOC,
author='Sun, Junyi', author='Sun, Junyi',
author_email='ccnusjy@gmail.com', author_email='ccnusjy@gmail.com',
@ -71,5 +71,5 @@ setup(name='jieba',
      keywords='NLP,tokenizing,Chinese word segementation',
      packages=['jieba'],
      package_dir={'jieba':'jieba'},
      package_data={'jieba':['*.*','finalseg/*','analyse/*','posseg/*', 'lac_small/*.py','lac_small/*.dic', 'lac_small/model_baseline/*']}
)


@ -96,3 +96,6 @@ if __name__ == "__main__":
    cuttest('AT&T是一件不错的公司给你发offer了吗')
    cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
    cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
    jieba.add_word('超敏C反应蛋白')
    cuttest('超敏C反应蛋白是什么, java好学吗?,小潘老板都学Python')
    cuttest('steel健身爆发力运动兴奋补充剂')

test/test_paddle.py Normal file

@ -0,0 +1,102 @@
#encoding=utf-8
import sys
sys.path.append("../")
import jieba
jieba.enable_paddle()
def cuttest(test_sent):
result = jieba.cut(test_sent, use_paddle=True)
print(" / ".join(result))
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('枪杆子中出政权')
cuttest('张三风同学走上了不归路')
cuttest('阿Q腰间挂着BB机手里拿着大哥大我一般吃饭不AA制的。')
cuttest('在1号店能买到小S和大S八卦的书还有3D电视。')
jieba.del_word('很赞')
cuttest('看上去iphone8手机样式很赞,售价699美元,销量涨了5%么?')

test/test_paddle_postag.py Normal file

@ -0,0 +1,102 @@
#encoding=utf-8
import sys
sys.path.append("../")
import jieba.posseg as pseg
import jieba
jieba.enable_paddle()
def cuttest(test_sent):
result = pseg.cut(test_sent, use_paddle=True)
for word, flag in result:
print('%s %s' % (word, flag))
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('枪杆子中出政权')
cuttest('张三风同学走上了不归路')
cuttest('阿Q腰间挂着BB机手里拿着大哥大我一般吃饭不AA制的。')
cuttest('在1号店能买到小S和大S八卦的书还有3D电视。')


@ -6,7 +6,7 @@ from whoosh.index import create_in,open_dir
from whoosh.fields import *
from whoosh.qparser import QueryParser
from jieba.analyse.analyzer import ChineseAnalyzer
analyzer = ChineseAnalyzer()