Compare commits

...

461 Commits

Author SHA1 Message Date
Neutrino
67fa2e36e7
Update README.md update paddle link. (#817) 2020-02-15 16:33:35 +08:00
fxsjy
1e20c89b66 fix setup.py in python2.7 2020-01-20 22:22:34 +08:00
fxsjy
5704e23bbf update version: 0.42 2020-01-13 21:24:45 +08:00
fxsjy
aa65031788 fix file mode 2020-01-13 21:03:38 +08:00
fxsjy
2eb11c8028 fix issue #810 2020-01-13 20:53:43 +08:00
JesseyXujin
d703bce302 paddle coredump exception fix (#807)
* paddle_null_point_fix

* add core exception note

* delete yield

* modify test paddle for supporting enable_paddle()
2020-01-10 16:30:46 +08:00
vissssa
dc2b788eb3 refactor: improve check_paddle_installed (#806) 2020-01-09 19:23:11 +08:00
fxsjy
0868c323d9 update version in __init__.py 2020-01-08 16:21:07 +08:00
fxsjy
eb37e048da update version to 0.41 2020-01-08 16:04:30 +08:00
JesseyXujin
381b0691ac Add enable_paddle interface to install paddle and import packages (#802)
* enable_paddle_interface

* Add enable_paddle interface to install paddle and import packages

* Add enable_paddle interface to install paddle and import packages

* add posseg lcut for paddle mode

* fix vocabulary
2020-01-08 15:26:12 +08:00
fxsjy
97c32464e1 fix issue #798 2020-01-03 14:10:48 +08:00
Tim Gates
0489a6979e Fix simple typo: vocabuary -> vocabulary (#797)
Closes #796
2020-01-02 10:26:00 +08:00
JesseyXujin
30ea8f929e Simplify Paddle import check (#795) 2019-12-31 15:03:14 +08:00
JesseyXujin
0b74b6c2de add jieba upgrade note in README.md and change import imp to import importlib in _compat.py (#794) 2019-12-31 14:14:50 +08:00
Sun Junyi
2fdee89883
Update README.md 2019-12-30 17:11:22 +08:00
JesseyXujin
17bab6a2d1 修改paddle版本检测报错机制 (#790) 2019-12-25 19:46:49 +08:00
Sun Junyi
80947ff843
Update Changelog 2019-12-25 10:49:02 +08:00
fxsjy
68ce6955b7 update version to 0.40 2019-12-25 10:35:22 +08:00
fxsjy
d47e14e5b3 update version 2019-12-25 10:34:18 +08:00
pkpk
27910094ac Fix bugs in Paddle seg and Paddle postag (#789)
* fix bugs in paddle seg and paddle postag

* fix compat in checking paddle
2019-12-24 21:02:55 +08:00
Sun Junyi
9dc8e6d992
Update README.md 2019-12-24 19:29:17 +08:00
fxsjy
478c3b9bb4 lazy import paddle 2019-12-24 19:19:51 +08:00
JesseyXujin
5b3bb4b7f2 加入paddle分词和词性标注功能 (#788)
* paddle cut release

* 修改README.md,提示用户安装paddlepaddle.tiny

* 删除两个init.py文件中utf头文件

* 修改readme细节
2019-12-24 17:27:41 +08:00
Hongxiang Lin
38134ee20f 修复suggest_freq中add_word指向的bug (#723) 2019-07-01 19:43:45 +08:00
Paul Meng
3645a5bb5d Update README.md (#745) 2019-07-01 19:41:47 +08:00
Sun Junyi
8212b6c572
Update README.md 2018-12-03 16:29:32 +08:00
Sun Junyi
843cdc2b7c
Merge pull request #582 from hosiet/pr-fix-typo-codespell
Fix typos found by codespell
2018-09-20 10:44:47 +08:00
Sun Junyi
68f2a64f7e
Merge pull request #663 from JimCurryWang/patch-1
Fix  __init__ "-" symbol issue
2018-09-20 10:40:35 +08:00
Sun Junyi
4c8479cfa6
Merge pull request #667 from ZhengZixiang/patch-1
fix the error about importing ChineseAnalyzer
2018-09-20 10:39:29 +08:00
imzhengzx
ca444fb4da
fix the error about importing ChineseAnalyzer
Because of the interface change to ChineseAnalyzer, the line 'from jieba.analyse import ChineseAnalyzer' in this test file raises an ImportError like 'cannot import name ChineseAnalyzer'. Changing the import to 'from jieba.analyse.analyzer import ChineseAnalyzer' fixes it.
2018-09-15 11:59:01 +08:00
CY Wang
36a27302ce
Fix __init__ "-" symbol issue
Solving "-" symbol can't be analyze issue . 

For example,
In keyword , chap-EX喬沛詩 , SK-II  ...etc 
the present version will show "chap", "-", "EX喬沛詩" , "SK", "-", "II"

After the modify,
The new version will show  "chap-EX","喬沛詩" , "SK-II" 

ps: I have used the jieba.load_userdict() , and added  "chap-EX" , "喬沛詩", "SK-II" in the userdict.txt.
2018-08-27 17:05:46 +08:00
Sun Junyi
7653db2e33
Update README.md 2018-07-04 17:18:02 +08:00
Boyuan Yang
17ef8abba3
Fix typos found by codespell 2018-01-21 19:15:48 +08:00
fxsjy
cb0de2973b version change 0.39 2017-08-28 21:40:18 +08:00
sunjunyi01
b4dd5b58f3 bug fix, issue: #511, #512 2017-08-28 21:10:50 +08:00
Sun Junyi
4eef868338 Merge pull request #455 from OOCZC/master
Update README.md
2017-04-06 15:22:01 +08:00
OOC
b485ae916c Update README.md 2017-04-04 11:45:53 +08:00
OOC
ee0ce32bbd Update 2017-04-04 11:17:44 +08:00
Sun Junyi
8ba26cf97e Merge pull request #382 from huntzhan/master
Bugfix for HMM=False in parallelism.
2016-08-05 10:02:41 +08:00
huntzhan
60acefd9b1 Bugfix for HMM=False in parallelism. 2016-08-04 17:43:35 +08:00
Sun Junyi
03cd4b5fb6 Merge pull request #367 from yanyiwu/patch-1
Update README.md
2016-06-12 09:37:16 +08:00
Yanyi Wu
76ae798137 Update README.md 2016-06-10 22:48:01 +08:00
Sun Junyi
0243d568e9 Merge pull request #351 from gumblex/master
fix del_word
2016-03-16 10:22:34 +08:00
Dingyuan Wang
12b2b17741 fix del_word 2016-03-15 18:58:12 +08:00
fxsjy
1d5ea9f061 version change 0.38 2015-12-16 16:12:49 +08:00
Sun Junyi
e5c9af78e2 Merge pull request #315 from gumblex/master
命令行分词支持词性标注
2015-11-17 19:13:36 +08:00
Dingyuan Wang
87734d3785 support POS tagging in __main__ 2015-11-17 19:06:44 +08:00
Sun Junyi
3d29b0c8e8 Merge pull request #310 from gumblex/master
Fix compatibility problem with `with` statement
2015-11-13 14:22:50 +08:00
Dingyuan Wang
1fcd3a417c fix compatibility problem with the with statement 2015-11-13 13:16:19 +08:00
Sun Junyi
093980647b Merge pull request #303 from jerryday/master
add a withFlag param to extract_tags
2015-11-13 10:19:53 +08:00
Sun Junyi
f73a2183a5 Merge pull request #309 from gumblex/master
用 pkg_resources 载入默认字典
2015-11-13 10:18:50 +08:00
Dingyuan Wang
8814e08f9b load default dictionary from pkg_resources and improve the loading method;
change the serialized models from marshal to pickle
2015-11-12 20:18:09 +08:00
Sun Junyi
70f019b669 Merge pull request #307 from gumblex/master
扩充汉字范围;修正 load_userdict
2015-11-09 22:12:23 +08:00
Dingyuan Wang
5270ed66ff fix typo in type detection in load_userdict 2015-11-09 21:37:29 +08:00
Dingyuan Wang
99d0fb1a8a use regex and fix encoding related issues in load_userdict 2015-11-09 20:54:50 +08:00
Dingyuan Wang
1c33252fce change the recognized Chinese character range to [\u4E00-\u9FD5] 2015-11-09 20:23:43 +08:00
jerryday
e5e41a4aad fix pair object in dict problem 2015-10-30 16:38:50 +08:00
jerryday
4f8ca83661 add a withFlag param in textrank 2015-10-30 15:40:41 +08:00
jerryday
26e339f8f7 add a withFlag param to extract_tags 2015-10-30 11:09:24 +08:00
Sun Junyi
b6f1ce773e Merge pull request #298 from anderscui/master
Add introduction to jieba.NET port.
2015-09-23 06:54:56 +08:00
andersc
343bfe9783 Add introduction to jieba.NET port. 2015-09-22 23:16:02 +08:00
fxsjy
cb414cb861 version update 2015-06-27 16:49:44 +08:00
Sun Junyi
8e99a13aa9 Merge pull request #275 from gumblex/master
防止跨文件系统创建缓存
2015-06-26 23:22:42 +08:00
Dingyuan Wang
d0e68974bf improved doc for tmp_dir and cache_file 2015-06-26 22:24:20 +08:00
Dingyuan Wang
66fe17517d prevent moving across different filesystems at tempfile.mkstemp 2015-06-26 22:12:39 +08:00
Dingyuan Wang
be46ddef9a use shutil.move for all platforms in case of different filesystems 2015-06-26 21:52:53 +08:00
Sun Junyi
17652e764f Merge pull request #271 from gumblex/master
修复 cut_for_search;改善 pair 对象
2015-06-01 18:40:31 +08:00
Dingyuan Wang
ceb5c26be4 fix self.FREQ in cut_for_search; make pair object iterable 2015-06-01 14:36:38 +08:00
Sun Junyi
9f4d9376b0 Merge pull request #269 from gumblex/master
自定义字典允许指定词性同时省略词频
2015-05-24 19:56:51 +08:00
Dingyuan Wang
3b76328f2a allow ignoring word frequency while providing pos tag 2015-05-23 21:51:00 +08:00
Sun Junyi
3ec4c43788 Merge pull request #260 from gumblex/master
使用类包装全局函数
2015-05-11 10:24:49 +08:00
Dingyuan Wang
94840a734c wraps most globals in classes
API changes:
* class jieba.Tokenizer, jieba.posseg.POSTokenizer
* class jieba.analyse.TFIDF, jieba.analyse.TextRank
* global functions are mapped to jieba.(posseg.)dt, the default (POS)Tokenizer
* multiprocessing only works with jieba.(posseg.)dt
* new lcut, lcut_for_search functions that returns a list
* jieba.analyse.textrank now returns 20 items by default

Tests:
* added test_lock.py to test multithread locking
* demo.py now contains most of the examples in README
2015-05-09 21:29:05 +08:00
Sun Junyi
e359d08964 Merge pull request #257 from gip0/gip0-patch-1
fixed an error in load_userdict()
2015-05-02 17:27:16 +08:00
Gilbert Liu
f6e57ab2ae fixed an error in load_userdict() 2015-05-01 12:52:28 -07:00
Sun Junyi
60f0028175 Merge pull request #252 from fukuball/master
更新 README
2015-04-28 22:42:40 +08:00
Fukuball Lin
e712a4de61 更新 README
增加结巴分词 PHP 版本相關資訊
2015-04-28 22:05:26 +08:00
fxsjy
29d2b838dc a minor version on pypi, which removes *.pyc 2015-04-17 19:35:12 +08:00
fxsjy
c07b7fef54 hot-fix version for pull request #248 2015-04-10 18:54:51 +08:00
Sun Junyi
753c1be49c Merge pull request #248 from wangbin/master
exclude word fragments from FREQ in posseg.cut
2015-04-02 15:32:41 +08:00
Wang Bin
84ffa0d4bf exclude word fragments from FREQ 2015-04-02 11:06:55 +08:00
Sun Junyi
885417aed1 Merge pull request #247 from gumblex/master
更新文档
2015-03-21 17:05:05 +08:00
Dingyuan Wang
eeaab012bf update docs 2015-03-21 10:53:42 +08:00
fxsjy
89481cfd84 version update 0.36 2015-03-20 11:00:55 +08:00
Sun Junyi
59aa8b69b1 Merge pull request #246 from gumblex/master
增加自动词频
2015-03-16 10:10:53 +08:00
Dingyuan Wang
4fa2728fb6 update README about new features 2015-03-14 12:44:49 +08:00
Dingyuan Wang
4a552ca94f suggest word frequency, support passing str to add_word 2015-03-14 12:44:19 +08:00
Sun Junyi
1b4721ebb8 Merge pull request #179 from changyy/master
新增自訂 cache_file 產生的目錄位置,可支援 jieba 運行在 Read-Only File System,如: Embedded Linux、Google App Engine 和 Heroku 等
2015-02-28 10:05:52 +08:00
Yuan-Yi Chang
62433a3205 讓 jieba 可以自行指定 cache_file 產生的目錄位置,提供 jieba 在 Read-only file system 環境中運行
1.在呼叫 jieba.cut() 等相關動作前,先透過 jieba.tmp_dir 指定目錄位置
2.當應用環境為 Read-Only File System,可透過預先產生 cache_file 的機制,讓 jieba 正常運行
3.實際案例為 Google App Engine 和 Heroku,其中前者免費版僅 128MB 記憶體空間無法運行,後者免費環境有 512MB 可正常運行。發佈前,先在本地端產生 cache_file 後,連同 cache_file 一併發佈至 Google App Engine 或 Heroku 環境上即可使用。
2015-02-27 17:25:59 +08:00
Sun Junyi
4b4aff6d89 Merge pull request #242 from gumblex/master
textrank 细节问题;文档更新
2015-02-17 14:57:27 +08:00
Dingyuan Wang
f29430f49e details in textrank; update README 2015-02-16 21:25:55 +08:00
Sun Junyi
a4fb439070 Merge pull request #241 from sing1ee/master
improve some details based on other committers' advice
2015-02-16 20:41:06 +08:00
zhangcheng
01b7f6efcf improve some details based on other committers' advice 2015-02-16 20:35:45 +08:00
Sun Junyi
4e05cde07e Merge pull request #240 from sing1ee/master
build stable sort for graph iteration
2015-02-16 20:28:22 +08:00
zhangcheng
8b8c6c85d0 remove unused import 2015-02-16 15:51:05 +08:00
zhangcheng
a6d1b2479e build stable sort for graph iteration, so we can get a stable result and adapt details for Python 3 2015-02-16 15:49:10 +08:00
zhangcheng
1152db7736 build stable sort for graph iteration, so we can get a stable result. 2015-02-16 15:46:36 +08:00
fxsjy
49657c976d make extract_tags behavior compatible with previous version 2015-02-14 21:23:58 +08:00
fxsjy
abcaf3e475 fix bug: load_userdict 2015-02-14 19:56:38 +08:00
Jack
a06b7d388e fix bug in __main__.py 2015-02-12 14:08:39 +08:00
Sun Junyi
9ca5b69907 Merge pull request #238 from gumblex/master
use str.splitlines to avoid losing line breaks
2015-02-12 13:35:52 +08:00
Dingyuan Wang
f2b7183a71 use str.splitlines to avoid losing line breaks 2015-02-12 12:39:14 +08:00
Sun Junyi
b14eb329e3 Merge pull request #237 from gumblex/master
直接将前缀储存在词频字典里
2015-02-12 11:27:25 +08:00
Dingyuan Wang
872a7039f2 Merge branch 'master' of https://github.com/fxsjy/jieba 2015-02-12 10:33:56 +08:00
Dingyuan Wang
f808ea0ebb use only one dict to store words and prefixes 2015-02-12 10:31:52 +08:00
fxsjy
4d7b515801 Merge branch 'master' of https://github.com/fxsjy/jieba 2015-02-11 20:57:35 +08:00
fxsjy
5bfa43a781 fix test scripts 2015-02-11 20:46:48 +08:00
Dingyuan Wang
f3a53dd2da fix print() in tests 2015-02-11 20:45:55 +08:00
Sun Junyi
a229041e58 Merge pull request #234 from yanyiwu/patch-2
Update README.md
2015-02-11 18:48:47 +08:00
Yanyi Wu
5d321cbccd Update README.md 2015-02-11 17:37:32 +08:00
fxsjy
8cbb26a7b6 fix test_file.py 2015-02-11 16:47:57 +08:00
Sun Junyi
41b47b0593 Merge pull request #233 from gumblex/master
合并 jieba3k,兼容 Python 2/3
2015-02-11 15:44:22 +08:00
Dingyuan Wang
32a0e92a09 don't compile re every time; autopep8 2015-02-10 21:22:34 +08:00
Dingyuan Wang
22bcf8be7a Merge master and jieba3k, make the code Python 2/3 compatible 2015-02-10 20:54:55 +08:00
Sun Junyi
caae26fbfa Merge pull request #231 from gumblex/master
在 FREQ 中直接储存频数
2015-02-09 16:50:43 +08:00
Dingyuan Wang
4197dfb8fa store int directly in FREQ; small improvements 2015-02-09 16:26:00 +08:00
Dingyuan Wang
765fd6b7f0 store int directly in FREQ; small improvements 2015-02-09 16:14:12 +08:00
Sun Junyi
c95f402e2b Merge pull request #214 from aszxqw/master
add iosjieba
2014-12-25 10:09:35 +08:00
yanyiwu
1d91072498 add iosjieba 2014-12-24 23:02:06 +08:00
Sun Junyi
852a07c4f2 Merge pull request #211 from gumblex/jieba3k
修复 posseg 中 pair 类 repr 返回值 (jieba3k)
2014-12-20 18:35:43 +08:00
Dingyuan Wang
7bcb128f5f fix textrank divided by zero; fix posseg.pair.__repr__ 2014-12-20 00:12:42 +08:00
Sun Junyi
b08c3f8ed7 Merge pull request #205 from lynschinzer/master
Fix divided by zero issue in case of words are not found in dict.
2014-12-05 20:13:51 +08:00
Lin
fea3aec6bd Fix divided by zero issue in case of words are not found in dict. 2014-12-05 17:13:12 +08:00
Sun Junyi
8be082017a Merge pull request #204 from gumblex/jieba3k
完善setup.py等对应py3k更新
2014-11-29 18:28:48 +08:00
Sun Junyi
293dbbc390 Merge pull request #203 from gumblex/master
修复 posseg;完善 setup.py
2014-11-29 18:28:23 +08:00
Dingyuan Wang
3dad899ec8 backport 2to3 scripts and changelog 2014-11-29 16:12:25 +08:00
Dingyuan Wang
c6b386f65b update jieba3k 2014-11-29 16:06:20 +08:00
Dingyuan Wang
7b7c6955a9 complete the setup.py, fix #202 problem in posseg 2014-11-29 15:33:42 +08:00
Sun Junyi
8a2e7f0e7e Merge pull request #202 from nomaka/patch-1
Update __init__.py
2014-11-18 16:46:59 +08:00
Nomaka
9cb76dd8b9 Update __init__.py
calc的idx参数没用
2014-11-18 16:00:49 +08:00
Sun Junyi
99748bfc17 Merge pull request #201 from skyerown/master
为关键字提取函数增加词性过滤功能
2014-11-18 10:27:52 +08:00
walkskyer
a336e26403 为函数textrank增加参数allowPOS,并修改extract_tags的参数allowPOS与textrank保持一致。 2014-11-15 18:36:09 +08:00
walkskyer
bab5f362ba 将exstract_tags参数allowPOS转换为frozenset以减少查找时间。 2014-11-15 18:14:47 +08:00
Dingyuan Wang
6b0da06481 merge from upstream 2014-11-15 14:06:03 +08:00
fxsjy
5c487dbcba update version 2014-11-15 13:46:27 +08:00
fxsjy
447c1ded8c fix problem for python3.2 2014-11-15 13:44:30 +08:00
walkskyer
dd62477605 .gitignore中忽略pycharm项目文件 2014-11-15 13:33:13 +08:00
Dingyuan Wang
a5ecf70f71 update to v0.35 2014-11-14 20:59:54 +08:00
walkskyer
d82d2c18df 为关键字提取函数增加词性过滤功能 2014-11-13 22:26:22 +08:00
fxsjy
315a411e52 version update 2014-11-13 10:43:43 +08:00
fxsjy
ec68c21ea0 version update 2014-11-13 10:27:50 +08:00
Sun Junyi
3eea28d6f4 Merge pull request #200 from skyerown/master
修复stop words处理未考虑"\r"导致不能正常匹配的问题。
2014-11-13 10:10:07 +08:00
walkskyer
5571a0337a 修复stop words处理未考虑"\r"导致不能正常匹配的问题。 2014-11-12 22:33:27 +08:00
Sun Junyi
40c0edfd99 Merge pull request #198 from gumblex/jieba3k
Jieba3k 对应更新;半自动转换脚本
2014-11-08 22:17:51 +08:00
Dingyuan Wang
4a6140081e fix problems in auto2to3 2014-11-07 23:47:57 +08:00
Dingyuan Wang
7a6caa0c3c port extract_tags, etc to jieba3k; add auto2to3 script 2014-11-07 23:33:31 +08:00
walkskyer
36bc9e18c6 Merge pull request #1 from fxsjy/master
pull
2014-11-07 21:35:22 +08:00
Sun Junyi
7ce63e53b7 Merge pull request #197 from skyerown/master
修复带权重测试脚本输出结果是调用顺序错误
2014-11-07 11:07:19 +08:00
walkskyer
6772f0282e 修复带权重测试脚本输出结果是调用顺序错误 2014-11-06 22:24:43 +08:00
Sun Junyi
a5944bb88e Merge pull request #196 from qinwf/master
Add jiebaR in README
2014-11-04 12:29:42 +08:00
Qin Wenfeng
77a831b8c1 Add jiebaR in README 2014-11-04 11:59:40 +08:00
Sun Junyi
cf2aa88122 Merge pull request #195 from gumblex/master
统一获取关键词接口,优化缓存命名
2014-11-01 12:54:57 +08:00
Dingyuan Wang
751ff35eb5 improve extract_tags; unify extract_tags and textrank 2014-10-31 23:15:51 +08:00
Dingyuan Wang
e3f3dcccba improve the loading and caching process 2014-10-31 21:56:09 +08:00
Sun Junyi
4cb1924d09 Merge pull request #193 from gumblex/jieba3k
jieba3k 对应更新 #192
2014-10-25 15:29:49 +08:00
Sun Junyi
d6ef07a472 Merge pull request #192 from gumblex/master
更新、完善说明;命令行加入自定义词典功能
2014-10-25 15:29:26 +08:00
Dingyuan Wang
fd9f1f2c0e update README, textrank, etc. 2014-10-25 14:23:37 +08:00
Dingyuan Wang
9d2818b440 fix English part of README 2014-10-25 13:16:30 +08:00
Dingyuan Wang
31b7d11809 improve README 2014-10-25 13:12:19 +08:00
Dingyuan Wang
a6119cc995 add custom dictionary to __main__; update README; slightly optimize textrank 2014-10-25 12:59:36 +08:00
Sun Junyi
0049b0c5b4 Merge pull request #191 from sing1ee/master
add some introduction of textrank
2014-10-24 22:50:36 +08:00
zhangcheng
138d713e98 add some introduction of textrank 2014-10-24 22:41:51 +08:00
Sun Junyi
4030d8ed86 Merge pull request #190 from sing1ee/master
add a simple implementation of textrank
2014-10-24 22:20:05 +08:00
zhangcheng
6eb9f6149c add a simple implementation of textrank 2014-10-24 21:15:54 +08:00
Sun Junyi
1850bd6d37 Update README.md 2014-10-24 20:23:10 +08:00
fxsjy
f5ca87e088 merge change of @fukuball 2014-10-23 15:59:08 +08:00
Sun Junyi
10b86e90fb Update README.md 2014-10-21 12:53:37 +08:00
fxsjy
ba87fcb01f remove trie, use prefix set instead 2014-10-20 14:08:09 +08:00
fxsjy
82bfffb6ed version update to 0.34 2014-10-20 13:35:13 +08:00
Sun Junyi
56e8336af1 Merge pull request #188 from gumblex/jieba3k
不用Trie,同#187
2014-10-19 19:43:48 +08:00
Sun Junyi
4a93f21918 Merge pull request #187 from gumblex/master
不用Trie,减少内存加快速度;优化代码细节
2014-10-19 19:43:30 +08:00
Dingyuan Wang
bb1e6000c6 fix version; fix spaces at end of line 2014-10-19 10:57:46 +08:00
Dingyuan Wang
14671d4feb fix __main__.py 2014-10-19 10:41:09 +08:00
Dingyuan Wang
b367690eeb use prefix dict instead of trie, add a command line interface, and a few small improvements 2014-10-19 10:32:23 +08:00
Dingyuan Wang
51df77831b use prefix dict instead of trie, add a command line interface, and a few small improvements 2014-10-18 22:23:26 +08:00
fxsjy
eb98eb9248 fix performance problem of extract_tags 2014-10-10 15:41:28 +08:00
Sun Junyi
7f965e0aa3 Merge pull request #184 from keroro520/master
fix issues 125 (https://github.com/fxsjy/jieba/issues/125)
2014-09-12 17:43:43 +08:00
keroro520
77b442fa88 fix issues (https://github.com/fxsjy/jieba/issues/125) 2014-09-12 13:42:05 +08:00
Sun Junyi
8f52419386 Merge pull request #183 from gumblex/jieba3k
Jieba3k update to v0.33
2014-09-09 10:52:31 +08:00
Dingyuan Wang
626b415152 fix dict.itervalues mistake 2014-09-07 19:21:13 +08:00
Dingyuan Wang
6a3f228c72 fix python3 stuff 2014-09-07 18:50:10 +08:00
Dingyuan Wang
b16cf0d63f fix indent typo 2014-09-06 23:37:54 +08:00
Dingyuan Wang
6fad5fbb2c update to v0.33 2014-09-06 23:28:47 +08:00
Sun Junyi
fc511de012 Merge pull request #176 from fukuball/master
更新 jieba 可以切換 idf 語料庫及 stop words 語料庫的說明
2014-09-01 14:11:00 +08:00
Sun Junyi
99ea59e88d Update README.md 2014-08-31 20:04:02 +08:00
fxsjy
6eb43acc10 pip install jieba3k 2014-08-31 20:01:54 +08:00
fxsjy
40adb1c591 version 0.33 2014-08-31 19:26:26 +08:00
Fukuball Lin
d432789cb4 fix typo 2014-08-06 17:56:05 +08:00
Fukuball Lin
cf31a99bf6 將 Readme 中文和半形的英文、數字、符號之間插入空白
將 Readme 中文和半形的英文、數字、符號之間插入空白,增加可讀性
2014-08-06 15:53:57 +08:00
Fukuball Lin
e4d323c78b 更新 jieba 可以切換 idf 語料庫及 stop words 語料庫的說明
更新 jieba 可以切換 idf 語料庫及 stop words 語料庫的說明
2014-08-06 15:00:07 +08:00
Sun Junyi
16d626d347 Merge pull request #174 from fukuball/master
讓 jieba 可以切換 idf 語料庫及 stop words 語料庫
2014-08-06 10:36:10 +08:00
Fukuball Lin
b658ee69cb 讓 jieba 可以自行增加 stop words 語料庫
1. 增加範例 stop words 語料庫
2. 為了讓 jieba 可以切換 stop words 語料庫,新增 set_stop_words 方法,並改寫 extract_tags
3. test 增加 extract_tags_stop_words.py 測試範例
2014-08-06 03:35:16 +08:00
Fukuball Lin
7198d562f1 讓 jieba 可以切換 idf 語料庫
1. 新增繁體中文 idf 語料庫
2. 為了讓 jieba 可以切換 iff 語料庫,新增 get_idf, set_idf_path 方法,並改寫 extract_tags
3. test 增加 extract_tags_idfpath
2014-08-05 22:55:13 +08:00
Sun Junyi
91e5b26f5f Merge pull request #165 from gumblex/jieba3k
fix the u'xxx' string.
2014-06-22 10:23:58 +08:00
Dingyuan Wang
8b07bce568 fix the u'xxx' string. 2014-06-21 23:30:06 +08:00
Sun Junyi
0d99ebce54 Merge pull request #164 from gumblex/jieba3k
Jieba3k v0.32 update
2014-06-15 19:14:28 +08:00
Dingyuan Wang
c04ccd0d12 Update to v0.32 according to the master branch. 2014-06-14 22:31:13 +08:00
Dingyuan Wang
81f77d7a08 Fix the re in enable_parallel. 2014-06-14 15:22:13 +08:00
Sun Junyi
473ac1df75 Merge pull request #162 from ShuraChow/master
fix issue #161
2014-06-11 17:04:23 +08:00
ShuraChow
7583f7760a fix issue #161
posseg每次根据jieba.user_word_tag_tab的长度判断是否有新词载入,如果有,则更新word_tag_tab,然后清空jieba.user_word_tag_tab
2014-06-10 02:04:09 +08:00
Sun Junyi
2726a7c89b Merge pull request #158 from davidlihm/patch-1
Thanks
2014-05-15 10:11:03 +08:00
davidlihm
5b2ec920ed Update __init__.py 2014-05-15 07:55:11 +08:00
Sun Junyi
5574304a9e Merge pull request #152 from jagt/jieba3k
close cache file to avoid warning message.
2014-04-29 11:16:41 +08:00
jagt
7f3513edb7 close cache file to avoid warning message. 2014-04-24 00:35:09 +08:00
Sun Junyi
28621e8b00 Update README.md 2014-04-17 13:47:47 +08:00
Sun Junyi
1f144ebf55 Merge pull request #141 from windch/jieba3k
use logging instead of print in __init__ file of py3k branch
2014-03-20 10:27:52 +08:00
wind
7488b114e7 use logging instead of print in init file 2014-03-20 13:48:33 +13:00
fxsjy
2682e887b8 Merge branch 'master' of https://github.com/fxsjy/jieba 2014-03-02 17:52:52 +08:00
fxsjy
9d4ac26f16 fix the bug of issue#137 2014-03-02 17:52:19 +08:00
Sun Junyi
6942795fae Merge pull request #135 from aszxqw/patch-1
add nodejieba into README.md
2014-02-26 14:13:00 +08:00
Yanyi Wu
ccfa54530e add nodejieba into README.md
add nodejieba into README.md
2014-02-26 14:05:13 +08:00
Sun Junyi
3e430e9769 Update __init__.py 2014-02-16 20:09:57 +08:00
Sun Junyi
6946b00f14 Merge pull request #134 from Honghe/master
Fix a bug where ChineseAnalyzer cannot be imported
2014-02-16 20:08:42 +08:00
Honghe Wu
7720fbc1d8 fix a bug where ChineseAnalyzer could not be imported; change tabs to 4 white spaces per PEP8 2014-02-15 19:32:29 +08:00
fxsjy
cc708de40c version 0.32 released 2014-02-07 15:22:53 +08:00
fxsjy
dafc73425e fix a little problem of dict.txt 2014-02-07 14:35:38 +08:00
fxsjy
7cc7e70843 Merge branch 'master' of https://github.com/fxsjy/jieba 2014-01-28 13:48:35 +08:00
fxsjy
18678d50c6 fix bug issue #132 2014-01-28 13:48:03 +08:00
Sun Junyi
62240c5add Merge pull request #131 from aholic/master
better indent
2014-01-25 18:17:50 -08:00
aholic
e2c796088f better indent 2014-01-24 00:43:48 +08:00
fxsjy
5e6a2c4661 fix a bug of add_word 2013-12-05 13:35:40 +08:00
fxsjy
136676381a fix a bug of add_word 2013-12-05 13:33:24 +08:00
Sun Junyi
e79d54b380 Merge pull request #114 from hermanschaaf/patch-1
Fix typo in error message
2013-10-23 03:41:20 -07:00
Herman Schaaf
95286b8887 Fix typo in error message 2013-10-21 22:21:09 +09:00
fxsjy
14a0ab0466 fix a bug in issue #111 2013-10-11 13:05:59 +08:00
fxsjy
759e1029c8 add an API to control log level: jieba.setLogLevel 2013-09-22 10:26:33 +08:00
Sun Junyi
2ef9dd3a70 Merge pull request #107 from mozillazg/logging
use logging instead of print
2013-09-21 18:54:34 -07:00
Mozillazg
1cf3f0d00b use logging instead of print 2013-09-19 10:31:44 +08:00
Sun Junyi
fd96527f71 Merge pull request #106 from jannson/master
add better support for english for ChineseAnalyzer
2013-09-16 23:58:46 -07:00
Sun Junyi
6a66620088 Update README.md 2013-09-14 22:32:45 +08:00
Sun Junyi
00bc72c877 Update README.md 2013-09-14 22:31:38 +08:00
gan
31d5845535 add better support for English. Example input: 'this is interesting and interested me' --> output: 'this interest interest', where 'interest' matches 'interesting' and 'interested' 2013-09-09 11:54:30 +08:00
Sun Junyi
7e7fcc1184 add an option to disable HMM 2013-09-05 17:09:27 +08:00
fxsjy
21f7da0ca4 convert tabs to spaces 2013-08-30 18:31:25 +08:00
fxsjy
c5bd9773d1 fix bug in issue #103 2013-08-30 18:26:53 +08:00
Sun Junyi
0125548a37 Merge pull request #101 from ZoeyYoung/jieba3k
Jieba3k
2013-08-21 18:34:22 -07:00
Zoey Young
510a3d6bed Merge pull request #1 from fxsjy/jieba3k
拉取
2013-08-21 04:47:12 -07:00
ZoeyYoung
25839b5127 fix bug 2013-08-21 19:46:14 +08:00
ZoeyYoung
ebd40ed65e Merge branch 'jieba3k' of https://github.com/ZoeyYoung/jieba into jieba3k 2013-08-21 19:31:30 +08:00
ZoeyYoung
d49542c06e fix bug 2013-08-21 19:31:12 +08:00
ZoeyYoung
6024497917 更新 2013-08-21 19:24:34 +08:00
Sun Junyi
835e68c585 fix bug of merge pull request 2013-08-21 16:01:49 +08:00
Sun Junyi
d16727ba89 Merge pull request #100 from ZoeyYoung/jieba3k
Jieba3k
2013-08-21 00:50:47 -07:00
ZoeyYoung
dce353f88b merge from master 2013-08-21 15:32:46 +08:00
ZoeyYoung
2857ae45cc Merge branch 'master' into jieba3k
Conflicts:
	Changelog
	jieba/__init__.py
	jieba/finalseg/__init__.py
	jieba/posseg/__init__.py
	setup.py
	test/parallel/test_file.py
	test/test_file.py
2013-08-21 13:55:21 +08:00
Sun Junyi
66e334229b Merge pull request #99 from aszxqw/branch1
sed -i 's/not \(.*\) in/\1 not in/g' ...
2013-08-20 18:33:39 -07:00
gwdwyy
cc81135429 sed -i 's/not \(.*\) in/\1 not in/g' ... 2013-08-20 20:08:03 +08:00
Sun Junyi
efebf5371c Merge branch 'master' of https://github.com/fxsjy/jieba 2013-08-09 13:59:38 +08:00
Sun Junyi
90ab511deb fix the bug about issue: #92 2013-08-09 13:59:02 +08:00
Sun Junyi
92c6c3d9cd Update README.md 2013-08-06 13:26:53 +08:00
Sun Junyi
0bb2ddcc1b Update README.md 2013-08-06 11:05:19 +08:00
Sun Junyi
81390a2d23 test_file.py: close the file object 2013-08-02 15:51:33 +08:00
Sun Junyi
3667a4ab01 include Changelog & README.md in the distribution package 2013-07-29 13:19:39 +08:00
Sun Junyi
33089138fd Merge branch 'master' of https://github.com/fxsjy/jieba 2013-07-29 12:48:04 +08:00
Sun Junyi
d0578ad99b add a license file 2013-07-29 12:47:47 +08:00
Sun Junyi
d97c1d584c 0.31 released
pypi update
2013-07-29 10:31:52 +08:00
fxsjy
b77645b3aa modify test_file.py; use less memory 2013-07-29 10:17:39 +08:00
fxsjy
ed1fa64e27 fix a bug: sys.version_info.major can't be used in Python 2.5 2013-07-29 10:07:55 +08:00
Sun Junyi
0f972df0ac raise exception in case of lower version 2013-07-29 10:01:47 +08:00
Sun Junyi
e68bb5a28e fix a compatibility problem; Python 2.5 has no 'multiprocessing' 2013-07-29 09:57:09 +08:00
Sun Junyi
689e27280a Merge branch 'master' of https://github.com/fxsjy/jieba 2013-07-29 09:49:10 +08:00
Sun Junyi
9d87e798fd 0.31 release 2013-07-29 09:48:53 +08:00
Sun Junyi
4fad12017e Merge pull request #84 from linkerlin/master
自动检测CPU数目,启动合适数目的进程。
2013-07-27 20:35:03 -07:00
Linker Lin
5d83855088 自动检测CPU数目,启动合适数目的进程。 2013-07-28 00:12:00 +08:00
Linker Lin
1dbc525dff 自动检测CPU数目,启动合适数目的进程。 2013-07-28 00:10:27 +08:00
Linker Lin
2ceb981da0 自动检测CPU数目,启动合适数目的进程。 2013-07-28 00:07:29 +08:00
fxsjy
8e9b4bbe72 fix the compatibility with Python2.5 2013-07-25 10:25:24 +08:00
Sun Junyi
d4ede0fee6 hold the backward compatibility, let jython use a special loading workflow 2013-07-25 10:08:58 +08:00
Sun Junyi
8757148d51 Merge pull request #81 from piaolingxue/issue/80/jython_integration
serialize model to file so that it can support jython.
2013-07-24 17:00:05 -07:00
piaolignxue
aea8496b1f serialize model to file so that it can support jython. 2013-07-24 22:50:48 +08:00
Sun Junyi
6549deabbd merge change from master 2013-07-16 11:06:41 +08:00
Sun Junyi
d691d91674 fix a bug about ImportError 2013-07-15 09:32:52 +08:00
Sun Junyi
d63140fe5e make a serial white spaces seperated 2013-07-10 17:27:47 +08:00
Sun Junyi
a1ad2cbd55 Merge pull request #75 from chao787/feature_richard
Refactoring jieba/__init__.py
2013-07-10 01:34:43 -07:00
Richard Wong
c2ded83ead Refactor: fix line indent to 4.
* jieba/__init__.py (cut):
2013-07-10 16:22:49 +08:00
Richard Wong
99d2492d67 Add re.U flag to re variable. 2013-07-10 16:22:17 +08:00
Richard Wong
fbfaac2eaa Reindent function
* jieba/__init__.py (require_initialized):
2013-07-08 13:54:36 +08:00
Richard Wong
7bfd432fc5 Remove the unused imports. 2013-07-08 13:51:39 +08:00
Sun Junyi
7334bedf5c Merge pull request #74 from chgwei/jieba3k
Jieba3k
2013-07-05 20:34:12 -07:00
Cheng wei
6035bb6320 fix invalid syntax for python3 2013-07-06 02:52:17 +08:00
Cheng wei
27cf9cfd62 fix syntax invalid
* python3.2 not support unicode literal
* unicode regex as normal
2013-07-06 02:51:13 +08:00
Sun Junyi
9d0ea771a5 fix bug; decimals & digit-english mixed 2013-07-05 16:16:49 +08:00
Sun Junyi
ba5114dc95 update whoosh example 2013-07-04 09:31:09 +08:00
Sun Junyi
4b237f79fa add test/tmp/* into git ignore 2013-07-03 17:56:15 +08:00
Sun Junyi
f424862222 clean the files in tmp 2013-07-03 17:55:01 +08:00
Sun Junyi
b18d56d2a3 Merge pull request #72 from linkerlin/master
添加一个tmp目录,好让test_whoosh.py可以运行。
2013-07-03 02:52:46 -07:00
Sun Junyi
b9b1f1a418 fix conflict of merging 2013-07-03 17:47:45 +08:00
miao.lin
becd32b178 made test_whoosh.py happy.
添加一个tmp目录,好让test_whoosh.py可以运行。
2013-07-03 17:32:35 +08:00
Sun Junyi
c01680c6a8 merge the new file 2013-07-03 17:29:33 +08:00
Sun Junyi
b62f052927 PEP8 2013-07-03 17:21:21 +08:00
Sun Junyi
9ea14a8a54 merge chage from chao78787 2013-07-03 17:07:16 +08:00
Sun Junyi
45daf561c7 follow PEP8: change tab to 4 white spaces 2013-07-03 16:58:22 +08:00
Sun Junyi
632a086035 Merge pull request #71 from chao787/feature_richard
Separate cal and IO process.
2013-07-03 01:57:34 -07:00
Richard Wong
3246236133 Separate cal and IO process. 2013-07-03 15:03:45 +08:00
Sun Junyi
e1c1d46324 Update README.md 2013-07-01 12:43:33 +08:00
Sun Junyi
915b3164b0 Update README.md 2013-07-01 11:47:15 +08:00
Sun Junyi
45e6594a09 Update README.md 2013-07-01 11:46:16 +08:00
Sun Junyi
0886875af3 0.3 released 2013-07-01 11:36:16 +08:00
Sun Junyi
dbec3ad9df add some comments 2013-07-01 11:20:56 +08:00
Sun Junyi
efc784312c add ChineseAnalyzer for whoosh search engine 2013-07-01 10:53:39 +08:00
Sun Junyi
f08690a2df add 'search mode' for jieba.tokenize 2013-06-28 12:04:16 +08:00
Sun Junyi
237dc6625e add mix words to extra_dict/dict.txt.big 2013-06-26 09:36:41 +08:00
Sun Junyi
cb1b0499f7 unittest for jieba.tokenize 2013-06-24 16:20:04 +08:00
Sun Junyi
11a3b10755 new method: jieba.tokenize 2013-06-24 16:14:11 +08:00
Sun Junyi
8eab1cdb6d Merge branch 'master' of https://github.com/fxsjy/jieba 2013-06-24 13:48:30 +08:00
Sun Junyi
1a3be67691 make cache dumping more robust 2013-06-24 13:48:16 +08:00
Sun Junyi
465e475460 Update README.md 2013-06-24 12:24:50 +09:00
Sun Junyi
ca97b19951 merge change from master 2013-06-23 22:28:32 +08:00
Sun Junyi
38b6bcd54e remove some words 2013-06-23 21:52:22 +08:00
fxsjy
e1afafe353 fix a bug of cxfree support 2013-06-23 12:50:28 +08:00
fxsjy
a9f53e9c85 don't separate CRLF 2013-06-22 21:56:39 +08:00
fxsjy
c015f4e297 support cxfree py2exe; keep white space 2013-06-22 21:24:45 +08:00
fxsjy
7343679ba8 fix a bug in parallel mode 2013-06-21 15:09:27 +08:00
Sun Junyi
c0816b9bb0 more mixed words 2013-06-18 18:09:55 +08:00
Sun Junyi
c9e8da9e63 add more mix words to dict.txt 2013-06-18 14:10:36 +08:00
Sun Junyi
322e8e48b6 Update Changelog 2013-06-17 10:31:31 +09:00
Sun Junyi
1d06f124d6 Update Changelog 2013-06-17 09:31:09 +08:00
Sun Junyi
dbfd0e0f63 minor version 2013-06-17 09:24:10 +08:00
Sun Junyi
cfcfb26792 Merge branch 'master' of https://github.com/fxsjy/jieba 2013-06-16 13:22:02 +08:00
Sun Junyi
9d1e23ce6f speed up the viterbi 2013-06-16 13:21:43 +08:00
Sun Junyi
b1238a2306 Update README.md 2013-06-14 13:01:07 +09:00
Sun Junyi
02e9a0328d Update README.md 2013-06-14 09:06:15 +08:00
Sun Junyi
b050bfe946 remove some useless words 2013-06-08 15:40:01 +08:00
fxsjy
08bfabb9d7 Merge branch 'jieba3k' of https://github.com/fxsjy/jieba into jieba3k 2013-06-08 11:30:07 +08:00
fxsjy
be1686654d merge master to jieba3k 2013-06-08 11:18:56 +08:00
fxsjy
69e584677a Merge branch 'master' of https://github.com/fxsjy/jieba 2013-06-08 10:48:11 +08:00
Sun Junyi
7993a3ea73 version 0.29 2013-06-07 18:23:19 +08:00
fxsjy
bdfaaa4eea Merge branch 'master' of https://code.csdn.net/fxsjy/jieba 2013-06-07 18:11:58 +08:00
fxsjy
1febdf847f clear 2013-06-07 18:11:11 +08:00
fxsjy
ffea881a46 second commit 2013-06-07 18:03:21 +08:00
979a9177ae first commit 2013-06-07 17:47:16 +08:00
fxsjy
e12e176d17 rollback, no obvious speed-up from the previous change 2013-06-07 15:51:48 +08:00
fxsjy
d3531f197d rollback, no obvious speed-up from the previous change 2013-06-07 15:51:13 +08:00
fxsjy
f2d6abf063 speed up of viterbi 2013-06-07 14:41:55 +08:00
fxsjy
0087a4e7e3 adjust prob_trans for better support of name entity; fix some bad cases 2013-06-07 13:59:36 +08:00
Sun Junyi
872d159b61 Update README.md 2013-06-04 14:33:46 +09:00
Sun Junyi
d4943f9072 minor version change 2013-05-31 13:43:16 +08:00
Sun Junyi
0bda20db82 Merge pull request #53 from cloudaice/devbranch
Don't lose information about a function when using a decorator
2013-05-22 18:18:11 -07:00
cloudaice
dfc807e65b Don't lose information about a function when using a decorator 2013-05-23 00:25:45 +02:00
项超
df8e0ab44d Merge pull request #6 from fxsjy/master
merge master from fxsjy
2013-05-21 03:57:12 -07:00
Sun Junyi
4300f79788 add a example of using sklearn+jieba 2013-05-17 09:35:12 +08:00
Sun Junyi
a8f902545c fix some bad cases 2013-05-15 18:21:08 +08:00
项超
c6fc94a2e8 Merge pull request #5 from fxsjy/master
merge master from fxsjy
2013-05-12 02:55:45 -07:00
Sun Junyi
afea4ca1ca Merge pull request #48 from cloudaice/cloudaice-dev
格式化了demo文件,添加了jieba_test.py单元测试文件
2013-05-11 20:06:58 -07:00
cloudaice
9ee20a5293 add generator test 2013-05-11 22:50:30 +02:00
cloudaice
0c050b5eb2 add jieba.posseg test case 2013-05-11 17:40:43 +02:00
cloudaice
b0f9e6721e 添加cutall 测试用例 2013-05-11 17:40:43 +02:00
cloudaice
a7ff398edc 添加cut,set_dictionary,cut_for_search三个测试用例 2013-05-11 17:40:43 +02:00
cloudaice
667203a9ae 替换tab为空格,使用join代替循环 2013-05-11 17:40:43 +02:00
cloudaice
a2d2078465 将tab换成空格,使用is判断对象是否为None 2013-05-11 17:40:42 +02:00
cloudaice
7ce5116a93 规范readme的示例代码 2013-05-11 17:40:42 +02:00
cloudaice
e0434871eb 修改demo.py的代码格式,使得符合pep8规范 2013-05-11 17:40:42 +02:00
项超
5e1ccf2086 Merge pull request #4 from fxsjy/master
sync code
2013-05-11 08:39:33 -07:00
项超
4a9f2d1e19 Merge pull request #3 from fxsjy/master
捕获明确的错误
2013-05-10 02:56:02 -07:00
Sun Junyi
37a179436f Merge pull request #46 from cloudaice/cloudaice-dev
明确声明处理的异常
2013-05-10 02:50:07 -07:00
cloudaice
9b0f60df93 Catch明确的错误 2013-05-10 11:26:27 +02:00
项超
65d07d2ddf Merge pull request #2 from fxsjy/master
使用更明确的表达
2013-05-10 02:25:32 -07:00
项超
c691a23084 Merge pull request #1 from fxsjy/master
使用更明确的表达
2013-05-10 02:24:27 -07:00
Sun Junyi
c2f4b04722 Merge pull request #45 from cloudaice/cloudaice-dev
使用更明确的表达
2013-05-10 02:20:29 -07:00
cloudaice
8ba8735f46 使用更明确的表达 2013-05-10 11:09:41 +02:00
Sun Junyi
c2ebfd8d00 Merge branch 'master' of https://github.com/fxsjy/jieba 2013-05-02 17:01:59 +08:00
Sun Junyi
c1bf815343 update test case 2013-05-02 17:01:16 +08:00
Sun Junyi
5cf9034625 Update README.md 2013-05-02 14:48:48 +08:00
Sun Junyi
a9f92e37ce Update Changelog 2013-05-02 11:39:22 +08:00
Sun Junyi
1cb721689c minor version 2013-05-02 11:37:13 +08:00
Sun Junyi
4eca1a2f47 Merge branch 'master' into jieba3k 2013-05-02 11:27:07 +08:00
Sun Junyi
ff4ea5d882 fix a bug of file leak 2013-05-02 11:24:22 +08:00
Sun Junyi
0e833cd441 fix a bug in py3k test case 2013-04-28 19:40:24 +08:00
Sun Junyi
de9e7f61c3 Merge branch 'master' into jieba3k 2013-04-28 19:32:14 +08:00
Sun Junyi
1275b3679f Merge branch 'master' of https://github.com/fxsjy/jieba 2013-04-28 12:04:32 +08:00
Sun Junyi
35aa38ed12 fix a bug caused by default argument binding 2013-04-28 12:04:16 +08:00
Sun Junyi
3c8913e0e0 Update README.md 2013-04-27 16:21:48 +08:00
Sun Junyi
273996f7d4 fix a test script in jieba3k 2013-04-27 16:18:40 +08:00
fxsjy
aae91b6fb6 merge change from master to jieba3k 2013-04-27 16:04:16 +08:00
fxsjy
2a2095e512 xx 2013-04-27 14:26:57 +08:00
Sun Junyi
ae15492257 Update README.md 2013-04-27 11:01:46 +08:00
Sun Junyi
da635859d4 Update README.md 2013-04-27 10:56:10 +08:00
Sun Junyi
9e4fce6b68 Update Changelog 2013-04-27 10:42:21 +08:00
Sun Junyi
1f51b2a3ff minor version change 2013-04-27 10:26:12 +08:00
Sun Junyi
c1d143385f Merge branch 'master' of https://github.com/fxsjy/jieba 2013-04-27 10:23:17 +08:00
Sun Junyi
94d455b079 hot fix of cut_all=True 2013-04-27 10:23:01 +08:00
Sun Junyi
347a3a8034 Update Changelog 2013-04-27 10:10:32 +08:00
Sun Junyi
59d5d3b811 fix bug and change version 2013-04-27 09:45:39 +08:00
fxsjy
c8df565981 more log trace for trouble shooting 2013-04-26 17:43:24 +08:00
fxsjy
04eb4f08cf fix a bug of changing dictionary 2013-04-26 16:48:46 +08:00
fxsjy
8666428fb0 fix a bug of changing dictionary 2013-04-26 16:47:00 +08:00
fxsjy
9bebe6120b utf-8 output is more friendly to Linux 2013-04-26 16:19:00 +08:00
Sun Junyi
d3339633d5 in the speed test: initialize first to ignore the time of dict loading 2013-04-26 14:51:58 +08:00
fxsjy
bc049090a5 make lazy load thread safe 2013-04-26 12:54:05 +08:00
fxsjy
d2460029d5 merge lazy load 2013-04-26 09:57:06 +08:00
Herman Schaaf
7342a18534 Update readme in both languages with new functions 2013-04-25 21:46:15 +09:00
Herman Schaaf
c6098a8657 Add initialize function and lazy initialization 2013-04-25 21:04:56 +09:00
fxsjy
47d94a13e6 log(1)==0, since we have changed from PRODUCT to sum of LOG 2013-04-25 10:11:04 +08:00
fxsjy
c350fab2b9 fix wrong line number 2013-04-25 09:28:00 +08:00
fxsjy
65b78b2b4d read() and then split -- faster; from __future__ import with 2013-04-24 22:14:10 +08:00
Sun Junyi
966532b462 Merge pull request #39 from neuront/master
auto close file; locate error when failing to parse
2013-04-24 07:00:50 -07:00
Neuron Teckid
166c2ca7a5 auto close file; locate error when failing to parse 2013-04-24 19:01:08 +08:00
Sun Junyi
5f8435ce58 Update README.md 2013-04-22 15:57:36 +08:00
Sun Junyi
7337c6d420 Merge branch 'master' of https://github.com/fxsjy/jieba 2013-04-22 13:27:00 +08:00
Sun Junyi
ceae5c56d8 add changelog 2013-04-22 13:26:40 +08:00
Sun Junyi
604e6910e2 Update README.md 2013-04-22 13:08:23 +08:00
Sun Junyi
9af4d0a9d9 Update README.md 2013-04-22 12:49:54 +08:00
Sun Junyi
b06d6de174 Update README.md 2013-04-22 12:49:22 +08:00
Sun Junyi
f2fa585f3a Update README.md 2013-04-22 12:48:49 +08:00
Sun Junyi
825da757d0 Update README.md 2013-04-22 12:47:31 +08:00
Sun Junyi
1bb497ac09 version change 2013-04-22 12:37:02 +08:00
fxsjy
3f003e2f29 new method: jieba.disable_parallel, which is the inverse operation of jieba.enable_parallel 2013-04-22 12:35:17 +08:00
fxsjy
b46166f768 use CRLF as seperator to make chunks in parallel mode 2013-04-20 18:46:04 +08:00
fxsjy
6b83593b5a rm stub.log 2013-04-20 14:13:10 +08:00
fxsjy
62cf22121f new feature: parallel segment with multiprocessing 2013-04-20 14:11:31 +08:00
Sun Junyi
6da857b554 merge changes from master branch 2013-04-19 10:21:34 +08:00
Sun Junyi
8d89e8afda handle 的 2013-04-19 10:02:33 +08:00
Sun Junyi
012fddf13f ignore white space 2013-04-12 22:37:53 +08:00
fxsjy
45591bb9ab support flag '_'; ignore white space 2013-04-12 21:53:03 +08:00
Sun Junyi
c77823aa1d merge improvement to Py3k branch 2013-04-12 14:58:25 +08:00
Sun Junyi
afdcb8a77d Update README.md 2013-04-08 09:56:41 +08:00
Sun Junyi
94ad7e7035 support decimal point 2013-04-08 09:53:04 +08:00
Sun Junyi
72fff6c8e2 support decimal point 2013-04-08 09:40:32 +08:00
Sun Junyi
a383f035ba support decimal point: example PI=3.14159 => PI / = / 3.14159 2013-04-08 09:38:49 +08:00
Sun Junyi
7ce3433316 fix bug: python2.6 does not support CRLF in eval(astring) 2013-04-07 22:55:06 +08:00
fxsjy
600a7fc285 CRLF to LF 2013-04-07 22:30:18 +08:00
fxsjy
ddeb766202 CRLF to LF 2013-04-07 22:29:39 +08:00
fxsjy
6632bb80ec CRLF to LF 2013-04-07 22:27:58 +08:00
fxsjy
f1d5d90ae6 CRLF to LF 2013-04-07 22:27:17 +08:00
Sun Junyi
fcb3747814 Update README.md 2013-04-07 11:03:54 +08:00
Sun Junyi
9fd2b38293 Update README.md 2013-04-07 11:02:49 +08:00
Sun Junyi
4a9193de4f Update README.md 2013-04-07 11:00:30 +08:00
Sun Junyi
a600868363 version change 2013-04-07 09:36:04 +08:00
Sun Junyi
659326c4e1 punctuation; improve keywords extraction 2013-04-06 14:02:11 +08:00
Sun Junyi
7d227da5c4 punctuation 2013-04-05 22:49:16 +08:00
Sun Junyi
8e49199993 keep punctuation marks 2013-04-05 21:48:36 +08:00
Sun Junyi
58c363655c support user defined word tag 2013-03-25 17:28:37 +08:00
Sun Junyi
44e19a2e27 fix bug in pypy 2013-03-22 15:20:19 +08:00
Sun Junyi
6cc0e95759 rm 1.log 2013-03-22 15:19:57 +08:00
Sun Junyi
d2634a049b fix a bug in pypy 2013-03-22 15:16:47 +08:00
Sun Junyi
0f4f9067c3 fix bugs in jieba for py3k 2013-03-21 11:10:57 +08:00
Sun Junyi
87c2799692 Update README.md 2013-02-18 10:55:19 +08:00
Sun Junyi
121a457e82 Update README.md 2013-02-18 10:54:36 +08:00
Sun Junyi
5e861921f2 version change 2013-02-18 10:50:30 +08:00
Sun Junyi
8a699cf462 extra dictionary 2013-02-18 10:48:16 +08:00
Sun Junyi
d58402c8f6 for issue 26 2013-02-18 10:31:20 +08:00
Sun Junyi
981d58e106 for issue 26 2013-02-18 10:20:17 +08:00
Sun Junyi
182289c2eb for issue 25 2013-02-17 17:25:40 +08:00
Sun Junyi
13e3850ba8 try to solve this issue: https://github.com/fxsjy/jieba/issues/25 2013-02-17 17:06:47 +08:00
Sun Junyi
1edc1651ee try to fix this issue: https://github.com/fxsjy/jieba/issues/26 2013-02-17 16:04:51 +08:00
Sun Junyi
8d8e50fbf9 Merge branch 'master' of https://github.com/fxsjy/jieba 2012-12-28 11:30:35 +08:00
Sun Junyi
fd20cbbd4b use logarithmic addition instead of multiplication, to avoid bad case in issue19 2012-12-28 11:29:51 +08:00
Sun Junyi
f8f3db7cc4 Update README.md
use Appfog as demo site platform
2012-12-22 18:48:17 +08:00
Sun Junyi
263a9947bd Update README.md 2012-12-18 12:55:27 +08:00
Sun Junyi
88adb0c78e Update README.md 2012-12-12 22:21:30 +08:00
Sun Junyi
06ebc6f71c en-chn mix words in POS 2012-12-12 14:24:44 +08:00
Sun Junyi
a8ae0398b4 add one example 2012-12-12 13:40:22 +08:00
Sun Junyi
a879ac0db9 version change 2012-12-12 13:36:25 +08:00
Sun Junyi
8c875e80ae Merge branch 'master' of https://github.com/fxsjy/jieba 2012-12-12 11:05:07 +08:00
Sun Junyi
6517119110 remove 1.log 2012-12-12 11:04:35 +08:00
Sun Junyi
8c05efed68 remove tlbb.txt 2012-12-12 11:04:19 +08:00
Sun Junyi
379cd4933a support en-chn mixed words, like B超 2012-12-12 11:03:29 +08:00
Sun Junyi
2cbcd2d2a5 Update README.md 2012-11-28 11:31:47 +08:00
Sun Junyi
04d08f25d1 update doc: 2012-11-28 11:14:44 +08:00
Sun Junyi
9c07d80edb first py3k version of jieba 2012-11-28 10:50:40 +08:00
Sun Junyi
3f193540ca Update README.md 2012-11-27 15:31:58 +08:00
Sun Junyi
9f10122257 Update README.md 2012-11-27 14:08:09 +08:00
103 changed files with 1757216 additions and 126379 deletions

.gitignore (13 lines changed)

@@ -113,8 +113,10 @@ Generated_Code #added for RIA/Silverlight projects
_UpgradeReport_Files/
Backup*/
UpgradeLog*.XML
############
## pycharm
############
.idea
############
## Windows
@@ -161,3 +163,10 @@ pip-log.txt
# Mac crap
.DS_Store
*.log
test/tmp/*
#jython
*.class
MANIFEST

Changelog (new file, 196 lines)

@@ -0,0 +1,196 @@
2020-1-20: version 0.42.1
1. 修复setup.py在python2.7版本无法工作的问题 (issue #809)
2020-1-13: version 0.42
1. 修复paddle模式空字符串coredump问题 @JesseyXujin
2. 修复cut_all模式切分丢字问题 @fxsjy
3. paddle安装检测优化 @vissssa
2020-1-8: version 0.41
1. 开启paddle模式更友好
2. 修复cut_all模式不支持中英混合词的bug
2019-12-25: version 0.40
1. 支持基于paddle的深度学习分词模式(use_paddle=True); by @JesseyXujin, @xyzhou-puck
2. 修复自定义Tokenizer实例的add_word方法指向全局的问题; by @linhx13
3. 修复whoosh测试用例的引用bug; by @ZhengZixiang
4. 修复自定义词库不支持含"-"符号的问题; by @JimCurryWang
2017-08-28: version 0.39
1. del_word支持强行拆开词语; by @gumblex,@fxsjy
2. 修复百分数的切词; by @fxsjy
3. 修复HMM=False在多进程模式下的bug; by @huntzhan
2015-12-16: version 0.38
1. 通过pkg_resources载入默认词典支持在Spark等平台上运行, by @gumblex;
2. 扩充识别的汉字unicode范围[\u4E00-\u9FD5], by @gumblex;
3. 关键词提取支持返回词性;修复posseg分词得到的pair做dict关键字的问题; by @jerryday
4. 修复load_userdict加载用户词典不能识别含有空格等特殊字符的问题 by @gumblex;
5. 命令行分词支持返回词性, by @gumblex;
2015-06-27: version 0.37
1. 代码重构,分词器封装为Class,支持实例化; by @gumblex (https://github.com/fxsjy/jieba/commit/94840a734c32cfece05c0c3ec236ffc3d36b4ae6)
2. 修复cut_for_search的bug完善posseg by @gumblex
3. 修复posseg在0.36中引入的一处bug; by @wangbin
4. 修复load_userdict异常处理的bug; by @gip0
5. 修复生成词典二进制cache文件时跨文件系统的bug, 支持自定义; by @gumblex
2015-03-20: version 0.36
1. 代码同时兼容python2与python3, 若干性能优化; by @gumblex
2. 解决用户添加词的概率自动计算问题,分词更加准确; by @gumblex
3. 可自定义cache_file的文件系统路径; by @changyy
4. TextRank算法实现完善; by @sing1ee@walkskyer
2014-11-15: version 0.35.1
1. 修复 Python 3.2 的兼容性问题
2014-11-13: version 0.35
1. 改进词典cache的dump和加载机制; by @gumblex
2. 提升关键词提取的性能; by @gumblex
3. 关键词提取新增基于textrank算法的子模块; by @singlee
4. 修复自定义stopwords功能的bug; by @walkskyer
2014-10-20: version 0.34
1. 提升性能,词典结构由Trie改为Prefix Set,内存占用减少2/3, 详见 https://github.com/fxsjy/jieba/pull/187 ; by @gumblex
2. 修复关键词提取功能的性能问题
2014-08-31: version 0.33
1. 支持自定义stop words; by @fukuball
2. 支持自定义idf词典; by @fukuball
3. 修复自定义词典的词性不能正常显示的bug; by @ShuraChow
2014-02-07: version 0.32
1. 新增分词选项,可以关闭新词发现功能,详见 https://github.com/fxsjy/jieba/blob/master/test/test_no_hmm.py#L8
2. 修复posseg子模块的Bug,详见: https://github.com/fxsjy/jieba/issues/111 https://github.com/fxsjy/jieba/issues/132
3. ChineseAnalyzer提供了更好的英文支持(感谢@jannson),例如单词Stemming,详见 https://github.com/fxsjy/jieba/pull/106
2013-07-01: version 0.31
1. 修改了代码缩进格式遵循PEP8标准
2. 支持Jython解析器感谢 @piaolingxue
3. 修复中英混合词汇不能识别数字在前词语的Bug
4. 部分代码重构,感谢 @chao78787
5. 多进程并行分词模式下自动检测CPU个数设置合适的进程数感谢@linkerlin
6. 修复了0.3版中jieba.extra_tags方法对whoosh模块的错误依赖
2013-07-01: version 0.30
==========================
1) 新增jieba.tokenize方法,返回每个词的起始位置
2) 新增ChineseAnalyzer,用于支持whoosh搜索引擎
3) 添加了更多的中英混合词汇
4) 修改了一些py文件的加载方法,从而支持py2exe、cxfree打包为exe
2013-06-17: version 0.29.1
==========================
1) 优化了viterbi算法的代码分词速度提升15%
2) 去除了词典中的一些低质词
2013-06-07: version 0.29
==========================
1) 提升了finalseg子模块命名体识别的准确度
2) 修正了一些badcase
2013-06-01: version 0.28.4
==========================
1) 修正了一些badcase
2) add wraps decorator, by @cloudaice
3) unittest, by @cloudaice
2013-05-02: version 0.28.3
==========================
1) 修正了临时cache文件生成在pypy解析器下出错的问题
2013-04-28: version 0.28.2
==========================
1) 修正了initialize函数默认参数绑定的bug.
2013-04-27: version 0.28
========================
1) 新增词典lazy load功能用户可以在'import jieba'后再改变词典的路径. 感谢hermanschaaf
2) 显示词典加载异常时错误的词条信息. 感谢neuront
3) 修正了词典被vim编辑后会加载失败的bug. 感谢neuront
2013-04-22: version 0.27
========================
1) 新增并行分词功能,可以在多核计算机上显著提高分词速度
2) 修正了“的”字频过高引起的bug修正了对小数点和下划线的处理
3) 修正了python2.6存在的兼容性问题
2013-04-07: version 0.26
========================
1) 改进了对标点符号的处理,之前的版本会过滤掉所有的标点符号;
2) 允许用户在自定义词典中添加词性;
3) 改进了关键词提取的功能jieba.analyse.extract_tags;
4) 修复了一个在pypy解释器下运行的bug.
2013-02-18: version 0.25
========================
1) 支持繁体中文的分词
2) 修正了多python进程时生成cache文件失败的bug
2012-12-28: version 0.24
========================
1) 解决了没有标点的长句子分词效果差的问题问题在于连续的小概率乘法可能会导致浮点下溢或为0.
2) 修复了0.23的全模式下英文分词的bug
2012-12-12: version 0.23
========================
1) 修复了之前版本不能识别中英混合词语的问题
2012-11-28: version 0.22
========================
1) 新增jieba.cut_for_search方法 该方法在精确分词的基础上对“长词”进行再次切分,适用于搜索引擎领域的分词,比精确分词模式有更高的召回率。
2) 开始支持Python3.x版。 之前一直是只支持Python2.x系列从这个版本起有一个单独的jieba3k
2012-11-23: version 0.21
========================
1) 修复了全模式分词中散字过多的问题
2) 用户自定义词典函数load_userdict支持file-like object作为输入
2012-11-06: version 0.20
========================
1) 新增词性标注功能
2012-10-25: version 0.19
========================
1) 提升了模块加载的速度
2) 增加了用户自定义词典的接口
2012-10-16: version 0.18
========================
1) 增加关键词提取功能
2012-10-12: version 0.17
========================
1) 将词典文件dict.txt排序后存储,提升了Trie树构建速度,使得组件初始化时间缩短了10%;
2) 增强了人名词语的训练,增强了未登录人名词语的识别能力
2012-10-09: version 0.16
========================
1) 将求最优切分路径的记忆化递归搜索算法改用循环实现,使分词速度提高了15%
2) 修复了Viterbi算法实现上的一个Bug
2012-10-07: version 0.14
========================
1) 结巴分词被发布到了pypi用户可以通过easy_install或者pip快速安装该组件
2) 合并了搜狗开源词库2006版删除了一些低频词
3) 优化了代码,缩短了程序初始化时间。
4) 增加了在线效果演示

LICENSE (new file, 20 lines)

@@ -0,0 +1,20 @@
The MIT License (MIT)
Copyright (c) 2013 Sun Junyi
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

MANIFEST.in (new file, 2 lines)

@@ -0,0 +1,2 @@
graft README.md
graft Changelog

README.md (821 lines changed)

@@ -1,125 +1,483 @@
jieba
========
"结巴"中文分词做最好的Python中文分词组件
“结巴”中文分词:做最好的 Python 中文分词组件
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
- _Scroll down for English documentation._
Feature
========
* 支持两种分词模式:
* 1精确模式试图将句子最精确地切开适合文本分析
* 2全模式把句子中所有的可以成词的词语都扫描出来, 速度非常快,但是不能解决歧义;
* 3) 搜索引擎模式,在精确模式的基础上,对长词再次切分,提高召回率,适合用于搜索引擎分词。
Usage
特点
========
* 全自动安装:`easy_install jieba` 或者 `pip install jieba`
* 半自动安装先下载http://pypi.python.org/pypi/jieba/ 解压后运行python setup.py install
* 手动安装将jieba目录放置于当前目录或者site-packages目录
* 通过import jieba 来引用 第一次import时需要构建Trie树需要几秒时间
* 支持四种分词模式:
* 精确模式,试图将句子最精确地切开,适合文本分析;
* 全模式,把句子中所有的可以成词的词语都扫描出来, 速度非常快,但是不能解决歧义;
* 搜索引擎模式,在精确模式的基础上,对长词再次切分,提高召回率,适合用于搜索引擎分词。
* paddle模式利用PaddlePaddle深度学习框架训练序列标注双向GRU网络模型实现分词。同时支持词性标注。paddle模式使用需安装paddlepaddle-tiny`pip install paddlepaddle-tiny==1.6.1`。目前paddle模式支持jieba v0.40及以上版本。jieba v0.40以下版本请升级jieba`pip install jieba --upgrade` 。[PaddlePaddle官网](https://www.paddlepaddle.org.cn/)
* 支持繁体分词
* 支持自定义词典
* MIT 授权协议
Algorithm
安装说明
=======
代码对 Python 2/3 均兼容
* 全自动安装:`easy_install jieba` 或者 `pip install jieba` / `pip3 install jieba`
* 半自动安装:先下载 http://pypi.python.org/pypi/jieba/ ,解压后运行 `python setup.py install`
* 手动安装:将 jieba 目录放置于当前目录或者 site-packages 目录
* 通过 `import jieba` 来引用
* 如果需要使用paddle模式下的分词和词性标注功能请先安装paddlepaddle-tiny`pip install paddlepaddle-tiny==1.6.1`
算法
========
* 基于Trie树结构实现高效的词图扫描生成句子中汉字所有可能成词情况所构成的有向无环图DAG)
* 基于前缀词典实现高效的词图扫描,生成句子中汉字所有可能成词情况所构成的有向无环图 (DAG)
* 采用了动态规划查找最大概率路径, 找出基于词频的最大切分组合
* 对于未登录词采用了基于汉字成词能力的HMM模型使用了Viterbi算法
* 对于未登录词,采用了基于汉字成词能力的 HMM 模型,使用了 Viterbi 算法
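To make the three bullets above concrete, the sketch below (not jieba's actual code) builds the DAG from a toy prefix dictionary and then walks it with right-to-left dynamic programming over log frequencies; the `FREQ` table and its counts are made-up values for illustration only.
```python
# Simplified illustration of the prefix-dictionary DAG + max-probability path.
# FREQ/TOTAL are toy values, not jieba's real dictionary.
import math

FREQ = {"我": 2, "来": 3, "来到": 4, "到": 3, "北": 1, "北京": 6, "京": 1,
        "清": 1, "清华": 5, "清华大学": 8, "华": 1, "大": 2, "大学": 6, "学": 2}
TOTAL = sum(FREQ.values())

def get_dag(sentence):
    """For each start index, list every end index that forms a known word."""
    dag = {}
    for k in range(len(sentence)):
        ends = [j for j in range(k, len(sentence)) if sentence[k:j + 1] in FREQ]
        dag[k] = ends or [k]                       # fall back to the single character
    return dag

def best_route(sentence, dag):
    """Dynamic programming from right to left over log-probabilities."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    log_total = math.log(TOTAL)
    for k in range(n - 1, -1, -1):
        route[k] = max(
            (math.log(FREQ.get(sentence[k:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[k]
        )
    return route

sentence = "我来到北京清华大学"
route, k, words = best_route(sentence, get_dag(sentence)), 0, []
while k < len(sentence):
    end = route[k][1]
    words.append(sentence[k:end + 1])
    k = end + 1
print("/ ".join(words))                            # 我/ 来到/ 北京/ 清华大学
```
Unregistered words that fall outside the dictionary are the part handled by the HMM/Viterbi step mentioned above, which this sketch omits.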
功能 1):分词
==========
* `jieba.cut`方法接受两个输入参数: 1) 第一个参数为需要分词的字符串 2cut_all参数用来控制是否采用全模式
* `jieba.cut_for_search`方法接受一个参数:需要分词的字符串,该方法适合用于搜索引擎构建倒排索引的分词,粒度比较细
* 注意待分词的字符串可以是gbk字符串、utf-8字符串或者unicode
* `jieba.cut`以及`jieba.cut_for_search`返回的结构都是一个可迭代的generator可以使用for循环来获得分词后得到的每一个词语(unicode)也可以用list(jieba.cut(...))转化为list
主要功能
=======
1. 分词
--------
* `jieba.cut` 方法接受四个输入参数: 需要分词的字符串cut_all 参数用来控制是否采用全模式HMM 参数用来控制是否使用 HMM 模型use_paddle 参数用来控制是否使用paddle模式下的分词模式paddle模式采用延迟加载方式通过enable_paddle接口安装paddlepaddle-tiny并且import相关代码
* `jieba.cut_for_search` 方法接受两个参数:需要分词的字符串;是否使用 HMM 模型。该方法适合用于搜索引擎构建倒排索引的分词,粒度比较细
* 待分词的字符串可以是 unicode 或 UTF-8 字符串、GBK 字符串。注意:不建议直接输入 GBK 字符串,可能无法预料地错误解码成 UTF-8
* `jieba.cut` 以及 `jieba.cut_for_search` 返回的结构都是一个可迭代的 generator可以使用 for 循环来获得分词后得到的每一个词语(unicode),或者用
* `jieba.lcut` 以及 `jieba.lcut_for_search` 直接返回 list
* `jieba.Tokenizer(dictionary=DEFAULT_DICT)` 新建自定义分词器,可用于同时使用不同词典。`jieba.dt` 为默认分词器,所有全局分词相关函数都是该分词器的映射。
代码示例( 分词 )
代码示例
#encoding=utf-8
import jieba
```python
# encoding=utf-8
import jieba
seg_list = jieba.cut("我来到北京清华大学",cut_all=True)
print "Full Mode:", "/ ".join(seg_list) #全模式
jieba.enable_paddle()# 启动paddle模式。 0.40版之后开始支持,早期版本不支持
strs=["我来到北京清华大学","乒乓球拍卖完了","中国科学技术大学"]
for str in strs:
seg_list = jieba.cut(str,use_paddle=True) # 使用paddle模式
print("Paddle Mode: " + '/'.join(list(seg_list)))
seg_list = jieba.cut("我来到北京清华大学",cut_all=False)
print "Default Mode:", "/ ".join(seg_list) #精确模式
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list)) # 全模式
seg_list = jieba.cut("他来到了网易杭研大厦")
print ", ".join(seg_list)
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list)) # 精确模式
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") #搜索引擎模式
print ", ".join(seg_list)
seg_list = jieba.cut("他来到了网易杭研大厦") # 默认是精确模式
print(", ".join(seg_list))
Output:
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") # 搜索引擎模式
print(", ".join(seg_list))
```
【全模式】: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
输出:
【精确模式】: 我/ 来到/ 北京/ 清华大学
【全模式】: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
【新词识别】:他, 来到, 了, 网易, 杭研, 大厦 (此处“杭研”并没有在词典中但是也被Viterbi算法识别出来了)
【精确模式】: 我/ 来到/ 北京/ 清华大学
【搜索引擎模式】: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在
, 日本, 京都, 大学, 日本京都大学, 深造
【新词识别】:他, 来到, 了, 网易, 杭研, 大厦 (此处“杭研”并没有在词典中但是也被Viterbi算法识别出来了)
功能 2) :添加自定义词典
================
【搜索引擎模式】: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
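Besides the generator-based `jieba.cut`, the list-returning helpers and the independent `Tokenizer` instances mentioned in the bullets above can be used as in this minimal sketch (the added word is just an example):
```python
# Minimal sketch of the list-returning helpers and an independent Tokenizer.
import jieba

print(jieba.lcut("我来到北京清华大学"))                        # ['我', '来到', '北京', '清华大学']
print(jieba.lcut_for_search("小明硕士毕业于中国科学院计算所"))   # finer-grained list for indexing

tk = jieba.Tokenizer()                    # independent of the global jieba.dt
tk.add_word("杭研大厦")                    # only affects this tokenizer instance
print(tk.lcut("他来到了网易杭研大厦"))
print(jieba.lcut("他来到了网易杭研大厦"))   # the global tokenizer is unchanged
```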
2. 添加自定义词典
----------------
### 载入词典
* 开发者可以指定自己自定义的词典,以便包含 jieba 词库里没有的词。虽然 jieba 有新词识别能力,但是自行添加新词可以保证更高的正确率
* 用法: jieba.load_userdict(file_name) # file_name 为文件类对象或自定义词典的路径
* 词典格式和 `dict.txt` 一样,一个词占一行;每一行分三部分:词语、词频(可省略)、词性(可省略),用空格隔开,顺序不可颠倒。`file_name` 若为路径或二进制方式打开的文件,则文件必须为 UTF-8 编码。
* 词频省略时使用自动计算的能保证分出该词的词频。
**例如:**
```
创新办 3 i
云计算 5
凱特琳 nz
台中
```
* 更改分词器(默认为 `jieba.dt`)的 `tmp_dir``cache_file` 属性,可分别指定缓存文件所在的文件夹及其文件名,用于受限的文件系统。
* 开发者可以指定自己自定义的词典以便包含jieba词库里没有的词。虽然jieba有新词识别能力但是自行添加新词可以保证更高的正确率
* 用法: jieba.load_userdict(file_name) # file_name为自定义词典的路径
* 词典格式和`analyse/idf.txt`一样,一个词占一行;每一行分为两部分,一部分为词语,另一部分为词频,用空格隔开
* 范例:
云计算 5
李小福 2
创新办 3
* 自定义词典https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
之前: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
* 用法示例https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py
加载自定义词库后: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
* 代码示例:"通过用户自定义词典来增强歧义纠错能力" --- https://github.com/fxsjy/jieba/issues/14
* 之前: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
功能 3) :关键词提取
================
* jieba.analyse.extract_tags(sentence,topK) #需要先import jieba.analyse
* setence为待提取的文本
* topK为返回几个TF/IDF权重最大的关键词默认值为20
* 加载自定义词库后: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
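A minimal sketch of the user-dictionary workflow described above; `userdict.txt` is a placeholder file in the format shown earlier (one `词语 [词频] [词性]` per line, UTF-8), such as the repo's test/userdict.txt:
```python
# Sketch: load a user dictionary so domain words are kept together.
import jieba

sent = "李小福是创新办主任也是云计算方面的专家"
print("/".join(jieba.cut(sent)))
# before loading, 创新办 and 云计算 are split apart (see the example above)

jieba.load_userdict("userdict.txt")   # placeholder path; e.g. contains 云计算 5 and 创新办 3 i
print("/".join(jieba.cut(sent)))
# after loading: ... 创新办 ... 云计算 ...
```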
### 调整词典
* 使用 `add_word(word, freq=None, tag=None)``del_word(word)` 可在程序中动态修改词典。
* 使用 `suggest_freq(segment, tune=True)` 可调节单个词语的词频,使其能(或不能)被分出来。
* 注意:自动计算的词频在使用 HMM 新词发现功能时可能无效。
代码示例:
```pycon
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中将/出错/。
>>> jieba.suggest_freq(('中', '将'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中/将/出错/。
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台/中/」/正确/应该/不会/被/切开
>>> jieba.suggest_freq('台中', True)
69
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台中/」/正确/应该/不会/被/切开
```
* "通过用户自定义词典来增强歧义纠错能力" --- https://github.com/fxsjy/jieba/issues/14
3. 关键词提取
-------------
### 基于 TF-IDF 算法的关键词抽取
`import jieba.analyse`
* jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
* sentence 为待提取的文本
* topK 为返回几个 TF/IDF 权重最大的关键词,默认值为 20
* withWeight 为是否一并返回关键词权重值,默认值为 False
* allowPOS 仅包括指定词性的词,默认值为空,即不筛选
* jieba.analyse.TFIDF(idf_path=None) 新建 TFIDF 实例idf_path 为 IDF 频率文件
代码示例 (关键词提取)
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
功能 4) : 词性标注
================
* 标注句子分词后每个词的词性采用和ictclas兼容的标记法
关键词提取所使用逆向文件频率IDF文本语料库可以切换成自定义语料库的路径
* 用法: jieba.analyse.set_idf_path(file_name) # file_name为自定义语料库的路径
* 自定义语料库示例https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
* 用法示例https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
关键词提取所使用停止词Stop Words文本语料库可以切换成自定义语料库的路径
* 用法: jieba.analyse.set_stop_words(file_name) # file_name为自定义语料库的路径
* 自定义语料库示例https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
* 用法示例https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
关键词一并返回关键词权重值示例
* 用法示例https://github.com/fxsjy/jieba/blob/master/test/extract_tags_with_weight.py
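A minimal sketch of TF-IDF keyword extraction with weights, with the optional IDF/stop-word corpus switches from the bullets above shown as comments; the sample text and local file paths are placeholders:
```python
# Sketch: TF-IDF keyword extraction with weights.
import jieba.analyse

text = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元,增资后,吉林欧亚置业注册资本由7000万元增加到5亿元。"

# Optional: switch corpora before extracting (paths are placeholders for the
# extra_dict files linked above).
# jieba.analyse.set_idf_path("extra_dict/idf.txt.big")
# jieba.analyse.set_stop_words("extra_dict/stop_words.txt")

for word, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print("%s %.4f" % (word, weight))
```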
### 基于 TextRank 算法的关键词抽取
* jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) 直接使用,接口相同,注意默认过滤词性。
* jieba.analyse.TextRank() 新建自定义 TextRank 实例
算法论文: [TextRank: Bringing Order into Texts](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)
#### 基本思想:
1. 将待抽取关键词的文本进行分词
2. 以固定窗口大小(默认为5通过span属性调整),词之间的共现关系,构建图
3. 计算图中节点的PageRank注意是无向带权图
#### 使用示例:
见 [test/demo.py](https://github.com/fxsjy/jieba/blob/master/test/demo.py)
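A minimal sketch of the TextRank interface above (the sentence and parameters are illustrative):

```python
import jieba.analyse

text = "线程是程序执行时的最小单位,它是进程的一个执行流。"

# TextRank keeps only ('ns', 'n', 'vn', 'v') POS tags unless allowPOS says otherwise
for word, weight in jieba.analyse.textrank(text, topK=5, withWeight=True):
    print("%s %f" % (word, weight))
```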
4. 词性标注
-----------
* `jieba.posseg.POSTokenizer(tokenizer=None)` 新建自定义分词器,`tokenizer` 参数可指定内部使用的 `jieba.Tokenizer` 分词器。`jieba.posseg.dt` 为默认词性标注分词器。
* 标注句子分词后每个词的词性,采用和 ictclas 兼容的标记法。
* 除了jieba默认分词模式提供paddle模式下的词性标注功能。paddle模式采用延迟加载方式通过enable_paddle()安装paddlepaddle-tiny并且import相关代码
* 用法示例
```pycon
>>> import jieba
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门") #jieba默认模式
>>> jieba.enable_paddle() #启动paddle模式。 0.40版之后开始支持,早期版本不支持
>>> words = pseg.cut("我爱北京天安门",use_paddle=True) #paddle模式
>>> for word, flag in words:
... print('%s %s' % (word, flag))
...
我 r
爱 v
北京 ns
天安门 ns
```
paddle模式词性标注对应表如下
paddle模式词性和专名类别标签集合如下表其中词性标签 24 个(小写字母),专名类别标签 4 个(大写字母)。
| 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | 标签 | 含义 |
| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- |
| n | 普通名词 | f | 方位名词 | s | 处所名词 | t | 时间 |
| nr | 人名 | ns | 地名 | nt | 机构名 | nw | 作品名 |
| nz | 其他专名 | v | 普通动词 | vd | 动副词 | vn | 名动词 |
| a | 形容词 | ad | 副形词 | an | 名形词 | d | 副词 |
| m | 数量词 | q | 量词 | r | 代词 | p | 介词 |
| c | 连词 | u | 助词 | xc | 其他虚词 | w | 标点符号 |
| PER | 人名 | LOC | 地名 | ORG | 机构名 | TIME | 时间 |
5. 并行分词
-----------
* 原理:将目标文本按行分隔后,把各行文本分配到多个 Python 进程并行分词,然后归并结果,从而获得分词速度的可观提升
* 基于 python 自带的 multiprocessing 模块,目前暂不支持 Windows
* 用法:
* `jieba.enable_parallel(4)` # 开启并行分词模式,参数为并行进程数
* `jieba.disable_parallel()` # 关闭并行分词模式
* 例子https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
* 实验结果:在 4 核 3.4GHz Linux 机器上,对金庸全集进行精确分词,获得了 1MB/s 的速度,是单进程版的 3.3 倍。
* **注意**:并行分词仅支持默认分词器 `jieba.dt` 和 `jieba.posseg.dt`。
6. Tokenize返回词语在原文的起止位置
----------------------------------
* 注意,输入参数只接受 unicode
* 默认模式
```python
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
```
```
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限公司 start: 6 end:10
```
* 搜索模式
```python
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
```
```
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限 start: 6 end:8
word 公司 start: 8 end:10
word 有限公司 start: 6 end:10
```
7. ChineseAnalyzer for Whoosh 搜索引擎
--------------------------------------------
* 引用: `from jieba.analyse import ChineseAnalyzer`
* 用法示例https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
8. 命令行分词
-------------------
使用示例:`python -m jieba news.txt > cut_result.txt`
命令行选项(翻译):
使用: python -m jieba [options] filename
结巴命令行界面。
固定参数:
filename 输入文件
可选参数:
-h, --help 显示此帮助信息并退出
-d [DELIM], --delimiter [DELIM]
使用 DELIM 分隔词语,而不是用默认的' / '。
若不指定 DELIM则使用一个空格分隔。
-p [DELIM], --pos [DELIM]
启用词性标注;如果指定 DELIM词语和词性之间
用它分隔,否则用 _ 分隔
-D DICT, --dict DICT 使用 DICT 代替默认词典
-u USER_DICT, --user-dict USER_DICT
使用 USER_DICT 作为附加词典,与默认词典或自定义词典配合使用
-a, --cut-all 全模式分词(不支持词性标注)
-n, --no-hmm 不使用隐含马尔可夫模型
-q, --quiet 不输出载入信息到 STDERR
-V, --version 显示版本信息并退出
如果没有指定文件名,则使用标准输入。
`--help` 选项输出:
$> python -m jieba --help
Jieba command line interface.
positional arguments:
filename input file
optional arguments:
-h, --help show this help message and exit
-d [DELIM], --delimiter [DELIM]
use DELIM instead of ' / ' for word delimiter; or a
space if it is used without DELIM
-p [DELIM], --pos [DELIM]
enable POS tagging; if DELIM is specified, use DELIM
instead of '_' for POS delimiter
-D DICT, --dict DICT use DICT as dictionary
-u USER_DICT, --user-dict USER_DICT
use USER_DICT together with the default dictionary or
DICT (if specified)
-a, --cut-all full pattern cutting (ignored with POS tagging)
-n, --no-hmm don't use the Hidden Markov Model
-q, --quiet don't print loading messages to stderr
-V, --version show program's version number and exit
If no filename specified, use STDIN instead.
延迟加载机制
------------
jieba 采用延迟加载,`import jieba` 和 `jieba.Tokenizer()` 不会立即触发词典的加载,一旦有必要才开始加载词典构建前缀字典。如果你想手工初始化 jieba可以手动调用
import jieba
jieba.initialize() # 手动初始化(可选)
在 0.28 之前的版本是不能指定主词典的路径的,有了延迟加载机制后,你可以改变主词典的路径:
jieba.set_dictionary('data/dict.txt.big')
例子: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py
其他词典
========
1. 占用内存较小的词典文件
https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
2. 支持繁体分词更好的词典文件
https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
下载你所需要的词典,然后覆盖 jieba/dict.txt 即可;或者用 `jieba.set_dictionary('data/dict.txt.big')`
其他语言实现
==========
结巴分词 Java 版本
----------------
作者piaolingxue
地址https://github.com/huaban/jieba-analysis
结巴分词 C++ 版本
----------------
作者yanyiwu
地址https://github.com/yanyiwu/cppjieba
结巴分词 Rust 版本
----------------
作者messense, MnO2
地址https://github.com/messense/jieba-rs
结巴分词 Node.js 版本
----------------
作者yanyiwu
地址https://github.com/yanyiwu/nodejieba
结巴分词 Erlang 版本
----------------
作者falood
地址https://github.com/falood/exjieba
结巴分词 R 版本
----------------
作者qinwf
地址https://github.com/qinwf/jiebaR
结巴分词 iOS 版本
----------------
作者yanyiwu
地址https://github.com/yanyiwu/iosjieba
结巴分词 PHP 版本
----------------
作者fukuball
地址https://github.com/fukuball/jieba-php
结巴分词 .NET(C#) 版本
----------------
作者anderscui
地址https://github.com/anderscui/jieba.NET/
结巴分词 Go 版本
----------------
+ 作者: wangbin 地址: https://github.com/wangbin/jiebago
+ 作者: yanyiwu 地址: https://github.com/yanyiwu/gojieba
结巴分词Android版本
------------------
+ 作者 Dongliang.W 地址https://github.com/452896915/jieba-android
友情链接
=========
* https://github.com/baidu/lac 百度中文词法分析(分词+词性+专名)系统
* https://github.com/baidu/AnyQ 百度FAQ自动问答系统
* https://github.com/baidu/Senta 百度情感识别系统
系统集成
========
1. Solr: https://github.com/sing1ee/jieba-solr
分词速度
=========
* 1.5 MB / Second in Full Mode
* 400 KB / Second in Default Mode
* Test Env: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz;《围城》.txt
在线演示
=========
http://209.222.69.242:9000/
* 测试环境: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz;《围城》.txt
常见问题
=========
1模型的数据是如何生成的https://github.com/fxsjy/jieba/issues/7
2这个库的授权是? https://github.com/fxsjy/jieba/issues/2
## 1. 模型的数据是如何生成的?
详见: https://github.com/fxsjy/jieba/issues/7
## 2. “台中”总是被切成“台 中”?(以及类似情况)
P(台中) < P(台)×P(中),“台中”词频不够导致其成词概率较低
解决方法:强制调高词频
`jieba.add_word('台中')` 或者 `jieba.suggest_freq('台中', True)`
## 3. “今天天气 不错”应该被切成“今天 天气 不错”?(以及类似情况)
解决方法:强制调低词频
`jieba.suggest_freq(('今天', '天气'), True)`
或者直接删除该词 `jieba.del_word('今天天气')`
## 4. 切出了词典中没有的词语,效果不理想?
解决方法:关闭新词发现
`jieba.cut('丰田太省了', HMM=False)`
`jieba.cut('我们中出了一个叛徒', HMM=False)`
**更多问题请点击**https://github.com/fxsjy/jieba/issues?sort=updated&state=closed
修订历史
==========
https://github.com/fxsjy/jieba/blob/master/Changelog
--------------------
jieba
========
Features
========
* Support three types of segmentation mode:
1. Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
2. Full Mode gets all the possible words from the sentence. Fast but not accurate.
3. Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
* Supports Traditional Chinese
* Supports customized dictionaries
* MIT License
Online demo
=========
http://jiebademo.ap01.aws.af.cm/
(Powered by Appfog)
Usage
========
* Fully automatic installation: `easy_install jieba` or `pip install jieba`
* Semi-automatic installation: Download http://pypi.python.org/pypi/jieba/ , run `python setup.py install` after extracting.
* Manual installation: place the `jieba` directory in the current directory or python `site-packages` directory.
* `import jieba`.
Algorithm
========
* Based on a prefix dictionary structure to achieve efficient word graph scanning. Build a directed acyclic graph (DAG) for all possible word combinations.
* Use dynamic programming to find the most probable combination based on the word frequency.
* For unknown words, a HMM-based model is used with the Viterbi algorithm.
Main Functions
==============
1. Cut
--------
* The `jieba.cut` function accepts three input parameters: the first parameter is the string to be cut; the second parameter is `cut_all`, controlling the cut mode; the third parameter controls whether to use the Hidden Markov Model.
* `jieba.cut_for_search` accepts two parameters: the string to be cut and whether to use the Hidden Markov Model. It cuts the sentence into short words suitable for search engines.
* The input string can be a unicode/str object, or a str/bytes object encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
* `jieba.cut` and `jieba.cut_for_search` return a generator, from which you can use a `for` loop to get the segmentation result (in unicode).
* `jieba.lcut` and `jieba.lcut_for_search` return a list.
* `jieba.Tokenizer(dictionary=DEFAULT_DICT)` creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. `jieba.dt` is the default Tokenizer, to which almost all global functions are mapped.
**Code example: segmentation**

```python
#encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # 全模式

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # 默认模式

seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # 搜索引擎模式
print(", ".join(seg_list))
```
Output:
    [Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

    [Accurate Mode]: 我/ 来到/ 北京/ 清华大学

    [Unknown Words Recognize] 他, 来到, 了, 网易, 杭研, 大厦    (In this case, "杭研" is not in the dictionary, but is identified by the Viterbi algorithm)

    [Search Engine Mode] 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
2. Add a custom dictionary
----------------------------

### Load dictionary

* Developers can specify their own custom dictionary to be included in the jieba default dictionary. Jieba is able to identify new words, but adding your own new words can ensure a higher accuracy.
* Usage `jieba.load_userdict(file_name)` # file_name is a file-like object or the path of the custom dictionary
* The dictionary format is the same as that of `dict.txt`: one word per line; each line is divided into three parts separated by a space: word, word frequency, POS tag. If `file_name` is a path or a file opened in binary mode, the dictionary must be UTF-8 encoded.
* The word frequency and POS tag can be omitted respectively. The word frequency will be filled with a suitable value if omitted.

**For example:**

```
创新办 3 i
云计算 5
凱特琳 nz
台中
```

* Change a Tokenizer's `tmp_dir` and `cache_file` to specify the path of the cache file, for using on a restricted file system.

* Example:

        云计算 5
        李小福 2
        创新办 3

        [Before] 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /

        [After]: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
### Modify dictionary
* Use `add_word(word, freq=None, tag=None)` and `del_word(word)` to modify the dictionary dynamically in programs.
* Use `suggest_freq(segment, tune=True)` to adjust the frequency of a single word so that it can (or cannot) be segmented.
* Note that HMM may affect the final result.
Example:
```pycon
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中将/出错/。
>>> jieba.suggest_freq(('中', '将'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中/将/出错/。
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台/中/」/正确/应该/不会/被/切开
>>> jieba.suggest_freq('台中', True)
69
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台中/」/正确/应该/不会/被/切开
```
3. Keyword Extraction
-----------------------
`import jieba.analyse`
* `jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())`
* `sentence`: the text to be extracted
* `topK`: return how many keywords with the highest TF/IDF weights. The default value is 20
* `withWeight`: whether return TF/IDF weights with the keywords. The default value is False
* `allowPOS`: only include words whose POS tag is in the given list. Empty for no filtering.
* `jieba.analyse.TFIDF(idf_path=None)` creates a new TFIDF instance, `idf_path` specifies IDF file path.
Example (keyword extraction)
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
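A small sketch of POS-filtered extraction with the interface above (the sentence and tag set are illustrative; with `allowPOS` and `withFlag=True` the items are posseg-style pairs):

```python
import jieba.analyse

text = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元。"

# Keep nouns and proper nouns only; withFlag=True also yields the POS tag
for pair in jieba.analyse.extract_tags(text, topK=5,
                                       allowPOS=('n', 'nz', 'ns', 'nr'),
                                       withFlag=True):
    print('%s %s' % (pair.word, pair.flag))
```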
Developers can specify their own custom IDF corpus in jieba keyword extraction
* Usage `jieba.analyse.set_idf_path(file_name) # file_name is the path for the custom corpus`
* Custom Corpus Samplehttps://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
* Sample Codehttps://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
Developers can specify their own custom stop words corpus in jieba keyword extraction
* Usage `jieba.analyse.set_stop_words(file_name) # file_name is the path for the custom corpus`
* Custom Corpus Samplehttps://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
* Sample Codehttps://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
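A hedged sketch combining the two switches above (the corpus paths are placeholders; any UTF-8 file in the documented format works):

```python
import jieba.analyse

# Point keyword extraction at custom IDF and stop-word corpora (paths are placeholders)
jieba.analyse.set_idf_path("extra_dict/idf.txt.big")
jieba.analyse.set_stop_words("extra_dict/stop_words.txt")

text = "吉林欧亚置业有限公司是一家注册于长春的房地产开发企业"
print(jieba.analyse.extract_tags(text, topK=5))
```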
There's also a [TextRank](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) implementation available.
Use: `jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))`
Note that it filters POS by default.
`jieba.analyse.TextRank()` creates a new TextRank instance.
4. Part of Speech Tagging
-------------------------
* `jieba.posseg.POSTokenizer(tokenizer=None)` creates a new customized Tokenizer. `tokenizer` specifies the jieba.Tokenizer to internally use. `jieba.posseg.dt` is the default POSTokenizer.
* Tags the POS of each word after segmentation, using labels compatible with ictclas.
* Example:
```pycon
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for w in words:
... print('%s %s' % (w.word, w.flag))
...
我 r
爱 v
北京 ns
天安门 ns
```
5. Parallel Processing
----------------------
* Principle: Split target text by line, assign the lines into multiple Python processes, and then merge the results, which is considerably faster.
* Based on the multiprocessing module of Python.
* Usage:
* `jieba.enable_parallel(4)` # Enable parallel processing. The parameter is the number of processes.
* `jieba.disable_parallel()` # Disable parallel processing.
* Example:
https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
* Result: On a four-core 3.4GHz Linux machine, do accurate word segmentation on Complete Works of Jin Yong, and the speed reaches 1MB/s, which is 3.3 times faster than the single-process version.
* **Note** that parallel processing supports only default tokenizers, `jieba.dt` and `jieba.posseg.dt`.
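A rough sketch of the typical pattern (the input file name is a placeholder; the actual speed-up depends on the machine):

```python
import time
import jieba

jieba.enable_parallel(4)          # POSIX only; 4 worker processes

content = open('news.txt', 'rb').read().decode('utf-8')   # placeholder input file
t1 = time.time()
words = list(jieba.cut(content))
print('%d tokens in %.2f seconds' % (len(words), time.time() - t1))

jieba.disable_parallel()          # back to the single-process implementation
```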
6. Tokenize: return words with position
----------------------------------------
* The input must be unicode
* Default mode
```python
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
```
```
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限公司 start: 6 end:10
```
* Search mode
```python
result = jieba.tokenize(u'永和服装饰品有限公司',mode='search')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
```
```
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限 start: 6 end:8
word 公司 start: 8 end:10
word 有限公司 start: 6 end:10
```
7. ChineseAnalyzer for Whoosh
-------------------------------
* `from jieba.analyse import ChineseAnalyzer`
* Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
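A minimal Whoosh sketch using the analyzer (the index directory, schema fields and sample document are assumptions of this example):

```python
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser
from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()
schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True, analyzer=analyzer))

if not os.path.exists("tmp_idx"):              # index directory is an assumption
    os.mkdir("tmp_idx")
ix = create_in("tmp_idx", schema)

writer = ix.writer()
writer.add_document(title="doc1", content="买水果然后来世博园。")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("水果")
    for hit in searcher.search(query):
        print(hit["title"])
```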
8. Command Line Interface
--------------------------------
$> python -m jieba --help
Jieba command line interface.
positional arguments:
filename input file
optional arguments:
-h, --help show this help message and exit
-d [DELIM], --delimiter [DELIM]
use DELIM instead of ' / ' for word delimiter; or a
space if it is used without DELIM
-p [DELIM], --pos [DELIM]
enable POS tagging; if DELIM is specified, use DELIM
instead of '_' for POS delimiter
-D DICT, --dict DICT use DICT as dictionary
-u USER_DICT, --user-dict USER_DICT
use USER_DICT together with the default dictionary or
DICT (if specified)
-a, --cut-all full pattern cutting (ignored with POS tagging)
-n, --no-hmm don't use the Hidden Markov Model
-q, --quiet don't print loading messages to stderr
-V, --version show program's version number and exit
If no filename specified, use STDIN instead.
Initialization
---------------
By default, Jieba doesn't build the prefix dictionary unless it's necessary. This takes 1-3 seconds, after which it is not initialized again. If you want to initialize Jieba manually, you can call:
import jieba
jieba.initialize() # (optional)
You can also specify the dictionary (not supported before version 0.28) :
jieba.set_dictionary('data/dict.txt.big')
Using Other Dictionaries
===========================
It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:
1. A smaller dictionary for a smaller memory footprint:
https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
2. There is also a bigger dictionary that has better support for traditional Chinese (繁體):
https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
By default, an in-between dictionary is used, called `dict.txt` and included in the distribution.
In either case, download the file you want, and then call `jieba.set_dictionary('data/dict.txt.big')` or just replace the existing `dict.txt`.
Segmentation speed
=========
* 400 KB / Second in Default Mode
* Test Env: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz;《围城》.txt
Online demo
=========
http://209.222.69.242:9000/

584429
extra_dict/dict.txt.big Normal file

File diff suppressed because it is too large Load Diff

109750
extra_dict/dict.txt.small Normal file

File diff suppressed because it is too large Load Diff

176239
extra_dict/idf.txt.big Normal file

File diff suppressed because it is too large Load Diff

51
extra_dict/stop_words.txt Normal file
View File

@ -0,0 +1,51 @@
the
of
is
and
to
in
that
we
for
an
are
by
be
as
on
with
can
if
from
which
you
it
this
then
at
have
all
not
one
has
or
that
一個
沒有
我們
你們
妳們
他們
她們
是否

View File

@ -1,190 +1,619 @@
import re
import math
import os,sys
import pprint
import finalseg
import time
import tempfile
from __future__ import absolute_import, unicode_literals
__version__ = '0.42.1'
__license__ = 'MIT'
import marshal
import re
import tempfile
import threading
import time
from hashlib import md5
from math import log
FREQ = {}
total =0.0
from . import finalseg
from ._compat import *
def gen_trie(f_name):
lfreq = {}
trie = {}
ltotal = 0.0
content = open(f_name,'rb').read().decode('utf-8')
for line in content.split("\n"):
word,freq,_ = line.split(" ")
freq = float(freq)
lfreq[word] = freq
ltotal+=freq
p = trie
for c in word:
if not c in p:
p[c] ={}
p = p[c]
p['']='' #ending flag
return trie, lfreq,ltotal
if os.name == 'nt':
from shutil import move as _replace_file
else:
_replace_file = os.rename
_get_abs_path = lambda path: os.path.normpath(os.path.join(os.getcwd(), path))
DEFAULT_DICT = None
DEFAULT_DICT_NAME = "dict.txt"
log_console = logging.StreamHandler(sys.stderr)
default_logger = logging.getLogger(__name__)
default_logger.setLevel(logging.DEBUG)
default_logger.addHandler(log_console)
DICT_WRITING = {}
pool = None
re_userdict = re.compile('^(.+?)( [0-9]+)?( [a-z]+)?$', re.U)
re_eng = re.compile('[a-zA-Z0-9]', re.U)
# \u4E00-\u9FD5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
# \r\n|\s : whitespace characters. Will not be handled.
# re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
# Adding "-" symbol in re_han_default
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)
re_skip_default = re.compile("(\r\n|\s)", re.U)
_curpath=os.path.normpath( os.path.join( os.getcwd(), os.path.dirname(__file__) ) )
print >> sys.stderr, "Building Trie..."
t1 = time.time()
cache_file = os.path.join(tempfile.gettempdir(),"jieba.cache")
load_from_cache_fail = True
if os.path.exists(cache_file) and os.path.getmtime(cache_file)>os.path.getmtime(os.path.join(_curpath,"dict.txt")):
print >> sys.stderr, "loading model from cache"
try:
trie,FREQ,total,min_freq = marshal.load(open(cache_file,'rb'))
load_from_cache_fail = False
except:
load_from_cache_fail = True
if load_from_cache_fail:
trie,FREQ,total = gen_trie(os.path.join(_curpath,"dict.txt"))
FREQ = dict([(k,float(v)/total) for k,v in FREQ.iteritems()]) #normalize
min_freq = min(FREQ.itervalues())
print >> sys.stderr, "dumping model to file cache"
marshal.dump((trie,FREQ,total,min_freq),open(cache_file,'wb'))
print >> sys.stderr, "loading model cost ", time.time() - t1, "seconds."
print >> sys.stderr, "Trie has been built succesfully."
def setLogLevel(log_level):
default_logger.setLevel(log_level)
def __cut_all(sentence):
dag = get_DAG(sentence)
old_j = -1
for k,L in dag.iteritems():
if len(L)==1 and k>old_j:
yield sentence[k:L[0]+1]
old_j = L[0]
else:
for j in L:
if j>k:
yield sentence[k:j+1]
old_j = j
class Tokenizer(object):
def calc(sentence,DAG,idx,route):
N = len(sentence)
route[N] = (1.0,'')
for idx in xrange(N-1,-1,-1):
candidates = [ ( FREQ.get(sentence[idx:x+1],min_freq) * route[x+1][0],x ) for x in DAG[idx] ]
route[idx] = max(candidates)
def __init__(self, dictionary=DEFAULT_DICT):
self.lock = threading.RLock()
if dictionary == DEFAULT_DICT:
self.dictionary = dictionary
else:
self.dictionary = _get_abs_path(dictionary)
self.FREQ = {}
self.total = 0
self.user_word_tag_tab = {}
self.initialized = False
self.tmp_dir = None
self.cache_file = None
def get_DAG(sentence):
N = len(sentence)
i,j=0,0
p = trie
DAG = {}
while i<N:
c = sentence[j]
if c in p:
p = p[c]
if '' in p:
if not i in DAG:
DAG[i]=[]
DAG[i].append(j)
j+=1
if j>=N:
i+=1
j=i
p=trie
else:
p = trie
i+=1
j=i
for i in xrange(len(sentence)):
if not i in DAG:
DAG[i] =[i]
return DAG
def __repr__(self):
return '<Tokenizer dictionary=%r>' % self.dictionary
def __cut_DAG(sentence):
DAG = get_DAG(sentence)
route ={}
calc(sentence,DAG,0,route=route)
x = 0
buf =u''
N = len(sentence)
while x<N:
y = route[x][1]+1
l_word = sentence[x:y]
if y-x==1:
buf+= l_word
else:
if len(buf)>0:
if len(buf)==1:
yield buf
buf=u''
else:
regognized = finalseg.__cut(buf)
for t in regognized:
yield t
buf=u''
yield l_word
x =y
@staticmethod
def gen_pfdict(f):
lfreq = {}
ltotal = 0
f_name = resolve_filename(f)
for lineno, line in enumerate(f, 1):
try:
line = line.strip().decode('utf-8')
word, freq = line.split(' ')[:2]
freq = int(freq)
lfreq[word] = freq
ltotal += freq
for ch in xrange(len(word)):
wfrag = word[:ch + 1]
if wfrag not in lfreq:
lfreq[wfrag] = 0
except ValueError:
raise ValueError(
'invalid dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
f.close()
return lfreq, ltotal
if len(buf)>0:
if len(buf)==1:
yield buf
else:
regognized = finalseg.__cut(buf)
for t in regognized:
yield t
def initialize(self, dictionary=None):
if dictionary:
abs_path = _get_abs_path(dictionary)
if self.dictionary == abs_path and self.initialized:
return
else:
self.dictionary = abs_path
self.initialized = False
else:
abs_path = self.dictionary
with self.lock:
try:
with DICT_WRITING[abs_path]:
pass
except KeyError:
pass
if self.initialized:
return
default_logger.debug("Building prefix dict from %s ..." % (abs_path or 'the default dictionary'))
t1 = time.time()
if self.cache_file:
cache_file = self.cache_file
# default dictionary
elif abs_path == DEFAULT_DICT:
cache_file = "jieba.cache"
# custom dictionary
else:
cache_file = "jieba.u%s.cache" % md5(
abs_path.encode('utf-8', 'replace')).hexdigest()
cache_file = os.path.join(
self.tmp_dir or tempfile.gettempdir(), cache_file)
# prevent absolute path in self.cache_file
tmpdir = os.path.dirname(cache_file)
load_from_cache_fail = True
if os.path.isfile(cache_file) and (abs_path == DEFAULT_DICT or
os.path.getmtime(cache_file) > os.path.getmtime(abs_path)):
default_logger.debug(
"Loading model from cache %s" % cache_file)
try:
with open(cache_file, 'rb') as cf:
self.FREQ, self.total = marshal.load(cf)
load_from_cache_fail = False
except Exception:
load_from_cache_fail = True
if load_from_cache_fail:
wlock = DICT_WRITING.get(abs_path, threading.RLock())
DICT_WRITING[abs_path] = wlock
with wlock:
self.FREQ, self.total = self.gen_pfdict(self.get_dict_file())
default_logger.debug(
"Dumping model to file cache %s" % cache_file)
try:
# prevent moving across different filesystems
fd, fpath = tempfile.mkstemp(dir=tmpdir)
with os.fdopen(fd, 'wb') as temp_cache_file:
marshal.dump(
(self.FREQ, self.total), temp_cache_file)
_replace_file(fpath, cache_file)
except Exception:
default_logger.exception("Dump cache file failed.")
try:
del DICT_WRITING[abs_path]
except KeyError:
pass
self.initialized = True
default_logger.debug(
"Loading model cost %.3f seconds." % (time.time() - t1))
default_logger.debug("Prefix dict has been built successfully.")
def check_initialized(self):
if not self.initialized:
self.initialize()
def calc(self, sentence, DAG, route):
N = len(sentence)
route[N] = (0, 0)
logtotal = log(self.total)
for idx in xrange(N - 1, -1, -1):
route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
logtotal + route[x + 1][0], x) for x in DAG[idx])
def get_DAG(self, sentence):
self.check_initialized()
DAG = {}
N = len(sentence)
for k in xrange(N):
tmplist = []
i = k
frag = sentence[k]
while i < N and frag in self.FREQ:
if self.FREQ[frag]:
tmplist.append(i)
i += 1
frag = sentence[k:i + 1]
if not tmplist:
tmplist.append(k)
DAG[k] = tmplist
return DAG
def __cut_all(self, sentence):
dag = self.get_DAG(sentence)
old_j = -1
eng_scan = 0
eng_buf = u''
for k, L in iteritems(dag):
if eng_scan == 1 and not re_eng.match(sentence[k]):
eng_scan = 0
yield eng_buf
if len(L) == 1 and k > old_j:
word = sentence[k:L[0] + 1]
if re_eng.match(word):
if eng_scan == 0:
eng_scan = 1
eng_buf = word
else:
eng_buf += word
if eng_scan == 0:
yield word
old_j = L[0]
else:
for j in L:
if j > k:
yield sentence[k:j + 1]
old_j = j
if eng_scan == 1:
yield eng_buf
def __cut_DAG_NO_HMM(self, sentence):
DAG = self.get_DAG(sentence)
route = {}
self.calc(sentence, DAG, route)
x = 0
N = len(sentence)
buf = ''
while x < N:
y = route[x][1] + 1
l_word = sentence[x:y]
if re_eng.match(l_word) and len(l_word) == 1:
buf += l_word
x = y
else:
if buf:
yield buf
buf = ''
yield l_word
x = y
if buf:
yield buf
buf = ''
def __cut_DAG(self, sentence):
DAG = self.get_DAG(sentence)
route = {}
self.calc(sentence, DAG, route)
x = 0
buf = ''
N = len(sentence)
while x < N:
y = route[x][1] + 1
l_word = sentence[x:y]
if y - x == 1:
buf += l_word
else:
if buf:
if len(buf) == 1:
yield buf
buf = ''
else:
if not self.FREQ.get(buf):
recognized = finalseg.cut(buf)
for t in recognized:
yield t
else:
for elem in buf:
yield elem
buf = ''
yield l_word
x = y
if buf:
if len(buf) == 1:
yield buf
elif not self.FREQ.get(buf):
recognized = finalseg.cut(buf)
for t in recognized:
yield t
else:
for elem in buf:
yield elem
def cut(self, sentence, cut_all=False, HMM=True, use_paddle=False):
"""
The main function that segments an entire sentence that contains
Chinese characters into separated words.
Parameter:
- sentence: The str(unicode) to be segmented.
- cut_all: Model type. True for full pattern, False for accurate pattern.
- HMM: Whether to use the Hidden Markov Model.
"""
is_paddle_installed = check_paddle_install['is_paddle_installed']
sentence = strdecode(sentence)
if use_paddle and is_paddle_installed:
# if sentence is null, it will raise core exception in paddle.
if sentence is None or len(sentence) == 0:
return
import jieba.lac_small.predict as predict
results = predict.get_sent(sentence)
for sent in results:
if sent is None:
continue
yield sent
return
re_han = re_han_default
re_skip = re_skip_default
if cut_all:
cut_block = self.__cut_all
elif HMM:
cut_block = self.__cut_DAG
else:
cut_block = self.__cut_DAG_NO_HMM
blocks = re_han.split(sentence)
for blk in blocks:
if not blk:
continue
if re_han.match(blk):
for word in cut_block(blk):
yield word
else:
tmp = re_skip.split(blk)
for x in tmp:
if re_skip.match(x):
yield x
elif not cut_all:
for xx in x:
yield xx
else:
yield x
def cut_for_search(self, sentence, HMM=True):
"""
Finer segmentation for search engines.
"""
words = self.cut(sentence, HMM=HMM)
for w in words:
if len(w) > 2:
for i in xrange(len(w) - 1):
gram2 = w[i:i + 2]
if self.FREQ.get(gram2):
yield gram2
if len(w) > 3:
for i in xrange(len(w) - 2):
gram3 = w[i:i + 3]
if self.FREQ.get(gram3):
yield gram3
yield w
def lcut(self, *args, **kwargs):
return list(self.cut(*args, **kwargs))
def lcut_for_search(self, *args, **kwargs):
return list(self.cut_for_search(*args, **kwargs))
_lcut = lcut
_lcut_for_search = lcut_for_search
def _lcut_no_hmm(self, sentence):
return self.lcut(sentence, False, False)
def _lcut_all(self, sentence):
return self.lcut(sentence, True)
def _lcut_for_search_no_hmm(self, sentence):
return self.lcut_for_search(sentence, False)
def get_dict_file(self):
if self.dictionary == DEFAULT_DICT:
return get_module_res(DEFAULT_DICT_NAME)
else:
return open(self.dictionary, 'rb')
def load_userdict(self, f):
'''
Load personalized dict to improve detect rate.
Parameter:
- f : A plain text file that contains words and their occurrences.
Can be a file-like object, or the path of the dictionary file,
whose encoding must be utf-8.
Structure of dict file:
word1 freq1 word_type1
word2 freq2 word_type2
...
Word type may be ignored
'''
self.check_initialized()
if isinstance(f, string_types):
f_name = f
f = open(f, 'rb')
else:
f_name = resolve_filename(f)
for lineno, ln in enumerate(f, 1):
line = ln.strip()
if not isinstance(line, text_type):
try:
line = line.decode('utf-8').lstrip('\ufeff')
except UnicodeDecodeError:
raise ValueError('dictionary file %s must be utf-8' % f_name)
if not line:
continue
# match won't be None because there's at least one character
word, freq, tag = re_userdict.match(line).groups()
if freq is not None:
freq = freq.strip()
if tag is not None:
tag = tag.strip()
self.add_word(word, freq, tag)
def add_word(self, word, freq=None, tag=None):
"""
Add a word to dictionary.
freq and tag can be omitted, freq defaults to be a calculated value
that ensures the word can be cut out.
"""
self.check_initialized()
word = strdecode(word)
freq = int(freq) if freq is not None else self.suggest_freq(word, False)
self.FREQ[word] = freq
self.total += freq
if tag:
self.user_word_tag_tab[word] = tag
for ch in xrange(len(word)):
wfrag = word[:ch + 1]
if wfrag not in self.FREQ:
self.FREQ[wfrag] = 0
if freq == 0:
finalseg.add_force_split(word)
def del_word(self, word):
"""
Convenient function for deleting a word.
"""
self.add_word(word, 0)
def suggest_freq(self, segment, tune=False):
"""
Suggest word frequency to force the characters in a word to be
joined or splitted.
Parameter:
- segment : The segments that the word is expected to be cut into,
If the word should be treated as a whole, use a str.
- tune : If True, tune the word frequency.
Note that HMM may affect the final result. If the result doesn't change,
set HMM=False.
"""
self.check_initialized()
ftotal = float(self.total)
freq = 1
if isinstance(segment, string_types):
word = segment
for seg in self.cut(word, HMM=False):
freq *= self.FREQ.get(seg, 1) / ftotal
freq = max(int(freq * self.total) + 1, self.FREQ.get(word, 1))
else:
segment = tuple(map(strdecode, segment))
word = ''.join(segment)
for seg in segment:
freq *= self.FREQ.get(seg, 1) / ftotal
freq = min(int(freq * self.total), self.FREQ.get(word, 0))
if tune:
self.add_word(word, freq)
return freq
def tokenize(self, unicode_sentence, mode="default", HMM=True):
"""
Tokenize a sentence and yields tuples of (word, start, end)
Parameter:
- sentence: the str(unicode) to be segmented.
- mode: "default" or "search", "search" is for finer segmentation.
- HMM: whether to use the Hidden Markov Model.
"""
if not isinstance(unicode_sentence, text_type):
raise ValueError("jieba: the input parameter should be unicode.")
start = 0
if mode == 'default':
for w in self.cut(unicode_sentence, HMM=HMM):
width = len(w)
yield (w, start, start + width)
start += width
else:
for w in self.cut(unicode_sentence, HMM=HMM):
width = len(w)
if len(w) > 2:
for i in xrange(len(w) - 1):
gram2 = w[i:i + 2]
if self.FREQ.get(gram2):
yield (gram2, start + i, start + i + 2)
if len(w) > 3:
for i in xrange(len(w) - 2):
gram3 = w[i:i + 3]
if self.FREQ.get(gram3):
yield (gram3, start + i, start + i + 3)
yield (w, start, start + width)
start += width
def set_dictionary(self, dictionary_path):
with self.lock:
abs_path = _get_abs_path(dictionary_path)
if not os.path.isfile(abs_path):
raise Exception("jieba: file does not exist: " + abs_path)
self.dictionary = abs_path
self.initialized = False
def cut(sentence,cut_all=False):
if not ( type(sentence) is unicode):
try:
sentence = sentence.decode('utf-8')
except:
sentence = sentence.decode('gbk','ignore')
re_han, re_skip = re.compile(ur"([\u4E00-\u9FA5]+)"), re.compile(ur"[^a-zA-Z0-9+#\n]")
blocks = re_han.split(sentence)
cut_block = __cut_DAG
if cut_all:
cut_block = __cut_all
for blk in blocks:
if re_han.match(blk):
#pprint.pprint(__cut_DAG(blk))
for word in cut_block(blk):
yield word
else:
tmp = re_skip.split(blk)
for x in tmp:
if x!="":
yield x
# default Tokenizer instance
def cut_for_search(sentence):
words = cut(sentence)
for w in words:
if len(w)>2:
for i in xrange(len(w)-1):
gram2 = w[i:i+2]
if gram2 in FREQ:
yield gram2
if len(w)>3:
for i in xrange(len(w)-2):
gram3 = w[i:i+3]
if gram3 in FREQ:
yield gram3
yield w
dt = Tokenizer()
def load_userdict(f):
global trie,total,FREQ
if isinstance(f, (str, unicode)):
f = open(f, 'rb')
content = f.read().decode('utf-8')
for line in content.split("\n"):
if line.rstrip()=='': continue
word,freq = line.split(" ")
freq = float(freq)
FREQ[word] = freq / total
p = trie
for c in word:
if not c in p:
p[c] ={}
p = p[c]
p['']='' #ending flag
# global functions
get_FREQ = lambda k, d=None: dt.FREQ.get(k, d)
add_word = dt.add_word
calc = dt.calc
cut = dt.cut
lcut = dt.lcut
cut_for_search = dt.cut_for_search
lcut_for_search = dt.lcut_for_search
del_word = dt.del_word
get_DAG = dt.get_DAG
get_dict_file = dt.get_dict_file
initialize = dt.initialize
load_userdict = dt.load_userdict
set_dictionary = dt.set_dictionary
suggest_freq = dt.suggest_freq
tokenize = dt.tokenize
user_word_tag_tab = dt.user_word_tag_tab
def _lcut_all(s):
return dt._lcut_all(s)
def _lcut(s):
return dt._lcut(s)
def _lcut_no_hmm(s):
return dt._lcut_no_hmm(s)
def _lcut_all(s):
return dt._lcut_all(s)
def _lcut_for_search(s):
return dt._lcut_for_search(s)
def _lcut_for_search_no_hmm(s):
return dt._lcut_for_search_no_hmm(s)
def _pcut(sentence, cut_all=False, HMM=True):
parts = strdecode(sentence).splitlines(True)
if cut_all:
result = pool.map(_lcut_all, parts)
elif HMM:
result = pool.map(_lcut, parts)
else:
result = pool.map(_lcut_no_hmm, parts)
for r in result:
for w in r:
yield w
def _pcut_for_search(sentence, HMM=True):
parts = strdecode(sentence).splitlines(True)
if HMM:
result = pool.map(_lcut_for_search, parts)
else:
result = pool.map(_lcut_for_search_no_hmm, parts)
for r in result:
for w in r:
yield w
def enable_parallel(processnum=None):
"""
Change the module's `cut` and `cut_for_search` functions to the
parallel version.
Note that this only works using dt, custom Tokenizer
instances are not supported.
"""
global pool, dt, cut, cut_for_search
from multiprocessing import cpu_count
if os.name == 'nt':
raise NotImplementedError(
"jieba: parallel mode only supports posix system")
else:
from multiprocessing import Pool
dt.check_initialized()
if processnum is None:
processnum = cpu_count()
pool = Pool(processnum)
cut = _pcut
cut_for_search = _pcut_for_search
def disable_parallel():
global pool, dt, cut, cut_for_search
if pool:
pool.close()
pool = None
cut = dt.cut
cut_for_search = dt.cut_for_search

61
jieba/__main__.py Normal file
View File

@ -0,0 +1,61 @@
"""Jieba command line interface."""
import sys
import jieba
from argparse import ArgumentParser
from ._compat import *
parser = ArgumentParser(usage="%s -m jieba [options] filename" % sys.executable, description="Jieba command line interface.", epilog="If no filename specified, use STDIN instead.")
parser.add_argument("-d", "--delimiter", metavar="DELIM", default=' / ',
nargs='?', const=' ',
help="use DELIM instead of ' / ' for word delimiter; or a space if it is used without DELIM")
parser.add_argument("-p", "--pos", metavar="DELIM", nargs='?', const='_',
help="enable POS tagging; if DELIM is specified, use DELIM instead of '_' for POS delimiter")
parser.add_argument("-D", "--dict", help="use DICT as dictionary")
parser.add_argument("-u", "--user-dict",
help="use USER_DICT together with the default dictionary or DICT (if specified)")
parser.add_argument("-a", "--cut-all",
action="store_true", dest="cutall", default=False,
help="full pattern cutting (ignored with POS tagging)")
parser.add_argument("-n", "--no-hmm", dest="hmm", action="store_false",
default=True, help="don't use the Hidden Markov Model")
parser.add_argument("-q", "--quiet", action="store_true", default=False,
help="don't print loading messages to stderr")
parser.add_argument("-V", '--version', action='version',
version="Jieba " + jieba.__version__)
parser.add_argument("filename", nargs='?', help="input file")
args = parser.parse_args()
if args.quiet:
jieba.setLogLevel(60)
if args.pos:
import jieba.posseg
posdelim = args.pos
def cutfunc(sentence, _, HMM=True):
for w, f in jieba.posseg.cut(sentence, HMM):
yield w + posdelim + f
else:
cutfunc = jieba.cut
delim = text_type(args.delimiter)
cutall = args.cutall
hmm = args.hmm
fp = open(args.filename, 'r') if args.filename else sys.stdin
if args.dict:
jieba.initialize(args.dict)
else:
jieba.initialize()
if args.user_dict:
jieba.load_userdict(args.user_dict)
ln = fp.readline()
while ln:
l = ln.rstrip('\r\n')
result = delim.join(cutfunc(ln.rstrip('\r\n'), cutall, hmm))
if PY2:
result = result.encode(default_encoding)
print(result)
ln = fp.readline()
fp.close()

89
jieba/_compat.py Normal file
View File

@ -0,0 +1,89 @@
# -*- coding: utf-8 -*-
import logging
import os
import sys
log_console = logging.StreamHandler(sys.stderr)
default_logger = logging.getLogger(__name__)
default_logger.setLevel(logging.DEBUG)
def setLogLevel(log_level):
default_logger.setLevel(log_level)
check_paddle_install = {'is_paddle_installed': False}
try:
import pkg_resources
get_module_res = lambda *res: pkg_resources.resource_stream(__name__,
os.path.join(*res))
except ImportError:
get_module_res = lambda *res: open(os.path.normpath(os.path.join(
os.getcwd(), os.path.dirname(__file__), *res)), 'rb')
def enable_paddle():
try:
import paddle
except ImportError:
default_logger.debug("Installing paddle-tiny, please wait a minute......")
os.system("pip install paddlepaddle-tiny")
try:
import paddle
except ImportError:
default_logger.debug(
"Import paddle error, please use command to install: pip install paddlepaddle-tiny==1.6.1."
"Now, back to jieba basic cut......")
if paddle.__version__ < '1.6.1':
default_logger.debug("Find your own paddle version doesn't satisfy the minimum requirement (1.6.1), "
"please install paddle tiny by 'pip install --upgrade paddlepaddle-tiny', "
"or upgrade paddle full version by "
"'pip install --upgrade paddlepaddle (-gpu for GPU version)' ")
else:
try:
import jieba.lac_small.predict as predict
default_logger.debug("Paddle enabled successfully......")
check_paddle_install['is_paddle_installed'] = True
except ImportError:
default_logger.debug("Import error, cannot find paddle.fluid and jieba.lac_small.predict module. "
"Now, back to jieba basic cut......")
PY2 = sys.version_info[0] == 2
default_encoding = sys.getfilesystemencoding()
if PY2:
text_type = unicode
string_types = (str, unicode)
iterkeys = lambda d: d.iterkeys()
itervalues = lambda d: d.itervalues()
iteritems = lambda d: d.iteritems()
else:
text_type = str
string_types = (str,)
xrange = range
iterkeys = lambda d: iter(d.keys())
itervalues = lambda d: iter(d.values())
iteritems = lambda d: iter(d.items())
def strdecode(sentence):
if not isinstance(sentence, text_type):
try:
sentence = sentence.decode('utf-8')
except UnicodeDecodeError:
sentence = sentence.decode('gbk', 'ignore')
return sentence
def resolve_filename(f):
try:
return f.name
except AttributeError:
return repr(f)

42
jieba/analyse/__init__.py Normal file → Executable file
View File

@ -1,30 +1,18 @@
import jieba
import os
from __future__ import absolute_import
from .tfidf import TFIDF
from .textrank import TextRank
try:
from .analyzer import ChineseAnalyzer
except ImportError:
pass
_curpath=os.path.normpath( os.path.join( os.getcwd(), os.path.dirname(__file__) ) )
f_name = os.path.join(_curpath,"idf.txt")
content = open(f_name,'rb').read().decode('utf-8')
default_tfidf = TFIDF()
default_textrank = TextRank()
idf_freq = {}
lines = content.split('\n')
for line in lines:
word,freq = line.split(' ')
idf_freq[word] = float(freq)
max_idf = max(idf_freq.values())
def extract_tags(sentence,topK=20):
words = jieba.cut(sentence)
freq = {}
for w in words:
if len(w.strip())<2: continue
freq[w]=freq.get(w,0.0)+1.0
total = sum(freq.values())
freq = [(k,v/total) for k,v in freq.iteritems()]
tf_idf_list = [(v * idf_freq.get(k,max_idf),k) for k,v in freq]
st_list = sorted(tf_idf_list,reverse=True)
top_tuples= st_list[:topK]
tags = [a[1] for a in top_tuples]
return tags
extract_tags = tfidf = default_tfidf.extract_tags
set_idf_path = default_tfidf.set_idf_path
textrank = default_textrank.extract_tags
def set_stop_words(stop_words_path):
default_tfidf.set_stop_words(stop_words_path)
default_textrank.set_stop_words(stop_words_path)

37
jieba/analyse/analyzer.py Normal file
View File

@ -0,0 +1,37 @@
# encoding=utf-8
from __future__ import unicode_literals
from whoosh.analysis import RegexAnalyzer, LowercaseFilter, StopFilter, StemFilter
from whoosh.analysis import Tokenizer, Token
from whoosh.lang.porter import stem
import jieba
import re
STOP_WORDS = frozenset(('a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'can',
'for', 'from', 'have', 'if', 'in', 'is', 'it', 'may',
'not', 'of', 'on', 'or', 'tbd', 'that', 'the', 'this',
'to', 'us', 'we', 'when', 'will', 'with', 'yet',
'you', 'your', '', '', ''))
accepted_chars = re.compile(r"[\u4E00-\u9FD5]+")
class ChineseTokenizer(Tokenizer):
def __call__(self, text, **kargs):
words = jieba.tokenize(text, mode="search")
token = Token()
for (w, start_pos, stop_pos) in words:
if not accepted_chars.match(w) and len(w) <= 1:
continue
token.original = token.text = w
token.pos = start_pos
token.startchar = start_pos
token.endchar = stop_pos
yield token
def ChineseAnalyzer(stoplist=STOP_WORDS, minsize=1, stemfn=stem, cachesize=50000):
return (ChineseTokenizer() | LowercaseFilter() |
StopFilter(stoplist=stoplist, minsize=minsize) |
StemFilter(stemfn=stemfn, ignore=None, cachesize=cachesize))

110
jieba/analyse/textrank.py Normal file
View File

@ -0,0 +1,110 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
import sys
from operator import itemgetter
from collections import defaultdict
import jieba.posseg
from .tfidf import KeywordExtractor
from .._compat import *
class UndirectWeightedGraph:
d = 0.85
def __init__(self):
self.graph = defaultdict(list)
def addEdge(self, start, end, weight):
# use a tuple (start, end, weight) instead of a Edge object
self.graph[start].append((start, end, weight))
self.graph[end].append((end, start, weight))
def rank(self):
ws = defaultdict(float)
outSum = defaultdict(float)
wsdef = 1.0 / (len(self.graph) or 1.0)
for n, out in self.graph.items():
ws[n] = wsdef
outSum[n] = sum((e[2] for e in out), 0.0)
# this line for build stable iteration
sorted_keys = sorted(self.graph.keys())
for x in xrange(10): # 10 iters
for n in sorted_keys:
s = 0
for e in self.graph[n]:
s += e[2] / outSum[e[1]] * ws[e[1]]
ws[n] = (1 - self.d) + self.d * s
(min_rank, max_rank) = (sys.float_info[0], sys.float_info[3])
for w in itervalues(ws):
if w < min_rank:
min_rank = w
if w > max_rank:
max_rank = w
for n, w in ws.items():
# to unify the weights, don't *100.
ws[n] = (w - min_rank / 10.0) / (max_rank - min_rank / 10.0)
return ws
class TextRank(KeywordExtractor):
def __init__(self):
self.tokenizer = self.postokenizer = jieba.posseg.dt
self.stop_words = self.STOP_WORDS.copy()
self.pos_filt = frozenset(('ns', 'n', 'vn', 'v'))
self.span = 5
def pairfilter(self, wp):
return (wp.flag in self.pos_filt and len(wp.word.strip()) >= 2
and wp.word.lower() not in self.stop_words)
def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False):
"""
Extract keywords from sentence using TextRank algorithm.
Parameter:
- topK: return how many top keywords. `None` for all possible words.
- withWeight: if True, return a list of (word, weight);
if False, return a list of words.
- allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
if the POS of w is not in this list, it will be filtered.
- withFlag: if True, return a list of pair(word, weight) like posseg.cut
if False, return a list of words
"""
self.pos_filt = frozenset(allowPOS)
g = UndirectWeightedGraph()
cm = defaultdict(int)
words = tuple(self.tokenizer.cut(sentence))
for i, wp in enumerate(words):
if self.pairfilter(wp):
for j in xrange(i + 1, i + self.span):
if j >= len(words):
break
if not self.pairfilter(words[j]):
continue
if allowPOS and withFlag:
cm[(wp, words[j])] += 1
else:
cm[(wp.word, words[j].word)] += 1
for terms, w in cm.items():
g.addEdge(terms[0], terms[1], w)
nodes_rank = g.rank()
if withWeight:
tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
else:
tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)
if topK:
return tags[:topK]
else:
return tags
extract_tags = textrank

116
jieba/analyse/tfidf.py Executable file
View File

@ -0,0 +1,116 @@
# encoding=utf-8
from __future__ import absolute_import
import os
import jieba
import jieba.posseg
from operator import itemgetter
_get_module_path = lambda path: os.path.normpath(os.path.join(os.getcwd(),
os.path.dirname(__file__), path))
_get_abs_path = jieba._get_abs_path
DEFAULT_IDF = _get_module_path("idf.txt")
class KeywordExtractor(object):
STOP_WORDS = set((
"the", "of", "is", "and", "to", "in", "that", "we", "for", "an", "are",
"by", "be", "as", "on", "with", "can", "if", "from", "which", "you", "it",
"this", "then", "at", "have", "all", "not", "one", "has", "or", "that"
))
def set_stop_words(self, stop_words_path):
abs_path = _get_abs_path(stop_words_path)
if not os.path.isfile(abs_path):
raise Exception("jieba: file does not exist: " + abs_path)
content = open(abs_path, 'rb').read().decode('utf-8')
for line in content.splitlines():
self.stop_words.add(line)
def extract_tags(self, *args, **kwargs):
raise NotImplementedError
class IDFLoader(object):
def __init__(self, idf_path=None):
self.path = ""
self.idf_freq = {}
self.median_idf = 0.0
if idf_path:
self.set_new_path(idf_path)
def set_new_path(self, new_idf_path):
if self.path != new_idf_path:
self.path = new_idf_path
content = open(new_idf_path, 'rb').read().decode('utf-8')
self.idf_freq = {}
for line in content.splitlines():
word, freq = line.strip().split(' ')
self.idf_freq[word] = float(freq)
self.median_idf = sorted(
self.idf_freq.values())[len(self.idf_freq) // 2]
def get_idf(self):
return self.idf_freq, self.median_idf
class TFIDF(KeywordExtractor):
def __init__(self, idf_path=None):
self.tokenizer = jieba.dt
self.postokenizer = jieba.posseg.dt
self.stop_words = self.STOP_WORDS.copy()
self.idf_loader = IDFLoader(idf_path or DEFAULT_IDF)
self.idf_freq, self.median_idf = self.idf_loader.get_idf()
def set_idf_path(self, idf_path):
new_abs_path = _get_abs_path(idf_path)
if not os.path.isfile(new_abs_path):
raise Exception("jieba: file does not exist: " + new_abs_path)
self.idf_loader.set_new_path(new_abs_path)
self.idf_freq, self.median_idf = self.idf_loader.get_idf()
def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False):
"""
Extract keywords from sentence using TF-IDF algorithm.
Parameter:
- topK: return how many top keywords. `None` for all possible words.
- withWeight: if True, return a list of (word, weight);
if False, return a list of words.
- allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr'].
if the POS of w is not in this list,it will be filtered.
- withFlag: only work with allowPOS is not empty.
if True, return a list of pair(word, weight) like posseg.cut
if False, return a list of words
"""
if allowPOS:
allowPOS = frozenset(allowPOS)
words = self.postokenizer.cut(sentence)
else:
words = self.tokenizer.cut(sentence)
freq = {}
for w in words:
if allowPOS:
if w.flag not in allowPOS:
continue
elif not withFlag:
w = w.word
wc = w.word if allowPOS and withFlag else w
if len(wc.strip()) < 2 or wc.lower() in self.stop_words:
continue
freq[w] = freq.get(w, 0.0) + 1.0
total = sum(freq.values())
for k in freq:
kw = k.word if allowPOS and withFlag else k
freq[k] *= self.idf_freq.get(kw, self.median_idf) / total
if withWeight:
tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
else:
tags = sorted(freq, key=freq.__getitem__, reverse=True)
if topK:
return tags[:topK]
else:
return tags

File diff suppressed because it is too large Load Diff

View File

@ -1,68 +1,100 @@
from __future__ import absolute_import, unicode_literals
import re
import os
import sys
import pickle
from .._compat import *
def load_model(f_name):
_curpath=os.path.normpath( os.path.join( os.getcwd(), os.path.dirname(__file__) ) )
prob_p_path = os.path.join(_curpath,f_name)
return eval(open(prob_p_path,"rb").read())
MIN_FLOAT = -3.14e100
prob_start = load_model("prob_start.py")
prob_trans = load_model("prob_trans.py")
prob_emit = load_model("prob_emit.py")
PROB_START_P = "prob_start.p"
PROB_TRANS_P = "prob_trans.p"
PROB_EMIT_P = "prob_emit.p"
PrevStatus = {
'B': 'ES',
'M': 'MB',
'S': 'SE',
'E': 'BM'
}
Force_Split_Words = set([])
def load_model():
start_p = pickle.load(get_module_res("finalseg", PROB_START_P))
trans_p = pickle.load(get_module_res("finalseg", PROB_TRANS_P))
emit_p = pickle.load(get_module_res("finalseg", PROB_EMIT_P))
return start_p, trans_p, emit_p
if sys.platform.startswith("java"):
start_P, trans_P, emit_P = load_model()
else:
from .prob_start import P as start_P
from .prob_trans import P as trans_P
from .prob_emit import P as emit_P
def viterbi(obs, states, start_p, trans_p, emit_p):
V = [{}] #tabular
path = {}
for y in states: #init
V[0][y] = start_p[y] * emit_p[y].get(obs[0],0)
path[y] = [y]
for t in range(1,len(obs)):
V.append({})
newpath = {}
for y in states:
(prob,state ) = max([(V[t-1][y0] * trans_p[y0].get(y,0) * emit_p[y].get(obs[t],0) ,y0) for y0 in states ])
V[t][y] =prob
newpath[y] = path[state] + [y]
path = newpath
V = [{}] # tabular
path = {}
for y in states: # init
V[0][y] = start_p[y] + emit_p[y].get(obs[0], MIN_FLOAT)
path[y] = [y]
for t in xrange(1, len(obs)):
V.append({})
newpath = {}
for y in states:
em_p = emit_p[y].get(obs[t], MIN_FLOAT)
(prob, state) = max(
[(V[t - 1][y0] + trans_p[y0].get(y, MIN_FLOAT) + em_p, y0) for y0 in PrevStatus[y]])
V[t][y] = prob
newpath[y] = path[state] + [y]
path = newpath
(prob, state) = max([(V[len(obs) - 1][y], y) for y in ('E','S')])
(prob, state) = max((V[len(obs) - 1][y], y) for y in 'ES')
return (prob, path[state])
return (prob, path[state])
def __cut(sentence):
prob, pos_list = viterbi(sentence,('B','M','E','S'), prob_start, prob_trans, prob_emit)
begin, next = 0,0
#print pos_list, sentence
for i,char in enumerate(sentence):
pos = pos_list[i]
if pos=='B':
begin = i
elif pos=='E':
yield sentence[begin:i+1]
next = i+1
elif pos=='S':
yield char
next = i+1
if next<len(sentence):
yield sentence[next:]
global emit_P
prob, pos_list = viterbi(sentence, 'BMES', start_P, trans_P, emit_P)
begin, nexti = 0, 0
# print pos_list, sentence
for i, char in enumerate(sentence):
pos = pos_list[i]
if pos == 'B':
begin = i
elif pos == 'E':
yield sentence[begin:i + 1]
nexti = i + 1
elif pos == 'S':
yield char
nexti = i + 1
if nexti < len(sentence):
yield sentence[nexti:]
re_han = re.compile("([\u4E00-\u9FD5]+)")
re_skip = re.compile("([a-zA-Z0-9]+(?:\.\d+)?%?)")
def add_force_split(word):
global Force_Split_Words
Force_Split_Words.add(word)
def cut(sentence):
if not ( type(sentence) is unicode):
try:
sentence = sentence.decode('utf-8')
except:
sentence = sentence.decode('gbk','ignore')
re_han, re_skip = re.compile(ur"([\u4E00-\u9FA5]+)"), re.compile(ur"[^a-zA-Z0-9+#\n]")
blocks = re_han.split(sentence)
for blk in blocks:
if re_han.match(blk):
for word in __cut(blk):
yield word
else:
tmp = re_skip.split(blk)
for x in tmp:
if x!="":
yield x
sentence = strdecode(sentence)
blocks = re_han.split(sentence)
for blk in blocks:
if re_han.match(blk):
for word in __cut(blk):
if word not in Force_Split_Words:
yield word
else:
for c in word:
yield c
else:
tmp = re_skip.split(blk)
for x in tmp:
if x:
yield x

105686
jieba/finalseg/prob_emit.p Normal file

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,14 @@
(dp0
S'B'
p1
F-0.26268660809250016
sS'E'
p2
F-3.14e+100
sS'M'
p3
F-3.14e+100
sS'S'
p4
F-1.4652633398537678
s.

View File

@ -1 +1,4 @@
{'B': 0.7689828525554734, 'E': 0.0, 'M': 0.0, 'S': 0.23101714744452656}
P={'B': -0.26268660809250016,
'E': -3.14e+100,
'M': -3.14e+100,
'S': -1.4652633398537678}

View File

@ -0,0 +1,30 @@
(dp0
S'B'
p1
(dp2
S'E'
p3
F-0.51082562376599
sS'M'
p4
F-0.916290731874155
ssg3
(dp5
g1
F-0.5897149736854513
sS'S'
p6
F-0.8085250474669937
ssg4
(dp7
g3
F-0.33344856811948514
sg4
F-1.2603623820268226
ssg6
(dp8
g1
F-0.7211965654669841
sg6
F-0.6658631448798212
ss.


@ -1,4 +1,4 @@
{'B': {'E': 0.8518218565181658, 'M': 0.14817814348183422},
'E': {'B': 0.5544853051164425, 'S': 0.44551469488355755},
'M': {'E': 0.7164487459986911, 'M': 0.2835512540013088},
'S': {'B': 0.48617017333894563, 'S': 0.5138298266610544}}
P={'B': {'E': -0.510825623765990, 'M': -0.916290731874155},
'E': {'B': -0.5897149736854513, 'S': -0.8085250474669937},
'M': {'E': -0.33344856811948514, 'M': -1.2603623820268226},
'S': {'B': -0.7211965654669841, 'S': -0.6658631448798212}}


@ -0,0 +1,46 @@
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Define the function to create lexical analysis model and model's data reader
"""
import sys
import os
import math
import paddle
import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
import jieba.lac_small.nets as nets
def create_model(vocab_size, num_labels, mode='train'):
"""create lac model"""
# model's input data
words = fluid.data(name='words', shape=[-1, 1], dtype='int64', lod_level=1)
targets = fluid.data(
name='targets', shape=[-1, 1], dtype='int64', lod_level=1)
# for inference process
if mode == 'infer':
crf_decode = nets.lex_net(
words, vocab_size, num_labels, for_infer=True, target=None)
return {
"feed_list": [words],
"words": words,
"crf_decode": crf_decode,
}
return ret

Binary files (20) not shown.

122
jieba/lac_small/nets.py Normal file

@ -0,0 +1,122 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The function lex_net(args) define the lexical analysis network structure
"""
import sys
import os
import math
import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
def lex_net(word, vocab_size, num_labels, for_infer=True, target=None):
"""
define the lexical analysis network structure
word: stores the input of the model
for_infer: a boolean value, indicating if the model to be created is for training or predicting.
return:
for infer: return the prediction
otherwise: return the prediction
"""
word_emb_dim=128
grnn_hidden_dim=128
bigru_num=2
emb_lr = 1.0
crf_lr = 1.0
init_bound = 0.1
IS_SPARSE = True
def _bigru_layer(input_feature):
"""
define the bidirectional gru layer
"""
pre_gru = fluid.layers.fc(
input=input_feature,
size=grnn_hidden_dim * 3,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
gru = fluid.layers.dynamic_gru(
input=pre_gru,
size=grnn_hidden_dim,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
pre_gru_r = fluid.layers.fc(
input=input_feature,
size=grnn_hidden_dim * 3,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
gru_r = fluid.layers.dynamic_gru(
input=pre_gru_r,
size=grnn_hidden_dim,
is_reverse=True,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
bi_merge = fluid.layers.concat(input=[gru, gru_r], axis=1)
return bi_merge
def _net_conf(word, target=None):
"""
Configure the network
"""
word_embedding = fluid.embedding(
input=word,
size=[vocab_size, word_emb_dim],
dtype='float32',
is_sparse=IS_SPARSE,
param_attr=fluid.ParamAttr(
learning_rate=emb_lr,
name="word_emb",
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound)))
input_feature = word_embedding
for i in range(bigru_num):
bigru_output = _bigru_layer(input_feature)
input_feature = bigru_output
emission = fluid.layers.fc(
size=num_labels,
input=bigru_output,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-init_bound, high=init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)))
size = emission.shape[1]
fluid.layers.create_parameter(
shape=[size + 2, size], dtype=emission.dtype, name='crfw')
crf_decode = fluid.layers.crf_decoding(
input=emission, param_attr=fluid.ParamAttr(name='crfw'))
return crf_decode
return _net_conf(word)


@ -0,0 +1,82 @@
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import time
import sys
import paddle.fluid as fluid
import paddle
import jieba.lac_small.utils as utils
import jieba.lac_small.creator as creator
import jieba.lac_small.reader_small as reader_small
import numpy
word_emb_dim=128
grnn_hidden_dim=128
bigru_num=2
use_cuda=False
basepath = os.path.abspath(__file__)
folder = os.path.dirname(basepath)
init_checkpoint = os.path.join(folder, "model_baseline")
batch_size=1
dataset = reader_small.Dataset()
infer_program = fluid.Program()
with fluid.program_guard(infer_program, fluid.default_startup_program()):
with fluid.unique_name.guard():
infer_ret = creator.create_model(dataset.vocab_size, dataset.num_labels, mode='infer')
infer_program = infer_program.clone(for_test=True)
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
utils.init_checkpoint(exe, init_checkpoint, infer_program)
results = []
def get_sent(str1):
feed_data=dataset.get_vars(str1)
a = numpy.array(feed_data).astype(numpy.int64)
a=a.reshape(-1,1)
c = fluid.create_lod_tensor(a, [[a.shape[0]]], place)
words, crf_decode = exe.run(
infer_program,
fetch_list=[infer_ret['words'], infer_ret['crf_decode']],
feed={"words":c, },
return_numpy=False,
use_program_cache=True)
sents=[]
sent,tag = utils.parse_result(words, crf_decode, dataset)
sents = sents + sent
return sents
def get_result(str1):
feed_data=dataset.get_vars(str1)
a = numpy.array(feed_data).astype(numpy.int64)
a=a.reshape(-1,1)
c = fluid.create_lod_tensor(a, [[a.shape[0]]], place)
words, crf_decode = exe.run(
infer_program,
fetch_list=[infer_ret['words'], infer_ret['crf_decode']],
feed={"words":c, },
return_numpy=False,
use_program_cache=True)
results=[]
results += utils.parse_result(words, crf_decode, dataset)
return results
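
A sketch of how the predictor above is consumed (this mirrors the paddle branch in jieba.posseg further below; it assumes paddlepaddle is installed, since the inference graph is built at import time):

import jieba.lac_small.predict as predict

tokens = predict.get_sent(u"南京市长江大桥")         # segmented tokens only
words, tags = predict.get_result(u"南京市长江大桥")  # tokens plus per-token tags
print(list(zip(words, tags)))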


@ -0,0 +1,100 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The file_reader converts raw corpus to input.
"""
import os
import __future__
import io
import paddle
import paddle.fluid as fluid
def load_kv_dict(dict_path,
reverse=False,
delimiter="\t",
key_func=None,
value_func=None):
"""
Load key-value dict from file
"""
result_dict = {}
for line in io.open(dict_path, "r", encoding='utf8'):
terms = line.strip("\n").split(delimiter)
if len(terms) != 2:
continue
if reverse:
value, key = terms
else:
key, value = terms
if key in result_dict:
raise KeyError("key duplicated with [%s]" % (key))
if key_func:
key = key_func(key)
if value_func:
value = value_func(value)
result_dict[key] = value
return result_dict
class Dataset(object):
"""data reader"""
def __init__(self):
# read dict
basepath = os.path.abspath(__file__)
folder = os.path.dirname(basepath)
word_dict_path = os.path.join(folder, "word.dic")
label_dict_path = os.path.join(folder, "tag.dic")
self.word2id_dict = load_kv_dict(
word_dict_path, reverse=True, value_func=int)
self.id2word_dict = load_kv_dict(word_dict_path)
self.label2id_dict = load_kv_dict(
label_dict_path, reverse=True, value_func=int)
self.id2label_dict = load_kv_dict(label_dict_path)
@property
def vocab_size(self):
"""vocabulary size"""
return max(self.word2id_dict.values()) + 1
@property
def num_labels(self):
"""num_labels"""
return max(self.label2id_dict.values()) + 1
def word_to_ids(self, words):
"""convert word to word index"""
word_ids = []
for word in words:
if word not in self.word2id_dict:
word = "OOV"
word_id = self.word2id_dict[word]
word_ids.append(word_id)
return word_ids
def label_to_ids(self, labels):
"""convert label to label index"""
label_ids = []
for label in labels:
if label not in self.label2id_dict:
label = "O"
label_id = self.label2id_dict[label]
label_ids.append(label_id)
return label_ids
def get_vars(self,str1):
words = str1.strip()
word_ids = self.word_to_ids(words)
return word_ids
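
An illustrative look at the Dataset reader above (assumes paddlepaddle and the bundled word.dic/tag.dic are available, since the module imports paddle.fluid at load time):

from jieba.lac_small.reader_small import Dataset

ds = Dataset()
print(ds.vocab_size, ds.num_labels)   # sizes derived from word.dic / tag.dic
print(ds.get_vars(u"南京市长江大桥"))  # one id per character; unknown characters fall back to "OOV"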

57
jieba/lac_small/tag.dic Normal file

@ -0,0 +1,57 @@
0 a-B
1 a-I
2 ad-B
3 ad-I
4 an-B
5 an-I
6 c-B
7 c-I
8 d-B
9 d-I
10 f-B
11 f-I
12 m-B
13 m-I
14 n-B
15 n-I
16 nr-B
17 nr-I
18 ns-B
19 ns-I
20 nt-B
21 nt-I
22 nw-B
23 nw-I
24 nz-B
25 nz-I
26 p-B
27 p-I
28 q-B
29 q-I
30 r-B
31 r-I
32 s-B
33 s-I
34 t-B
35 t-I
36 u-B
37 u-I
38 v-B
39 v-I
40 vd-B
41 vd-I
42 vn-B
43 vn-I
44 w-B
45 w-I
46 xc-B
47 xc-I
48 PER-B
49 PER-I
50 LOC-B
51 LOC-I
52 ORG-B
53 ORG-I
54 TIME-B
55 TIME-I
56 O
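
tag.dic maps integer ids to BIO-style labels ("a-B", "PER-I", ...) plus the catch-all "O"; load_kv_dict() in reader_small.py above parses it as a tab-separated file. A standalone parse (the path below is illustrative):

import io

id2label = {}
with io.open("jieba/lac_small/tag.dic", encoding="utf8") as f:  # assumed repo-relative path
    for line in f:
        idx, label = line.strip("\n").split("\t")
        id2label[idx] = label
print(id2label["48"], id2label["56"])  # PER-B O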

142
jieba/lac_small/utils.py Normal file

@ -0,0 +1,142 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
util tools
"""
from __future__ import print_function
import os
import sys
import numpy as np
import paddle.fluid as fluid
import io
def str2bool(v):
"""
argparse does not parse bool values directly, so treat "true"/"t"/"1" (any case) as True
"""
return v.lower() in ("true", "t", "1")
def parse_result(words, crf_decode, dataset):
""" parse result """
offset_list = (crf_decode.lod())[0]
words = np.array(words)
crf_decode = np.array(crf_decode)
batch_size = len(offset_list) - 1
for sent_index in range(batch_size):
begin, end = offset_list[sent_index], offset_list[sent_index + 1]
sent=[]
for id in words[begin:end]:
if dataset.id2word_dict[str(id[0])]=='OOV':
sent.append(' ')
else:
sent.append(dataset.id2word_dict[str(id[0])])
tags = [
dataset.id2label_dict[str(id[0])] for id in crf_decode[begin:end]
]
sent_out = []
tags_out = []
parital_word = ""
for ind, tag in enumerate(tags):
# for the first word
if parital_word == "":
parital_word = sent[ind]
tags_out.append(tag.split('-')[0])
continue
# for the beginning of word
if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
sent_out.append(parital_word)
tags_out.append(tag.split('-')[0])
parital_word = sent[ind]
continue
parital_word += sent[ind]
# append the last word, except for len(tags)=0
if len(sent_out) < len(tags_out):
sent_out.append(parital_word)
return sent_out,tags_out
def parse_padding_result(words, crf_decode, seq_lens, dataset):
""" parse padding result """
words = np.squeeze(words)
batch_size = len(seq_lens)
batch_out = []
for sent_index in range(batch_size):
sent=[]
for id in words[begin:end]:
if dataset.id2word_dict[str(id[0])]=='OOV':
sent.append(' ')
else:
sent.append(dataset.id2word_dict[str(id[0])])
tags = [
dataset.id2label_dict[str(id)]
for id in crf_decode[sent_index][1:seq_lens[sent_index] - 1]
]
sent_out = []
tags_out = []
parital_word = ""
for ind, tag in enumerate(tags):
# for the first word
if parital_word == "":
parital_word = sent[ind]
tags_out.append(tag.split('-')[0])
continue
# for the beginning of word
if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
sent_out.append(parital_word)
tags_out.append(tag.split('-')[0])
parital_word = sent[ind]
continue
parital_word += sent[ind]
# append the last word, except for len(tags)=0
if len(sent_out) < len(tags_out):
sent_out.append(parital_word)
batch_out.append([sent_out, tags_out])
return batch_out
def init_checkpoint(exe, init_checkpoint_path, main_program):
"""
Init CheckPoint
"""
assert os.path.exists(
init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path
def existed_persitables(var):
"""
Check whether a persistable variable exists in the checkpoint directory
"""
if not fluid.io.is_persistable(var):
return False
return os.path.exists(os.path.join(init_checkpoint_path, var.name))
fluid.io.load_vars(
exe,
init_checkpoint_path,
main_program=main_program,
predicate=existed_persitables)
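
The "-B"/"-I" merging loop inside parse_result() above is what turns per-character predictions into words; a self-contained sketch of the same logic (hypothetical helper, no Paddle tensors involved):

def merge_tags(chars, tags):
    words, out_tags, partial = [], [], ""
    for ind, tag in enumerate(tags):
        if partial == "":                 # first character of the sentence
            partial = chars[ind]
            out_tags.append(tag.split('-')[0])
            continue
        if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"):
            words.append(partial)         # close the previous word
            out_tags.append(tag.split('-')[0])
            partial = chars[ind]
            continue
        partial += chars[ind]             # continue the current word
    if len(words) < len(out_tags):        # flush the last word
        words.append(partial)
    return words, out_tags

print(merge_tags(list(u"南京市长江大桥"),
                 ["ns-B", "ns-I", "ns-I", "ns-B", "ns-I", "ns-I", "ns-I"]))
# (['南京市', '长江大桥'], ['ns', 'ns'])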

20940
jieba/lac_small/word.dic Normal file

File diff suppressed because it is too large

394
jieba/posseg/__init__.py Normal file → Executable file

@ -1,120 +1,310 @@
from __future__ import absolute_import, unicode_literals
import pickle
import re
import os
import viterbi
import jieba
import sys
default_encoding = sys.getfilesystemencoding()
from .viterbi import viterbi
from .._compat import *
def load_model(f_name):
_curpath=os.path.normpath( os.path.join( os.getcwd(), os.path.dirname(__file__) ) )
prob_p_path = os.path.join(_curpath,f_name)
if f_name.endswith(".py"):
return eval(open(prob_p_path,"rb").read())
else:
result = {}
for line in open(prob_p_path,"rb"):
line = line.strip()
if line=="":continue
word, _, tag = line.split(' ')
result[word.decode('utf-8')]=tag
return result
PROB_START_P = "prob_start.p"
PROB_TRANS_P = "prob_trans.p"
PROB_EMIT_P = "prob_emit.p"
CHAR_STATE_TAB_P = "char_state_tab.p"
re_han_detail = re.compile("([\u4E00-\u9FD5]+)")
re_skip_detail = re.compile("([\.0-9]+|[a-zA-Z0-9]+)")
re_han_internal = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._]+)")
re_skip_internal = re.compile("(\r\n|\s)")
re_eng = re.compile("[a-zA-Z0-9]+")
re_num = re.compile("[\.0-9]+")
re_eng1 = re.compile('^[a-zA-Z0-9]$', re.U)
prob_start = load_model("prob_start.py")
prob_trans = load_model("prob_trans.py")
prob_emit = load_model("prob_emit.py")
char_state_tab = load_model("char_state_tab.py")
word_tag_tab = load_model("../dict.txt")
def load_model():
# For Jython
start_p = pickle.load(get_module_res("posseg", PROB_START_P))
trans_p = pickle.load(get_module_res("posseg", PROB_TRANS_P))
emit_p = pickle.load(get_module_res("posseg", PROB_EMIT_P))
state = pickle.load(get_module_res("posseg", CHAR_STATE_TAB_P))
return state, start_p, trans_p, emit_p
if sys.platform.startswith("java"):
char_state_tab_P, start_P, trans_P, emit_P = load_model()
else:
from .char_state_tab import P as char_state_tab_P
from .prob_start import P as start_P
from .prob_trans import P as trans_P
from .prob_emit import P as emit_P
class pair(object):
def __init__(self,word,flag):
self.word = word
self.flag = flag
def __unicode__(self):
return self.word+u"/"+self.flag
def __init__(self, word, flag):
self.word = word
self.flag = flag
def __repr__(self):
return self.__str__()
def __unicode__(self):
return '%s/%s' % (self.word, self.flag)
def __str__(self):
return self.__unicode__().encode(default_encoding)
def __repr__(self):
return 'pair(%r, %r)' % (self.word, self.flag)
def encode(self,arg):
return self.__unicode__().encode(arg)
def __str__(self):
if PY2:
return self.__unicode__().encode(default_encoding)
else:
return self.__unicode__()
def __cut(sentence):
prob, pos_list = viterbi.viterbi(sentence,char_state_tab, prob_start, prob_trans, prob_emit)
begin, next = 0,0
def __iter__(self):
return iter((self.word, self.flag))
for i,char in enumerate(sentence):
pos = pos_list[i][0]
if pos=='B':
begin = i
elif pos=='E':
yield pair(sentence[begin:i+1], pos_list[i][1])
next = i+1
elif pos=='S':
yield pair(char,pos_list[i][1])
next = i+1
if next<len(sentence):
yield pair(sentence[next:], pos_list[next][1] )
def __lt__(self, other):
return self.word < other.word
def __cut_DAG(sentence):
DAG = jieba.get_DAG(sentence)
route ={}
jieba.calc(sentence,DAG,0,route=route)
x = 0
buf =u''
N = len(sentence)
while x<N:
y = route[x][1]+1
l_word = sentence[x:y]
if y-x==1:
buf+= l_word
else:
if len(buf)>0:
if len(buf)==1:
yield pair(buf,word_tag_tab.get(buf,'x'))
buf=u''
else:
regognized = __cut(buf)
for t in regognized:
yield t
buf=u''
yield pair(l_word,word_tag_tab.get(l_word,'x'))
x =y
def __eq__(self, other):
return isinstance(other, pair) and self.word == other.word and self.flag == other.flag
if len(buf)>0:
if len(buf)==1:
yield pair(buf,word_tag_tab.get(buf,'x'))
else:
regognized = __cut(buf)
for t in regognized:
yield t
def __hash__(self):
return hash(self.word)
def encode(self, arg):
return self.__unicode__().encode(arg)
def cut(sentence):
if not ( type(sentence) is unicode):
try:
sentence = sentence.decode('utf-8')
except:
sentence = sentence.decode('gbk','ignore')
re_han, re_skip = re.compile(ur"([\u4E00-\u9FA5]+)"), re.compile(ur"[^a-zA-Z0-9+#\n%]")
re_eng,re_num = re.compile(ur"[a-zA-Z+#]+"), re.compile(ur"[0-9]+")
blocks = re_han.split(sentence)
class POSTokenizer(object):
for blk in blocks:
if re_han.match(blk):
for word in __cut_DAG(blk):
yield word
else:
tmp = re_skip.split(blk)
for x in tmp:
if x!="":
if re_num.match(x):
yield pair(x,'m')
elif re_eng.match(x):
yield pair(x,'eng')
else:
yield pair(x,'x')
def __init__(self, tokenizer=None):
self.tokenizer = tokenizer or jieba.Tokenizer()
self.load_word_tag(self.tokenizer.get_dict_file())
def __repr__(self):
return '<POSTokenizer tokenizer=%r>' % self.tokenizer
def __getattr__(self, name):
if name in ('cut_for_search', 'lcut_for_search', 'tokenize'):
# may be possible?
raise NotImplementedError
return getattr(self.tokenizer, name)
def initialize(self, dictionary=None):
self.tokenizer.initialize(dictionary)
self.load_word_tag(self.tokenizer.get_dict_file())
def load_word_tag(self, f):
self.word_tag_tab = {}
f_name = resolve_filename(f)
for lineno, line in enumerate(f, 1):
try:
line = line.strip().decode("utf-8")
if not line:
continue
word, _, tag = line.split(" ")
self.word_tag_tab[word] = tag
except Exception:
raise ValueError(
'invalid POS dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
f.close()
def makesure_userdict_loaded(self):
if self.tokenizer.user_word_tag_tab:
self.word_tag_tab.update(self.tokenizer.user_word_tag_tab)
self.tokenizer.user_word_tag_tab = {}
def __cut(self, sentence):
prob, pos_list = viterbi(
sentence, char_state_tab_P, start_P, trans_P, emit_P)
begin, nexti = 0, 0
for i, char in enumerate(sentence):
pos = pos_list[i][0]
if pos == 'B':
begin = i
elif pos == 'E':
yield pair(sentence[begin:i + 1], pos_list[i][1])
nexti = i + 1
elif pos == 'S':
yield pair(char, pos_list[i][1])
nexti = i + 1
if nexti < len(sentence):
yield pair(sentence[nexti:], pos_list[nexti][1])
def __cut_detail(self, sentence):
blocks = re_han_detail.split(sentence)
for blk in blocks:
if re_han_detail.match(blk):
for word in self.__cut(blk):
yield word
else:
tmp = re_skip_detail.split(blk)
for x in tmp:
if x:
if re_num.match(x):
yield pair(x, 'm')
elif re_eng.match(x):
yield pair(x, 'eng')
else:
yield pair(x, 'x')
def __cut_DAG_NO_HMM(self, sentence):
DAG = self.tokenizer.get_DAG(sentence)
route = {}
self.tokenizer.calc(sentence, DAG, route)
x = 0
N = len(sentence)
buf = ''
while x < N:
y = route[x][1] + 1
l_word = sentence[x:y]
if re_eng1.match(l_word):
buf += l_word
x = y
else:
if buf:
yield pair(buf, 'eng')
buf = ''
yield pair(l_word, self.word_tag_tab.get(l_word, 'x'))
x = y
if buf:
yield pair(buf, 'eng')
buf = ''
def __cut_DAG(self, sentence):
DAG = self.tokenizer.get_DAG(sentence)
route = {}
self.tokenizer.calc(sentence, DAG, route)
x = 0
buf = ''
N = len(sentence)
while x < N:
y = route[x][1] + 1
l_word = sentence[x:y]
if y - x == 1:
buf += l_word
else:
if buf:
if len(buf) == 1:
yield pair(buf, self.word_tag_tab.get(buf, 'x'))
elif not self.tokenizer.FREQ.get(buf):
recognized = self.__cut_detail(buf)
for t in recognized:
yield t
else:
for elem in buf:
yield pair(elem, self.word_tag_tab.get(elem, 'x'))
buf = ''
yield pair(l_word, self.word_tag_tab.get(l_word, 'x'))
x = y
if buf:
if len(buf) == 1:
yield pair(buf, self.word_tag_tab.get(buf, 'x'))
elif not self.tokenizer.FREQ.get(buf):
recognized = self.__cut_detail(buf)
for t in recognized:
yield t
else:
for elem in buf:
yield pair(elem, self.word_tag_tab.get(elem, 'x'))
def __cut_internal(self, sentence, HMM=True):
self.makesure_userdict_loaded()
sentence = strdecode(sentence)
blocks = re_han_internal.split(sentence)
if HMM:
cut_blk = self.__cut_DAG
else:
cut_blk = self.__cut_DAG_NO_HMM
for blk in blocks:
if re_han_internal.match(blk):
for word in cut_blk(blk):
yield word
else:
tmp = re_skip_internal.split(blk)
for x in tmp:
if re_skip_internal.match(x):
yield pair(x, 'x')
else:
for xx in x:
if re_num.match(xx):
yield pair(xx, 'm')
elif re_eng.match(x):
yield pair(xx, 'eng')
else:
yield pair(xx, 'x')
def _lcut_internal(self, sentence):
return list(self.__cut_internal(sentence))
def _lcut_internal_no_hmm(self, sentence):
return list(self.__cut_internal(sentence, False))
def cut(self, sentence, HMM=True):
for w in self.__cut_internal(sentence, HMM=HMM):
yield w
def lcut(self, *args, **kwargs):
return list(self.cut(*args, **kwargs))
# default Tokenizer instance
dt = POSTokenizer(jieba.dt)
# global functions
initialize = dt.initialize
def _lcut_internal(s):
return dt._lcut_internal(s)
def _lcut_internal_no_hmm(s):
return dt._lcut_internal_no_hmm(s)
def cut(sentence, HMM=True, use_paddle=False):
"""
Global `cut` function that supports parallel processing.
Note that this only works using dt, custom POSTokenizer
instances are not supported.
"""
is_paddle_installed = check_paddle_install['is_paddle_installed']
if use_paddle and is_paddle_installed:
# if sentence is null, it will raise core exception in paddle.
if sentence is None or sentence == "" or sentence == u"":
return
import jieba.lac_small.predict as predict
sents, tags = predict.get_result(strdecode(sentence))
for i, sent in enumerate(sents):
if sent is None or tags[i] is None:
continue
yield pair(sent, tags[i])
return
global dt
if jieba.pool is None:
for w in dt.cut(sentence, HMM=HMM):
yield w
else:
parts = strdecode(sentence).splitlines(True)
if HMM:
result = jieba.pool.map(_lcut_internal, parts)
else:
result = jieba.pool.map(_lcut_internal_no_hmm, parts)
for r in result:
for w in r:
yield w
def lcut(sentence, HMM=True, use_paddle=False):
if use_paddle:
return list(cut(sentence, use_paddle=True))
return list(cut(sentence, HMM))
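
Typical use of the module-level API defined above (assumes jieba is installed; the paddle branch additionally needs paddlepaddle and a prior jieba.enable_paddle() call):

import jieba.posseg as pseg

for word, flag in pseg.cut(u"我爱北京天安门"):  # default HMM-based tagging
    print(word, flag)

# paddle mode (only if paddlepaddle is installed and enabled):
# import jieba
# jieba.enable_paddle()
# print(pseg.lcut(u"我爱北京天安门", use_paddle=True))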

335946
jieba/posseg/char_state_tab.p Normal file

File diff suppressed because it is too large

File diff suppressed because it is too large

269408
jieba/posseg/prob_emit.p Normal file

File diff suppressed because it is too large

File diff suppressed because it is too large

1094
jieba/posseg/prob_start.p Normal file

File diff suppressed because it is too large


@ -1,256 +1,256 @@
{('B', 'a'): 0.008545886571090637,
('B', 'ad'): 0.0012556950477614949,
('B', 'ag'): 0.0,
('B', 'an'): 0.0001670724139577068,
('B', 'b'): 0.006615272009801582,
('B', 'bg'): 0.0,
('B', 'c'): 0.03258575057944956,
('B', 'd'): 0.018778408940230508,
('B', 'df'): 0.00013790104009207547,
('B', 'dg'): 0.0,
('B', 'e'): 0.00019093990166595064,
('B', 'en'): 0.0,
('B', 'f'): 0.004121119544290101,
('B', 'g'): 0.0,
('B', 'h'): 1.3259715393468796e-06,
('B', 'i'): 0.0022077426130125543,
('B', 'in'): 0.0,
('B', 'j'): 0.006360685474246981,
('B', 'jn'): 0.0,
('B', 'k'): 0.0,
('B', 'l'): 0.007402899104173628,
('B', 'ln'): 0.0,
('B', 'm'): 0.02592804748038888,
('B', 'mg'): 0.0,
('B', 'mq'): 0.0011284017799841944,
('B', 'n'): 0.18330097962777328,
('B', 'ng'): 0.0,
('B', 'nr'): 0.10741562843095136,
('B', 'nrfg'): 0.0028123856349547313,
('B', 'nrt'): 0.006835383285333164,
('B', 'ns'): 0.05943667425122387,
('B', 'nt'): 0.007859033313708954,
('B', 'nz'): 0.0193127754705873,
('B', 'o'): 0.00021745933245288822,
('B', 'p'): 0.014980826451541043,
('B', 'q'): 0.00091359439061,
('B', 'qe'): 0.0,
('B', 'qg'): 0.0,
('B', 'r'): 0.033047188675142274,
('B', 'rg'): 0.0,
('B', 'rr'): 3.977914618040638e-06,
('B', 'rz'): 0.0003540344010056168,
('B', 's'): 0.0039951522480521475,
('B', 't'): 0.03457072997385184,
('B', 'tg'): 0.0,
('B', 'u'): 0.00010475175160840347,
('B', 'ud'): 0.0,
('B', 'ug'): 0.0,
('B', 'uj'): 0.0,
('B', 'ul'): 0.0,
('B', 'uv'): 0.0,
('B', 'uz'): 0.0,
('B', 'v'): 0.06897173559066729,
('B', 'vd'): 0.00011801146700187228,
('B', 'vg'): 0.0,
('B', 'vi'): 3.977914618040638e-06,
('B', 'vn'): 0.01314700781262431,
('B', 'vq'): 5.303886157387518e-06,
('B', 'w'): 0.0,
('B', 'x'): 0.0,
('B', 'y'): 5.303886157387518e-05,
('B', 'yg'): 0.0,
('B', 'z'): 0.0008711633013508998,
('B', 'zg'): 0.0,
('E', 'a'): 0.0,
('E', 'ad'): 0.0,
('E', 'ag'): 0.0,
('E', 'an'): 0.0,
('E', 'b'): 0.0,
('E', 'bg'): 0.0,
('E', 'c'): 0.0,
('E', 'd'): 0.0,
('E', 'df'): 0.0,
('E', 'dg'): 0.0,
('E', 'e'): 0.0,
('E', 'en'): 0.0,
('E', 'f'): 0.0,
('E', 'g'): 0.0,
('E', 'h'): 0.0,
('E', 'i'): 0.0,
('E', 'in'): 0.0,
('E', 'j'): 0.0,
('E', 'jn'): 0.0,
('E', 'k'): 0.0,
('E', 'l'): 0.0,
('E', 'ln'): 0.0,
('E', 'm'): 0.0,
('E', 'mg'): 0.0,
('E', 'mq'): 0.0,
('E', 'n'): 0.0,
('E', 'ng'): 0.0,
('E', 'nr'): 0.0,
('E', 'nrfg'): 0.0,
('E', 'nrt'): 0.0,
('E', 'ns'): 0.0,
('E', 'nt'): 0.0,
('E', 'nz'): 0.0,
('E', 'o'): 0.0,
('E', 'p'): 0.0,
('E', 'q'): 0.0,
('E', 'qe'): 0.0,
('E', 'qg'): 0.0,
('E', 'r'): 0.0,
('E', 'rg'): 0.0,
('E', 'rr'): 0.0,
('E', 'rz'): 0.0,
('E', 's'): 0.0,
('E', 't'): 0.0,
('E', 'tg'): 0.0,
('E', 'u'): 0.0,
('E', 'ud'): 0.0,
('E', 'ug'): 0.0,
('E', 'uj'): 0.0,
('E', 'ul'): 0.0,
('E', 'uv'): 0.0,
('E', 'uz'): 0.0,
('E', 'v'): 0.0,
('E', 'vd'): 0.0,
('E', 'vg'): 0.0,
('E', 'vi'): 0.0,
('E', 'vn'): 0.0,
('E', 'vq'): 0.0,
('E', 'w'): 0.0,
('E', 'x'): 0.0,
('E', 'y'): 0.0,
('E', 'yg'): 0.0,
('E', 'z'): 0.0,
('E', 'zg'): 0.0,
('M', 'a'): 0.0,
('M', 'ad'): 0.0,
('M', 'ag'): 0.0,
('M', 'an'): 0.0,
('M', 'b'): 0.0,
('M', 'bg'): 0.0,
('M', 'c'): 0.0,
('M', 'd'): 0.0,
('M', 'df'): 0.0,
('M', 'dg'): 0.0,
('M', 'e'): 0.0,
('M', 'en'): 0.0,
('M', 'f'): 0.0,
('M', 'g'): 0.0,
('M', 'h'): 0.0,
('M', 'i'): 0.0,
('M', 'in'): 0.0,
('M', 'j'): 0.0,
('M', 'jn'): 0.0,
('M', 'k'): 0.0,
('M', 'l'): 0.0,
('M', 'ln'): 0.0,
('M', 'm'): 0.0,
('M', 'mg'): 0.0,
('M', 'mq'): 0.0,
('M', 'n'): 0.0,
('M', 'ng'): 0.0,
('M', 'nr'): 0.0,
('M', 'nrfg'): 0.0,
('M', 'nrt'): 0.0,
('M', 'ns'): 0.0,
('M', 'nt'): 0.0,
('M', 'nz'): 0.0,
('M', 'o'): 0.0,
('M', 'p'): 0.0,
('M', 'q'): 0.0,
('M', 'qe'): 0.0,
('M', 'qg'): 0.0,
('M', 'r'): 0.0,
('M', 'rg'): 0.0,
('M', 'rr'): 0.0,
('M', 'rz'): 0.0,
('M', 's'): 0.0,
('M', 't'): 0.0,
('M', 'tg'): 0.0,
('M', 'u'): 0.0,
('M', 'ud'): 0.0,
('M', 'ug'): 0.0,
('M', 'uj'): 0.0,
('M', 'ul'): 0.0,
('M', 'uv'): 0.0,
('M', 'uz'): 0.0,
('M', 'v'): 0.0,
('M', 'vd'): 0.0,
('M', 'vg'): 0.0,
('M', 'vi'): 0.0,
('M', 'vn'): 0.0,
('M', 'vq'): 0.0,
('M', 'w'): 0.0,
('M', 'x'): 0.0,
('M', 'y'): 0.0,
('M', 'yg'): 0.0,
('M', 'z'): 0.0,
('M', 'zg'): 0.0,
('S', 'a'): 0.020190568629634933,
('S', 'ad'): 1.5911658472162552e-05,
('S', 'ag'): 0.0009546995083297532,
('S', 'an'): 2.651943078693759e-06,
('S', 'b'): 0.0015447568433391145,
('S', 'bg'): 0.0,
('S', 'c'): 0.008337709039413178,
('S', 'd'): 0.020162723227308648,
('S', 'df'): 0.0,
('S', 'dg'): 0.0001299452108559942,
('S', 'e'): 0.0026254236479068215,
('S', 'en'): 0.0,
('S', 'f'): 0.0055452129775486496,
('S', 'g'): 0.0014917179817652395,
('S', 'h'): 0.00017502824319378808,
('S', 'i'): 0.0,
('S', 'in'): 0.0,
('S', 'j'): 0.007357816071835834,
('S', 'jn'): 0.0,
('S', 'k'): 0.000967959223723222,
('S', 'l'): 0.0,
('S', 'ln'): 0.0,
('S', 'm'): 0.038036819577704585,
('S', 'mg'): 1.988957309020319e-05,
('S', 'mq'): 0.0,
('S', 'n'): 0.021170461597212278,
('S', 'ng'): 0.007347208299521059,
('S', 'nr'): 0.011291973629078026,
('S', 'nrfg'): 0.0,
('S', 'nrt'): 0.0,
('S', 'ns'): 0.0,
('S', 'nt'): 5.303886157387518e-06,
('S', 'nz'): 0.0,
('S', 'o'): 0.00021082947475615385,
('S', 'p'): 0.05044658721445203,
('S', 'q'): 0.007531518343490275,
('S', 'qe'): 0.0,
('S', 'qg'): 0.0,
('S', 'r'): 0.06306851029749498,
('S', 'rg'): 3.447526002301887e-05,
('S', 'rr'): 0.0,
('S', 'rz'): 0.0,
('S', 's'): 0.0,
('S', 't'): 0.0,
('S', 'tg'): 0.0018868575004906095,
('S', 'u'): 0.000967959223723222,
('S', 'ud'): 0.000440222551063164,
('S', 'ug'): 0.0005317145872780986,
('S', 'uj'): 0.001056799316859463,
('S', 'ul'): 0.00022143724707092888,
('S', 'uv'): 0.00028640985249892595,
('S', 'uz'): 9.149203621493468e-05,
('S', 'v'): 0.04720326082920956,
('S', 'vd'): 0.0,
('S', 'vg'): 0.0026240976763674743,
('S', 'vi'): 0.0,
('S', 'vn'): 1.0607772314775036e-05,
('S', 'vq'): 0.0,
('S', 'w'): 0.0,
('S', 'x'): 0.0002187853039922351,
('S', 'y'): 0.00203536631289746,
('S', 'yg'): 1.3259715393468796e-06,
('S', 'z'): 0.0,
('S', 'zg'): 0.0}
P={('B', 'a'): -4.762305214596967,
('B', 'ad'): -6.680066036784177,
('B', 'ag'): -3.14e+100,
('B', 'an'): -8.697083223018778,
('B', 'b'): -5.018374362109218,
('B', 'bg'): -3.14e+100,
('B', 'c'): -3.423880184954888,
('B', 'd'): -3.9750475297585357,
('B', 'df'): -8.888974230828882,
('B', 'dg'): -3.14e+100,
('B', 'e'): -8.563551830394255,
('B', 'en'): -3.14e+100,
('B', 'f'): -5.491630418482717,
('B', 'g'): -3.14e+100,
('B', 'h'): -13.533365129970255,
('B', 'i'): -6.1157847275557105,
('B', 'in'): -3.14e+100,
('B', 'j'): -5.0576191284681915,
('B', 'jn'): -3.14e+100,
('B', 'k'): -3.14e+100,
('B', 'l'): -4.905883584659895,
('B', 'ln'): -3.14e+100,
('B', 'm'): -3.6524299819046386,
('B', 'mg'): -3.14e+100,
('B', 'mq'): -6.78695300139688,
('B', 'n'): -1.6966257797548328,
('B', 'ng'): -3.14e+100,
('B', 'nr'): -2.2310495913769506,
('B', 'nrfg'): -5.873722175405573,
('B', 'nrt'): -4.985642733519195,
('B', 'ns'): -2.8228438314969213,
('B', 'nt'): -4.846091668182416,
('B', 'nz'): -3.94698846057672,
('B', 'o'): -8.433498702146057,
('B', 'p'): -4.200984132085048,
('B', 'q'): -6.998123858956596,
('B', 'qe'): -3.14e+100,
('B', 'qg'): -3.14e+100,
('B', 'r'): -3.4098187790818413,
('B', 'rg'): -3.14e+100,
('B', 'rr'): -12.434752841302146,
('B', 'rz'): -7.946116471570005,
('B', 's'): -5.522673590839954,
('B', 't'): -3.3647479094528574,
('B', 'tg'): -3.14e+100,
('B', 'u'): -9.163917277503234,
('B', 'ud'): -3.14e+100,
('B', 'ug'): -3.14e+100,
('B', 'uj'): -3.14e+100,
('B', 'ul'): -3.14e+100,
('B', 'uv'): -3.14e+100,
('B', 'uz'): -3.14e+100,
('B', 'v'): -2.6740584874265685,
('B', 'vd'): -9.044728760238115,
('B', 'vg'): -3.14e+100,
('B', 'vi'): -12.434752841302146,
('B', 'vn'): -4.3315610890163585,
('B', 'vq'): -12.147070768850364,
('B', 'w'): -3.14e+100,
('B', 'x'): -3.14e+100,
('B', 'y'): -9.844485675856319,
('B', 'yg'): -3.14e+100,
('B', 'z'): -7.045681111485645,
('B', 'zg'): -3.14e+100,
('E', 'a'): -3.14e+100,
('E', 'ad'): -3.14e+100,
('E', 'ag'): -3.14e+100,
('E', 'an'): -3.14e+100,
('E', 'b'): -3.14e+100,
('E', 'bg'): -3.14e+100,
('E', 'c'): -3.14e+100,
('E', 'd'): -3.14e+100,
('E', 'df'): -3.14e+100,
('E', 'dg'): -3.14e+100,
('E', 'e'): -3.14e+100,
('E', 'en'): -3.14e+100,
('E', 'f'): -3.14e+100,
('E', 'g'): -3.14e+100,
('E', 'h'): -3.14e+100,
('E', 'i'): -3.14e+100,
('E', 'in'): -3.14e+100,
('E', 'j'): -3.14e+100,
('E', 'jn'): -3.14e+100,
('E', 'k'): -3.14e+100,
('E', 'l'): -3.14e+100,
('E', 'ln'): -3.14e+100,
('E', 'm'): -3.14e+100,
('E', 'mg'): -3.14e+100,
('E', 'mq'): -3.14e+100,
('E', 'n'): -3.14e+100,
('E', 'ng'): -3.14e+100,
('E', 'nr'): -3.14e+100,
('E', 'nrfg'): -3.14e+100,
('E', 'nrt'): -3.14e+100,
('E', 'ns'): -3.14e+100,
('E', 'nt'): -3.14e+100,
('E', 'nz'): -3.14e+100,
('E', 'o'): -3.14e+100,
('E', 'p'): -3.14e+100,
('E', 'q'): -3.14e+100,
('E', 'qe'): -3.14e+100,
('E', 'qg'): -3.14e+100,
('E', 'r'): -3.14e+100,
('E', 'rg'): -3.14e+100,
('E', 'rr'): -3.14e+100,
('E', 'rz'): -3.14e+100,
('E', 's'): -3.14e+100,
('E', 't'): -3.14e+100,
('E', 'tg'): -3.14e+100,
('E', 'u'): -3.14e+100,
('E', 'ud'): -3.14e+100,
('E', 'ug'): -3.14e+100,
('E', 'uj'): -3.14e+100,
('E', 'ul'): -3.14e+100,
('E', 'uv'): -3.14e+100,
('E', 'uz'): -3.14e+100,
('E', 'v'): -3.14e+100,
('E', 'vd'): -3.14e+100,
('E', 'vg'): -3.14e+100,
('E', 'vi'): -3.14e+100,
('E', 'vn'): -3.14e+100,
('E', 'vq'): -3.14e+100,
('E', 'w'): -3.14e+100,
('E', 'x'): -3.14e+100,
('E', 'y'): -3.14e+100,
('E', 'yg'): -3.14e+100,
('E', 'z'): -3.14e+100,
('E', 'zg'): -3.14e+100,
('M', 'a'): -3.14e+100,
('M', 'ad'): -3.14e+100,
('M', 'ag'): -3.14e+100,
('M', 'an'): -3.14e+100,
('M', 'b'): -3.14e+100,
('M', 'bg'): -3.14e+100,
('M', 'c'): -3.14e+100,
('M', 'd'): -3.14e+100,
('M', 'df'): -3.14e+100,
('M', 'dg'): -3.14e+100,
('M', 'e'): -3.14e+100,
('M', 'en'): -3.14e+100,
('M', 'f'): -3.14e+100,
('M', 'g'): -3.14e+100,
('M', 'h'): -3.14e+100,
('M', 'i'): -3.14e+100,
('M', 'in'): -3.14e+100,
('M', 'j'): -3.14e+100,
('M', 'jn'): -3.14e+100,
('M', 'k'): -3.14e+100,
('M', 'l'): -3.14e+100,
('M', 'ln'): -3.14e+100,
('M', 'm'): -3.14e+100,
('M', 'mg'): -3.14e+100,
('M', 'mq'): -3.14e+100,
('M', 'n'): -3.14e+100,
('M', 'ng'): -3.14e+100,
('M', 'nr'): -3.14e+100,
('M', 'nrfg'): -3.14e+100,
('M', 'nrt'): -3.14e+100,
('M', 'ns'): -3.14e+100,
('M', 'nt'): -3.14e+100,
('M', 'nz'): -3.14e+100,
('M', 'o'): -3.14e+100,
('M', 'p'): -3.14e+100,
('M', 'q'): -3.14e+100,
('M', 'qe'): -3.14e+100,
('M', 'qg'): -3.14e+100,
('M', 'r'): -3.14e+100,
('M', 'rg'): -3.14e+100,
('M', 'rr'): -3.14e+100,
('M', 'rz'): -3.14e+100,
('M', 's'): -3.14e+100,
('M', 't'): -3.14e+100,
('M', 'tg'): -3.14e+100,
('M', 'u'): -3.14e+100,
('M', 'ud'): -3.14e+100,
('M', 'ug'): -3.14e+100,
('M', 'uj'): -3.14e+100,
('M', 'ul'): -3.14e+100,
('M', 'uv'): -3.14e+100,
('M', 'uz'): -3.14e+100,
('M', 'v'): -3.14e+100,
('M', 'vd'): -3.14e+100,
('M', 'vg'): -3.14e+100,
('M', 'vi'): -3.14e+100,
('M', 'vn'): -3.14e+100,
('M', 'vq'): -3.14e+100,
('M', 'w'): -3.14e+100,
('M', 'x'): -3.14e+100,
('M', 'y'): -3.14e+100,
('M', 'yg'): -3.14e+100,
('M', 'z'): -3.14e+100,
('M', 'zg'): -3.14e+100,
('S', 'a'): -3.9025396831295227,
('S', 'ad'): -11.048458480182255,
('S', 'ag'): -6.954113917960154,
('S', 'an'): -12.84021794941031,
('S', 'b'): -6.472888763970454,
('S', 'bg'): -3.14e+100,
('S', 'c'): -4.786966795861212,
('S', 'd'): -3.903919764181873,
('S', 'df'): -3.14e+100,
('S', 'dg'): -8.948397651299683,
('S', 'e'): -5.942513006281674,
('S', 'en'): -3.14e+100,
('S', 'f'): -5.194820249981676,
('S', 'g'): -6.507826815331734,
('S', 'h'): -8.650563207383884,
('S', 'i'): -3.14e+100,
('S', 'in'): -3.14e+100,
('S', 'j'): -4.911992119644354,
('S', 'jn'): -3.14e+100,
('S', 'k'): -6.940320595827818,
('S', 'l'): -3.14e+100,
('S', 'ln'): -3.14e+100,
('S', 'm'): -3.269200652116097,
('S', 'mg'): -10.825314928868044,
('S', 'mq'): -3.14e+100,
('S', 'n'): -3.8551483897645107,
('S', 'ng'): -4.913434861102905,
('S', 'nr'): -4.483663103956885,
('S', 'nrfg'): -3.14e+100,
('S', 'nrt'): -3.14e+100,
('S', 'ns'): -3.14e+100,
('S', 'nt'): -12.147070768850364,
('S', 'nz'): -3.14e+100,
('S', 'o'): -8.464460927750023,
('S', 'p'): -2.9868401813596317,
('S', 'q'): -4.888658618255058,
('S', 'qe'): -3.14e+100,
('S', 'qg'): -3.14e+100,
('S', 'r'): -2.7635336784127853,
('S', 'rg'): -10.275268591948773,
('S', 'rr'): -3.14e+100,
('S', 'rz'): -3.14e+100,
('S', 's'): -3.14e+100,
('S', 't'): -3.14e+100,
('S', 'tg'): -6.272842531880403,
('S', 'u'): -6.940320595827818,
('S', 'ud'): -7.728230161053767,
('S', 'ug'): -7.5394037026636855,
('S', 'uj'): -6.85251045118004,
('S', 'ul'): -8.4153713175535,
('S', 'uv'): -8.15808672228609,
('S', 'uz'): -9.299258625372996,
('S', 'v'): -3.053292303412302,
('S', 'vd'): -3.14e+100,
('S', 'vg'): -5.9430181843676895,
('S', 'vi'): -3.14e+100,
('S', 'vn'): -11.453923588290419,
('S', 'vq'): -3.14e+100,
('S', 'w'): -3.14e+100,
('S', 'x'): -8.427419656069674,
('S', 'y'): -6.1970794699489575,
('S', 'yg'): -13.533365129970255,
('S', 'z'): -3.14e+100,
('S', 'zg'): -3.14e+100}

11530
jieba/posseg/prob_trans.p Normal file

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -1,42 +1,53 @@
import sys
import operator
MIN_FLOAT = -3.14e100
MIN_INF = float("-inf")
if sys.version_info[0] > 2:
xrange = range
def get_top_states(t_state_v, K=4):
return sorted(t_state_v, key=t_state_v.__getitem__, reverse=True)[:K]
def get_top_states(t_state_v,K=4):
items = t_state_v.items()
topK= sorted(items,key=operator.itemgetter(1),reverse=True)[:K]
return [x[0] for x in topK]
def viterbi(obs, states, start_p, trans_p, emit_p):
V = [{}] #tabular
mem_path = [{}]
all_states = trans_p.keys()
for y in states.get(obs[0],all_states): #init
V[0][y] = start_p[y] * emit_p[y].get(obs[0],0)
mem_path[0][y] = ''
for t in range(1,len(obs)):
V.append({})
mem_path.append({})
prev_states = get_top_states(V[t-1])
prev_states =[ x for x in mem_path[t-1].keys() if len(trans_p[x])>0 ]
V = [{}] # tabular
mem_path = [{}]
all_states = trans_p.keys()
for y in states.get(obs[0], all_states): # init
V[0][y] = start_p[y] + emit_p[y].get(obs[0], MIN_FLOAT)
mem_path[0][y] = ''
for t in xrange(1, len(obs)):
V.append({})
mem_path.append({})
#prev_states = get_top_states(V[t-1])
prev_states = [
x for x in mem_path[t - 1].keys() if len(trans_p[x]) > 0]
prev_states_expect_next = set( (y for x in prev_states for y in trans_p[x].keys() ) )
obs_states = states.get(obs[t],all_states)
obs_states = set(obs_states) & set(prev_states_expect_next)
prev_states_expect_next = set(
(y for x in prev_states for y in trans_p[x].keys()))
obs_states = set(
states.get(obs[t], all_states)) & prev_states_expect_next
if len(obs_states)==0: obs_states = all_states
for y in obs_states:
(prob,state ) = max([(V[t-1][y0] * trans_p[y0].get(y,0) * emit_p[y].get(obs[t],0) ,y0) for y0 in prev_states])
V[t][y] =prob
mem_path[t][y] = state
if not obs_states:
obs_states = prev_states_expect_next if prev_states_expect_next else all_states
last = [(V[-1][y], y) for y in mem_path[-1].keys() ]
#if len(last)==0:
#print obs
(prob, state) = max(last)
for y in obs_states:
prob, state = max((V[t - 1][y0] + trans_p[y0].get(y, MIN_INF) +
emit_p[y].get(obs[t], MIN_FLOAT), y0) for y0 in prev_states)
V[t][y] = prob
mem_path[t][y] = state
route = [None] * len(obs)
i = len(obs)-1
while i>=0:
route[i] = state
state = mem_path[i][state]
i-=1
return (prob, route)
last = [(V[-1][y], y) for y in mem_path[-1].keys()]
# if len(last)==0:
# print obs
prob, state = max(last)
route = [None] * len(obs)
i = len(obs) - 1
while i >= 0:
route[i] = state
state = mem_path[i][state]
i -= 1
return (prob, route)


@ -1,11 +1,75 @@
# -*- coding: utf-8 -*-
from distutils.core import setup
LONGDOC = """
jieba
=====
结巴中文分词做最好的 Python 中文分词组件
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to
be the best Python Chinese word segmentation module.
完整文档见 ``README.md``
GitHub: https://github.com/fxsjy/jieba
特点
====
- 支持三种分词模式
- 精确模式,试图将句子最精确地切开,适合文本分析;
- 全模式,把句子中所有的可以成词的词语都扫描出来,
速度非常快,但是不能解决歧义;
- 搜索引擎模式,在精确模式的基础上,对长词再次切分,提高召回率,适合用于搜索引擎分词。
- 支持繁体分词
- 支持自定义词典
- MIT 授权协议
在线演示 http://jiebademo.ap01.aws.af.cm/
安装说明
========
代码对 Python 2/3 均兼容
- 全自动安装 ``easy_install jieba`` 或者 ``pip install jieba`` / ``pip3 install jieba``
- 半自动安装先下载 https://pypi.python.org/pypi/jieba/ 解压后运行
python setup.py install
- 手动安装 jieba 目录放置于当前目录或者 site-packages 目录
- 通过 ``import jieba`` 来引用
"""
setup(name='jieba',
version='0.22',
description='Chinese Words Segementation Utilities',
version='0.42.1',
description='Chinese Words Segmentation Utilities',
long_description=LONGDOC,
author='Sun, Junyi',
author_email='ccnusjy@gmail.com',
url='http://github.com/fxsjy',
url='https://github.com/fxsjy/jieba',
license="MIT",
classifiers=[
'Intended Audience :: Developers',
'License :: OSI Approved :: MIT License',
'Operating System :: OS Independent',
'Natural Language :: Chinese (Simplified)',
'Natural Language :: Chinese (Traditional)',
'Programming Language :: Python',
'Programming Language :: Python :: 2',
'Programming Language :: Python :: 2.6',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.2',
'Programming Language :: Python :: 3.3',
'Programming Language :: Python :: 3.4',
'Topic :: Text Processing',
'Topic :: Text Processing :: Indexing',
'Topic :: Text Processing :: Linguistic',
],
keywords='NLP,tokenizing,Chinese word segmentation',
packages=['jieba'],
package_dir={'jieba':'jieba'},
package_data={'jieba':['*.*','finalseg/*','analyse/*','posseg/*']}
package_data={'jieba':['*.*','finalseg/*','analyse/*','posseg/*', 'lac_small/*.py','lac_small/*.dic', 'lac_small/model_baseline/*']}
)


@ -1,17 +1,84 @@
#encoding=utf-8
from __future__ import unicode_literals
import sys
sys.path.append("../")
import jieba
import jieba.posseg
import jieba.analyse
seg_list = jieba.cut("我来到北京清华大学",cut_all=True)
print "Full Mode:", "/ ".join(seg_list) #全模式
print('='*40)
print('1. 分词')
print('-'*40)
seg_list = jieba.cut("我来到北京清华大学",cut_all=False)
print "Default Mode:", "/ ".join(seg_list) #默认模式
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list)) # 全模式
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list)) # 默认模式
seg_list = jieba.cut("他来到了网易杭研大厦")
print ", ".join(seg_list)
print(", ".join(seg_list))
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") #搜索引擎模式
print ", ".join(seg_list)
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") # 搜索引擎模式
print(", ".join(seg_list))
print('='*40)
print('2. 添加自定义词典/调整词典')
print('-'*40)
print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
#如果/放到/post/中将/出错/。
print(jieba.suggest_freq(('中', '将'), True))
#494
print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
#如果/放到/post/中/将/出错/。
print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
#「/台/中/」/正确/应该/不会/被/切开
print(jieba.suggest_freq('台中', True))
#69
print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
#「/台中/」/正确/应该/不会/被/切开
print('='*40)
print('3. 关键词提取')
print('-'*40)
print(' TF-IDF')
print('-'*40)
s = "此外公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元增资后吉林欧亚置业注册资本由7000万元增加到5亿元。吉林欧亚置业主要经营范围为房地产开发及百货零售等业务。目前在建吉林欧亚城市商业综合体项目。2013年实现营业收入0万元实现净利润-139.13万元。"
for x, w in jieba.analyse.extract_tags(s, withWeight=True):
print('%s %s' % (x, w))
print('-'*40)
print(' TextRank')
print('-'*40)
for x, w in jieba.analyse.textrank(s, withWeight=True):
print('%s %s' % (x, w))
print('='*40)
print('4. 词性标注')
print('-'*40)
words = jieba.posseg.cut("我爱北京天安门")
for word, flag in words:
print('%s %s' % (word, flag))
print('='*40)
print('6. Tokenize: 返回词语在原文的起止位置')
print('-'*40)
print(' 默认模式')
print('-'*40)
result = jieba.tokenize('永和服装饰品有限公司')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
print('-'*40)
print(' 搜索模式')
print('-'*40)
result = jieba.tokenize('永和服装饰品有限公司', mode='search')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))


@ -5,29 +5,26 @@ import jieba
import jieba.analyse
from optparse import OptionParser
USAGE ="usage: python extract_tags.py [file name] -k [top k]"
USAGE = "usage: python extract_tags.py [file name] -k [top k]"
parser = OptionParser(USAGE)
parser.add_option("-k",dest="topK")
parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()
if len(args) <1:
print USAGE
sys.exit(1)
if len(args) < 1:
print(USAGE)
sys.exit(1)
file_name = args[0]
if opt.topK==None:
topK=10
if opt.topK is None:
topK = 10
else:
topK = int(opt.topK)
topK = int(opt.topK)
content = open(file_name, 'rb').read()
content = open(file_name,'rb').read()
tags = jieba.analyse.extract_tags(content,topK=topK)
print ",".join(tags)
tags = jieba.analyse.extract_tags(content, topK=topK)
print(",".join(tags))


@ -0,0 +1,32 @@
import sys
sys.path.append('../')
import jieba
import jieba.analyse
from optparse import OptionParser
USAGE = "usage: python extract_tags_idfpath.py [file name] -k [top k]"
parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()
if len(args) < 1:
print(USAGE)
sys.exit(1)
file_name = args[0]
if opt.topK is None:
topK = 10
else:
topK = int(opt.topK)
content = open(file_name, 'rb').read()
jieba.analyse.set_idf_path("../extra_dict/idf.txt.big");
tags = jieba.analyse.extract_tags(content, topK=topK)
print(",".join(tags))


@ -0,0 +1,33 @@
import sys
sys.path.append('../')
import jieba
import jieba.analyse
from optparse import OptionParser
USAGE = "usage: python extract_tags_stop_words.py [file name] -k [top k]"
parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()
if len(args) < 1:
print(USAGE)
sys.exit(1)
file_name = args[0]
if opt.topK is None:
topK = 10
else:
topK = int(opt.topK)
content = open(file_name, 'rb').read()
jieba.analyse.set_stop_words("../extra_dict/stop_words.txt")
jieba.analyse.set_idf_path("../extra_dict/idf.txt.big");
tags = jieba.analyse.extract_tags(content, topK=topK)
print(",".join(tags))


@ -0,0 +1,43 @@
import sys
sys.path.append('../')
import jieba
import jieba.analyse
from optparse import OptionParser
USAGE = "usage: python extract_tags_with_weight.py [file name] -k [top k] -w [with weight=1 or 0]"
parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
parser.add_option("-w", dest="withWeight")
opt, args = parser.parse_args()
if len(args) < 1:
print(USAGE)
sys.exit(1)
file_name = args[0]
if opt.topK is None:
topK = 10
else:
topK = int(opt.topK)
if opt.withWeight is None:
withWeight = False
else:
if int(opt.withWeight) == 1:
withWeight = True
else:
withWeight = False
content = open(file_name, 'rb').read()
tags = jieba.analyse.extract_tags(content, topK=topK, withWeight=withWeight)
if withWeight is True:
for tag in tags:
print("tag: %s\t\t weight: %f" % (tag[0],tag[1]))
else:
print(",".join(tags))

63
test/extract_topic.py Normal file

@ -0,0 +1,63 @@
import sys
sys.path.append("../")
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import decomposition
import jieba
import time
import glob
import sys
import os
import random
if len(sys.argv)<2:
print("usage: extract_topic.py directory [n_topic] [n_top_words]")
sys.exit(0)
n_topic = 10
n_top_words = 25
if len(sys.argv)>2:
n_topic = int(sys.argv[2])
if len(sys.argv)>3:
n_top_words = int(sys.argv[3])
count_vect = CountVectorizer()
docs = []
pattern = os.path.join(sys.argv[1],"*.txt")
print("read "+pattern)
for f_name in glob.glob(pattern):
with open(f_name) as f:
print("read file:", f_name)
for line in f: #one line as a document
words = " ".join(jieba.cut(line))
docs.append(words)
random.shuffle(docs)
print("read done.")
print("transform")
counts = count_vect.fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)
t0 = time.time()
print("training...")
nmf = decomposition.NMF(n_components=n_topic).fit(tfidf)
print("done in %0.3fs." % (time.time() - t0))
# Inverse the vectorizer vocabulary to be able
feature_names = count_vect.get_feature_names()
for topic_idx, topic in enumerate(nmf.components_):
print("Topic #%d:" % topic_idx)
print(" ".join([feature_names[i]
for i in topic.argsort()[:-n_top_words - 1:-1]]))
print("")

1
test/foobar.txt Normal file

@ -0,0 +1 @@
好人 12 n

205
test/jieba_test.py Normal file

@ -0,0 +1,205 @@
#-*-coding: utf-8 -*-
from __future__ import unicode_literals, print_function
import sys
sys.path.append("../")
import unittest
import types
import jieba
if sys.version_info[0] > 2:
from imp import reload
jieba.initialize()
test_contents = [
"这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。",
"我不喜欢日本和服。",
"雷猴回归人间。",
"工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作",
"我需要廉租房",
"永和服装饰品有限公司",
"我爱北京天安门",
"abc",
"隐马尔可夫",
"雷猴是个好网站",
"“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成",
"草泥马和欺实马是今年的流行词汇",
"伊藤洋华堂总府店",
"中国科学院计算技术研究所",
"罗密欧与朱丽叶",
"我购买了道具和服装",
"PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍",
"湖北省石首市",
"湖北省十堰市",
"总经理完成了这件事情",
"电脑修好了",
"做好了这件事情就一了百了了",
"人们审美的观点是不同的",
"我们买了一个美的空调",
"线程初始化时我们要注意",
"一个分子是由好多原子组织成的",
"祝你马到功成",
"他掉进了无底洞里",
"中国的首都是北京",
"孙君意",
"外交部发言人马朝旭",
"领导人会议和第四届东亚峰会",
"在过去的这五年",
"还需要很长的路要走",
"60周年首都阅兵",
"你好人们审美的观点是不同的",
"买水果然后来世博园",
"买水果然后去世博园",
"但是后来我才知道你是对的",
"存在即合理",
"的的的的的在的的的的就以和和和",
"I love你不以为耻反以为rong",
"",
"",
"hello你好人们审美的观点是不同的",
"很好但主要是基于网页形式",
"hello你好人们审美的观点是不同的",
"为什么我不能拥有想要的生活",
"后来我才",
"此次来中国是为了",
"使用了它就可以解决一些问题",
",使用了它就可以解决一些问题",
"其实使用了它就可以解决一些问题",
"好人使用了它就可以解决一些问题",
"是因为和国家",
"老年搜索还支持",
"干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ",
"",
"",
"他说的确实在理",
"长春市长春节讲话",
"结婚的和尚未结婚的",
"结合成分子时",
"旅游和服务是最好的",
"这件事情的确是我的错",
"供大家参考指正",
"哈尔滨政府公布塌桥原因",
"我在机场入口处",
"邢永臣摄影报道",
"BP神经网络如何训练才能在分类时增加区分度",
"南京市长江大桥",
"应一些使用者的建议也为了便于利用NiuTrans用于SMT研究",
'长春市长春药店',
'邓颖超生前最喜欢的衣服',
'胡锦涛是热爱世界和平的政治局常委',
'程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪',
'一次性交多少钱',
'两块五一套,三块八一斤,四块七一本,五块六一条',
'小和尚留了一个像大和尚一样的和尚头',
'我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站',
'张晓梅去人民医院做了个B超然后去买了件T恤',
'AT&T是一件不错的公司给你发offer了吗',
'C++和c#是什么关系11+122=133是吗PI=3.14159',
'你认识那个和主席握手的的哥吗?他开一辆黑色的士。',
'枪杆子中出政权']
class JiebaTestCase(unittest.TestCase):
def setUp(self):
reload(jieba)
def tearDown(self):
pass
def testDefaultCut(self):
for content in test_contents:
result = jieba.cut(content)
assert isinstance(result, types.GeneratorType), "Test DefaultCut Generator error"
result = list(result)
assert isinstance(result, list), "Test DefaultCut error on content: %s" % content
print(" , ".join(result), file=sys.stderr)
print("testDefaultCut", file=sys.stderr)
def testCutAll(self):
for content in test_contents:
result = jieba.cut(content, cut_all=True)
assert isinstance(result, types.GeneratorType), "Test CutAll Generator error"
result = list(result)
assert isinstance(result, list), "Test CutAll error on content: %s" % content
print(" , ".join(result), file=sys.stderr)
print("testCutAll", file=sys.stderr)
def testSetDictionary(self):
jieba.set_dictionary("foobar.txt")
for content in test_contents:
result = jieba.cut(content)
assert isinstance(result, types.GeneratorType), "Test SetDictionary Generator error"
result = list(result)
assert isinstance(result, list), "Test SetDictionary error on content: %s" % content
print(" , ".join(result), file=sys.stderr)
print("testSetDictionary", file=sys.stderr)
def testCutForSearch(self):
for content in test_contents:
result = jieba.cut_for_search(content)
assert isinstance(result, types.GeneratorType), "Test CutForSearch Generator error"
result = list(result)
assert isinstance(result, list), "Test CutForSearch error on content: %s" % content
print(" , ".join(result), file=sys.stderr)
print("testCutForSearch", file=sys.stderr)
def testPosseg(self):
import jieba.posseg as pseg
for content in test_contents:
result = pseg.cut(content)
assert isinstance(result, types.GeneratorType), "Test Posseg Generator error"
result = list(result)
assert isinstance(result, list), "Test Posseg error on content: %s" % content
print(" , ".join([w.word + " / " + w.flag for w in result]), file=sys.stderr)
print("testPosseg", file=sys.stderr)
def testTokenize(self):
for content in test_contents:
result = jieba.tokenize(content)
assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
result = list(result)
assert isinstance(result, list), "Test Tokenize error on content: %s" % content
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]), file=sys.stderr)
print("testTokenize", file=sys.stderr)
def testDefaultCut_NOHMM(self):
for content in test_contents:
result = jieba.cut(content,HMM=False)
assert isinstance(result, types.GeneratorType), "Test DefaultCut Generator error"
result = list(result)
assert isinstance(result, list), "Test DefaultCut error on content: %s" % content
print(" , ".join(result), file=sys.stderr)
print("testDefaultCut_NOHMM", file=sys.stderr)
def testPosseg_NOHMM(self):
import jieba.posseg as pseg
for content in test_contents:
result = pseg.cut(content,HMM=False)
assert isinstance(result, types.GeneratorType), "Test Posseg Generator error"
result = list(result)
assert isinstance(result, list), "Test Posseg error on content: %s" % content
print(" , ".join([w.word + " / " + w.flag for w in result]), file=sys.stderr)
print("testPosseg_NOHMM", file=sys.stderr)
def testTokenize_NOHMM(self):
for content in test_contents:
result = jieba.tokenize(content,HMM=False)
assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
result = list(result)
assert isinstance(result, list), "Test Tokenize error on content: %s" % content
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]), file=sys.stderr)
print("testTokenize_NOHMM", file=sys.stderr)
def testCutForSearch_NOHMM(self):
for content in test_contents:
result = jieba.cut_for_search(content,HMM=False)
assert isinstance(result, types.GeneratorType), "Test CutForSearch Generator error"
result = list(result)
assert isinstance(result, list), "Test CutForSearch error on content: %s" % content
print(" , ".join(result), file=sys.stderr)
print("testCutForSearch_NOHMM", file=sys.stderr)
if __name__ == "__main__":
unittest.main()


@ -6,7 +6,7 @@ cat abc.txt | python jiebacmd.py | sort | uniq -c | sort -nr -k1 | head -100
'''
from __future__ import unicode_literals
import sys
sys.path.append("../")
@ -15,14 +15,14 @@ import jieba
default_encoding='utf-8'
if len(sys.argv)>1:
default_encoding = sys.argv[1]
default_encoding = sys.argv[1]
while True:
line = sys.stdin.readline()
if line=="":
break
line = line.strip()
for word in jieba.cut(line):
print word.encode(default_encoding)
line = sys.stdin.readline()
if line=="":
break
line = line.strip()
for word in jieba.cut(line):
print(word)

44
test/lyric.txt Normal file

@ -0,0 +1,44 @@
我沒有心
我沒有真實的自我
我只有消瘦的臉孔
所謂軟弱
所謂的順從一向是我
的座右銘
而我
沒有那海洋的寬闊
我只要熱情的撫摸
所謂空洞
所謂不安全感是我
的墓誌銘
而你
是否和我一般怯懦
是否和我一般矯作
和我一般囉唆
而你
是否和我一般退縮
是否和我一般肌迫
一般地困惑
我沒有力
我沒有滿腔的熱火
我只有滿肚的如果
所謂勇氣
所謂的認同感是我
隨便說說
而你
是否和我一般怯懦
是否和我一般矯作
是否對你來說
只是一場遊戲
雖然沒有把握
而你
是否和我一般退縮
是否和我一般肌迫
是否對你來說
只是逼不得已
雖然沒有藉口


@ -0,0 +1,34 @@
import sys
sys.path.append('../../')
import jieba
jieba.enable_parallel(4)
import jieba.analyse
from optparse import OptionParser
USAGE ="usage: python extract_tags.py [file name] -k [top k]"
parser = OptionParser(USAGE)
parser.add_option("-k",dest="topK")
opt, args = parser.parse_args()
if len(args) <1:
print(USAGE)
sys.exit(1)
file_name = args[0]
if opt.topK==None:
topK=10
else:
topK = int(opt.topK)
content = open(file_name,'rb').read()
tags = jieba.analyse.extract_tags(content,topK=topK)
print(",".join(tags))

99
test/parallel/test.py Normal file

@ -0,0 +1,99 @@
#encoding=utf-8
from __future__ import print_function
import sys
sys.path.append("../../")
import jieba
jieba.enable_parallel(4)
def cuttest(test_sent):
result = jieba.cut(test_sent)
for word in result:
print(word, "/", end=' ')
print("")
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')

95 test/parallel/test2.py Normal file

@ -0,0 +1,95 @@
#encoding=utf-8
from __future__ import print_function
import sys
sys.path.append("../../")
import jieba
jieba.enable_parallel(4)
def cuttest(test_sent):
result = jieba.cut(test_sent,cut_all=True)
for word in result:
print(word, "/", end=' ')
print("")
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')


@ -0,0 +1,95 @@
#encoding=utf-8
from __future__ import print_function
import sys
sys.path.append("../../")
import jieba
jieba.enable_parallel(4)
def cuttest(test_sent):
result = jieba.cut_for_search(test_sent)
for word in result:
print(word, "/", end=' ')
print("")
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')


@ -0,0 +1,95 @@
#encoding=utf-8
from __future__ import print_function
import sys
sys.path.append("../../")
import jieba
jieba.enable_parallel(4)
def cuttest(test_sent):
result = jieba.cut(test_sent, HMM=False)
for word in result:
print(word, "/", end=' ')
print("")
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')


@ -0,0 +1,20 @@
import sys
import time
sys.path.append("../../")
import jieba
jieba.enable_parallel()
url = sys.argv[1]
content = open(url,"rb").read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1
log_f = open("1.log","wb")
log_f.write(words.encode('utf-8'))
print('speed %s bytes/second' % (len(content)/tm_cost))
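
The parallel tests above all call jieba.enable_parallel(4) before cutting. A minimal sketch of switching parallel mode on and off around a batch job, assuming a POSIX platform (parallel mode forks worker processes and is reported not to work on Windows):

# Sketch: enable_parallel/disable_parallel around a batch cut.
# Four workers is an arbitrary choice for illustration.
import jieba

jieba.enable_parallel(4)                      # fork 4 worker processes
words = list(jieba.cut("我爱北京天安门。\n我需要廉租房。"))
jieba.disable_parallel()                      # back to single-process mode
print("/ ".join(words))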

100 test/parallel/test_pos.py Normal file

@ -0,0 +1,100 @@
#encoding=utf-8
from __future__ import print_function
import sys
sys.path.append("../../")
import jieba
jieba.enable_parallel(4)
import jieba.posseg as pseg
def cuttest(test_sent):
result = pseg.cut(test_sent)
for w in result:
print(w.word, "/", w.flag, ", ", end=' ')
print("")
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')


@ -0,0 +1,22 @@
from __future__ import print_function
import sys,time
import sys
sys.path.append("../../")
import jieba
import jieba.posseg as pseg
jieba.enable_parallel(4)
url = sys.argv[1]
content = open(url,"rb").read()
t1 = time.time()
words = list(pseg.cut(content))
t2 = time.time()
tm_cost = t2-t1
log_f = open("1.log","w")
log_f.write(' / '.join(map(str, words)))
print('speed' , len(content)/tm_cost, " bytes/second")


@ -3,91 +3,100 @@ import sys
sys.path.append("../")
import jieba
def cuttest(test_sent):
result = jieba.cut(test_sent)
for word in result:
print word, "/",
print ""
result = jieba.cut(test_sent)
print(" / ".join(result))
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('枪杆子中出政权')
cuttest('张三风同学走上了不归路')
cuttest('阿Q腰间挂着BB机手里拿着大哥大我一般吃饭不AA制的。')
cuttest('在1号店能买到小S和大S八卦的书还有3D电视。')
jieba.del_word('很赞')
cuttest('看上去iphone8手机样式很赞,售价699美元,销量涨了5%么?')


@ -1,93 +0,0 @@
#encoding=utf-8
import sys
sys.path.append("../")
import jieba
def cuttest(test_sent):
result = jieba.cut(test_sent,cut_all=True)
for word in result:
print word, "/",
print ""
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')

10 test/test_bug.py Normal file

@ -0,0 +1,10 @@
#encoding=utf-8
from __future__ import print_function
import sys
sys.path.append("../")
import jieba
import jieba.posseg as pseg
words=pseg.cut("又跛又啞")
for w in words:
print(w.word,w.flag)


@ -0,0 +1,28 @@
#encoding=utf-8
from __future__ import print_function
import sys
sys.path.append("../")
import jieba
def cuttest(test_sent):
result = jieba.cut(test_sent)
print(" ".join(result))
def testcase():
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
if __name__ == "__main__":
testcase()
jieba.set_dictionary("foobar.txt")
print("================================")
testcase()
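
The test above re-runs the same cases after jieba.set_dictionary("foobar.txt") to confirm that the main dictionary can be swapped at runtime. A minimal sketch, assuming a hypothetical dictionary file (one entry per line: word, optional frequency, optional POS tag):

# Sketch: pointing jieba at a custom main dictionary.
# "my_dict.txt" is a hypothetical path, not a file from this repository.
import jieba

jieba.set_dictionary("my_dict.txt")   # takes effect for subsequent cuts
jieba.initialize()                    # optional: load the dictionary eagerly
print("/".join(jieba.cut("我来到北京清华大学")))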


@ -1,93 +1,98 @@
#encoding=utf-8
from __future__ import print_function
import sys
sys.path.append("../")
import jieba
def cuttest(test_sent):
result = jieba.cut_for_search(test_sent)
for word in result:
print word, "/",
print ""
result = jieba.cut_for_search(test_sent)
for word in result:
print(word, "/", end=' ')
print("")
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')

101 test/test_cutall.py Normal file

@ -0,0 +1,101 @@
#encoding=utf-8
from __future__ import print_function
import sys
sys.path.append("../")
import jieba
def cuttest(test_sent):
result = jieba.cut(test_sent,cut_all=True)
for word in result:
print(word, "/", end=' ')
print("")
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
jieba.add_word('超敏C反应蛋白')
cuttest('超敏C反应蛋白是什么, java好学吗?,小潘老板都学Python')
cuttest('steel健身爆发力运动兴奋补充剂')


@ -1,20 +1,21 @@
import urllib2
import sys,time
import time
import sys
sys.path.append("../")
import jieba
jieba.initialize()
url = sys.argv[1]
content = open(url,"rb").read()
t1 = time.time()
words = list(jieba.cut(content))
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1
log_f = open("1.log","wb")
for w in words:
print >> log_f, w.encode("gbk"), "/" ,
log_f.write(words.encode('utf-8'))
log_f.close()
print 'speed' , len(content)/tm_cost, " bytes/second"
print('cost ' + str(tm_cost))
print('speed %s bytes/second' % (len(content)/tm_cost))

42 test/test_lock.py Normal file

@ -0,0 +1,42 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import jieba
import threading
def inittokenizer(tokenizer, group):
print('===> Thread %s:%s started' % (group, threading.current_thread().ident))
tokenizer.initialize()
print('<=== Thread %s:%s finished' % (group, threading.current_thread().ident))
tokrs1 = [jieba.Tokenizer() for n in range(5)]
tokrs2 = [jieba.Tokenizer('../extra_dict/dict.txt.small') for n in range(5)]
thr1 = [threading.Thread(target=inittokenizer, args=(tokr, 1)) for tokr in tokrs1]
thr2 = [threading.Thread(target=inittokenizer, args=(tokr, 2)) for tokr in tokrs2]
for thr in thr1:
thr.start()
for thr in thr2:
thr.start()
for thr in thr1:
thr.join()
for thr in thr2:
thr.join()
del tokrs1, tokrs2
print('='*40)
tokr1 = jieba.Tokenizer()
tokr2 = jieba.Tokenizer('../extra_dict/dict.txt.small')
thr1 = [threading.Thread(target=inittokenizer, args=(tokr1, 1)) for n in range(5)]
thr2 = [threading.Thread(target=inittokenizer, args=(tokr2, 2)) for n in range(5)]
for thr in thr1:
thr.start()
for thr in thr2:
thr.start()
for thr in thr1:
thr.join()
for thr in thr2:
thr.join()
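
test_lock.py initializes several independent Tokenizer instances from multiple threads to exercise the loading lock. A minimal sketch of the per-instance API it relies on, reusing the small-dictionary path from the test:

# Sketch: a standalone Tokenizer with its own dictionary, separate from the
# module-level default; the test drives initialize() from several threads.
import jieba

tk = jieba.Tokenizer("../extra_dict/dict.txt.small")
tk.initialize()                                # explicit here, otherwise lazy on first cut
print("/".join(tk.cut("我来到北京清华大学")))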

29 test/test_multithread.py Normal file

@ -0,0 +1,29 @@
#encoding=utf-8
import sys
import threading
sys.path.append("../")
import jieba
class Worker(threading.Thread):
def run(self):
seg_list = jieba.cut("我来到北京清华大学",cut_all=True)
print("Full Mode:" + "/ ".join(seg_list)) #全模式
seg_list = jieba.cut("我来到北京清华大学",cut_all=False)
print("Default Mode:" + "/ ".join(seg_list)) #默认模式
seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") #搜索引擎模式
print(", ".join(seg_list))
workers = []
for i in range(10):
worker = Worker()
workers.append(worker)
worker.start()
for worker in workers:
worker.join()

100 test/test_no_hmm.py Normal file

@ -0,0 +1,100 @@
#encoding=utf-8
import sys
sys.path.append("../")
import jieba
def cuttest(test_sent):
result = jieba.cut(test_sent,HMM=False)
print(" / ".join(result))
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('枪杆子中出政权')
cuttest('张三风同学走上了不归路')
cuttest('阿Q腰间挂着BB机手里拿着大哥大我一般吃饭不AA制的。')
cuttest('在1号店能买到小S和大S八卦的书还有3D电视。')
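
test_no_hmm.py runs the standard sentence set with HMM=False, i.e. without the Viterbi-based guessing of out-of-vocabulary words. A minimal sketch of the difference, using the example sentence from jieba's README:

# Sketch: HMM on vs. off. With HMM=False, words missing from the dictionary
# typically fall apart into single characters.
import jieba

sent = "他来到了网易杭研大厦"
print("/".join(jieba.cut(sent, HMM=True)))    # "杭研" can be recovered by the HMM
print("/".join(jieba.cut(sent, HMM=False)))   # dictionary-only segmentation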

102 test/test_paddle.py Normal file

@ -0,0 +1,102 @@
#encoding=utf-8
import sys
sys.path.append("../")
import jieba
jieba.enable_paddle()
def cuttest(test_sent):
result = jieba.cut(test_sent, use_paddle=True)
print(" / ".join(result))
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('枪杆子中出政权')
cuttest('张三风同学走上了不归路')
cuttest('阿Q腰间挂着BB机手里拿着大哥大我一般吃饭不AA制的。')
cuttest('在1号店能买到小S和大S八卦的书还有3D电视。')
jieba.del_word('很赞')
cuttest('看上去iphone8手机样式很赞,售价699美元,销量涨了5%么?')
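
test_paddle.py enables the paddle-based segmenter introduced in v0.40 and passes use_paddle=True to every cut. A minimal sketch, assuming paddlepaddle-tiny is importable (jieba.enable_paddle() checks for it and reports an error otherwise):

# Sketch: paddle-mode word segmentation.
import jieba

jieba.enable_paddle()
print("/".join(jieba.cut("我来到北京清华大学", use_paddle=True)))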

102 test/test_paddle_postag.py Normal file

@ -0,0 +1,102 @@
#encoding=utf-8
import sys
sys.path.append("../")
import jieba.posseg as pseg
import jieba
jieba.enable_paddle()
def cuttest(test_sent):
result = pseg.cut(test_sent, use_paddle=True)
for word, flag in result:
print('%s %s' % (word, flag))
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('枪杆子中出政权')
cuttest('张三风同学走上了不归路')
cuttest('阿Q腰间挂着BB机手里拿着大哥大我一般吃饭不AA制的。')
cuttest('在1号店能买到小S和大S八卦的书还有3D电视。')
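
test_paddle_postag.py does the same for part-of-speech tagging: pseg.cut(..., use_paddle=True) yields (word, flag) pairs using the paddle tag set. A minimal sketch under the same paddlepaddle assumption:

# Sketch: paddle-mode POS tagging.
import jieba
import jieba.posseg as pseg

jieba.enable_paddle()
for word, flag in pseg.cut("我爱北京天安门", use_paddle=True):
    print(word, flag)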


@ -1,93 +1,99 @@
#encoding=utf-8
from __future__ import print_function
import sys
sys.path.append("../")
import jieba.posseg as pseg
def cuttest(test_sent):
result = pseg.cut(test_sent)
for w in result:
print w.word, "/", w.flag, ", ",
print ""
result = pseg.cut(test_sent)
for word, flag in result:
print(word, "/", flag, ", ", end=' ')
print("")
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('枪杆子中出政权')


@ -1,7 +1,9 @@
import urllib2
import sys,time
from __future__ import print_function
import sys
import time
sys.path.append("../")
import jieba
jieba.initialize()
import jieba.posseg as pseg
url = sys.argv[1]
@ -12,9 +14,8 @@ words = list(pseg.cut(content))
t2 = time.time()
tm_cost = t2-t1
log_f = open("1.log","wb")
for w in words:
print >> log_f, w.encode("gbk"), "/" ,
log_f = open("1.log","w")
log_f.write(' / '.join(map(str, words)))
print 'speed' , len(content)/tm_cost, " bytes/second"
print('speed' , len(content)/tm_cost, " bytes/second")

99 test/test_pos_no_hmm.py Normal file

@ -0,0 +1,99 @@
#encoding=utf-8
from __future__ import print_function
import sys
sys.path.append("../")
import jieba.posseg as pseg
def cuttest(test_sent):
result = pseg.cut(test_sent, HMM=False)
for word, flag in result:
print(word, "/", flag, ", ", end=' ')
print("")
if __name__ == "__main__":
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('枪杆子中出政权')

106 test/test_tokenize.py Normal file

@ -0,0 +1,106 @@
#encoding=utf-8
from __future__ import print_function,unicode_literals
import sys
sys.path.append("../")
import jieba
g_mode="default"
def cuttest(test_sent):
global g_mode
result = jieba.tokenize(test_sent,mode=g_mode)
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
if __name__ == "__main__":
for m in ("default","search"):
g_mode = m
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('枪杆子中出政权')
cuttest('张三风同学走上了不归路')
cuttest('阿Q腰间挂着BB机手里拿着大哥大我一般吃饭不AA制的。')
cuttest('在1号店能买到小S和大S八卦的书。')
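
test_tokenize.py exercises jieba.tokenize in both "default" and "search" modes; each result is a (word, start, end) tuple of character offsets into the input string. A minimal sketch:

# Sketch: offset-preserving tokenization; mode="search" additionally emits
# sub-words of long tokens for search-engine style indexing.
import jieba

for word, start, end in jieba.tokenize("永和服装饰品有限公司", mode="search"):
    print(word, start, end)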


@ -0,0 +1,106 @@
#encoding=utf-8
from __future__ import print_function,unicode_literals
import sys
sys.path.append("../")
import jieba
g_mode="default"
def cuttest(test_sent):
global g_mode
result = jieba.tokenize(test_sent,mode=g_mode,HMM=False)
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
if __name__ == "__main__":
for m in ("default","search"):
g_mode = m
cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空我爱北京我爱Python和C++。")
cuttest("我不喜欢日本和服。")
cuttest("雷猴回归人间。")
cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")
cuttest("我需要廉租房")
cuttest("永和服装饰品有限公司")
cuttest("我爱北京天安门")
cuttest("abc")
cuttest("隐马尔可夫")
cuttest("雷猴是个好网站")
cuttest("“Microsoft”一词由“MICROcomputer微型计算机”和“SOFTware软件”两部分组成")
cuttest("草泥马和欺实马是今年的流行词汇")
cuttest("伊藤洋华堂总府店")
cuttest("中国科学院计算技术研究所")
cuttest("罗密欧与朱丽叶")
cuttest("我购买了道具和服装")
cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍")
cuttest("湖北省石首市")
cuttest("湖北省十堰市")
cuttest("总经理完成了这件事情")
cuttest("电脑修好了")
cuttest("做好了这件事情就一了百了了")
cuttest("人们审美的观点是不同的")
cuttest("我们买了一个美的空调")
cuttest("线程初始化时我们要注意")
cuttest("一个分子是由好多原子组织成的")
cuttest("祝你马到功成")
cuttest("他掉进了无底洞里")
cuttest("中国的首都是北京")
cuttest("孙君意")
cuttest("外交部发言人马朝旭")
cuttest("领导人会议和第四届东亚峰会")
cuttest("在过去的这五年")
cuttest("还需要很长的路要走")
cuttest("60周年首都阅兵")
cuttest("你好人们审美的观点是不同的")
cuttest("买水果然后来世博园")
cuttest("买水果然后去世博园")
cuttest("但是后来我才知道你是对的")
cuttest("存在即合理")
cuttest("的的的的的在的的的的就以和和和")
cuttest("I love你不以为耻反以为rong")
cuttest("")
cuttest("")
cuttest("hello你好人们审美的观点是不同的")
cuttest("很好但主要是基于网页形式")
cuttest("hello你好人们审美的观点是不同的")
cuttest("为什么我不能拥有想要的生活")
cuttest("后来我才")
cuttest("此次来中国是为了")
cuttest("使用了它就可以解决一些问题")
cuttest(",使用了它就可以解决一些问题")
cuttest("其实使用了它就可以解决一些问题")
cuttest("好人使用了它就可以解决一些问题")
cuttest("是因为和国家")
cuttest("老年搜索还支持")
cuttest("干脆就把那部蒙人的闲法给废了拉倒RT @laoshipukong : 27日全国人大常委会第三次审议侵权责任法草案删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ")
cuttest("")
cuttest("")
cuttest("他说的确实在理")
cuttest("长春市长春节讲话")
cuttest("结婚的和尚未结婚的")
cuttest("结合成分子时")
cuttest("旅游和服务是最好的")
cuttest("这件事情的确是我的错")
cuttest("供大家参考指正")
cuttest("哈尔滨政府公布塌桥原因")
cuttest("我在机场入口处")
cuttest("邢永臣摄影报道")
cuttest("BP神经网络如何训练才能在分类时增加区分度")
cuttest("南京市长江大桥")
cuttest("应一些使用者的建议也为了便于利用NiuTrans用于SMT研究")
cuttest('长春市长春药店')
cuttest('邓颖超生前最喜欢的衣服')
cuttest('胡锦涛是热爱世界和平的政治局常委')
cuttest('程序员祝海林和朱会震是在孙健的左面和右面, 范凯在最右面.再往左是李松洪')
cuttest('一次性交多少钱')
cuttest('两块五一套,三块八一斤,四块七一本,五块六一条')
cuttest('小和尚留了一个像大和尚一样的和尚头')
cuttest('我是中华人民共和国公民;我爸爸是共和党党员; 地铁和平门站')
cuttest('张晓梅去人民医院做了个B超然后去买了件T恤')
cuttest('AT&T是一件不错的公司给你发offer了吗')
cuttest('C++和c#是什么关系11+122=133是吗PI=3.14159')
cuttest('你认识那个和主席握手的的哥吗?他开一辆黑色的士。')
cuttest('枪杆子中出政权')
cuttest('张三风同学走上了不归路')
cuttest('阿Q腰间挂着BB机手里拿着大哥大我一般吃饭不AA制的。')
cuttest('在1号店能买到小S和大S八卦的书。')

48 test/test_userdict.py Normal file

@ -0,0 +1,48 @@
#encoding=utf-8
from __future__ import print_function, unicode_literals
import sys
sys.path.append("../")
import jieba
jieba.load_userdict("userdict.txt")
import jieba.posseg as pseg
jieba.add_word('石墨烯')
jieba.add_word('凱特琳')
jieba.del_word('自定义词')
test_sent = (
"李小福是创新办主任也是云计算方面的专家; 什么是八一双鹿\n"
"例如我输入一个带“韩玉赏鉴”的标题在自定义词库中也增加了此词为N类\n"
"「台中」正確應該不會被切開。mac上可分出「石墨烯」此時又可以分出來凱特琳了。"
)
words = jieba.cut(test_sent)
print('/'.join(words))
print("="*40)
result = pseg.cut(test_sent)
for w in result:
print(w.word, "/", w.flag, ", ", end=' ')
print("\n" + "="*40)
terms = jieba.cut('easy_install is great')
print('/'.join(terms))
terms = jieba.cut('python 的正则表达式是好用的')
print('/'.join(terms))
print("="*40)
# test frequency tune
testlist = [
('今天天气不错', ('今天', '天气')),
('如果放到post中将出错。', ('', '')),
('我们中出了一个叛徒', ('', '')),
]
for sent, seg in testlist:
print('/'.join(jieba.cut(sent, HMM=False)))
word = ''.join(seg)
print('%s Before: %s, After: %s' % (word, jieba.get_FREQ(word), jieba.suggest_freq(seg, True)))
print('/'.join(jieba.cut(sent, HMM=False)))
print("-"*40)

64 test/test_whoosh.py Normal file

@ -0,0 +1,64 @@
# -*- coding: UTF-8 -*-
from __future__ import unicode_literals
import sys,os
sys.path.append("../")
from whoosh.index import create_in,open_dir
from whoosh.fields import *
from whoosh.qparser import QueryParser
from jieba.analyse.analyzer import ChineseAnalyzer
analyzer = ChineseAnalyzer()
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True, analyzer=analyzer))
if not os.path.exists("tmp"):
os.mkdir("tmp")
ix = create_in("tmp", schema) # for create new index
#ix = open_dir("tmp") # for read only
writer = ix.writer()
writer.add_document(
title="document1",
path="/a",
content="This is the first document weve added!"
)
writer.add_document(
title="document2",
path="/b",
content="The second one 你 中文测试中文 is even more interesting! 吃水果"
)
writer.add_document(
title="document3",
path="/c",
content="买水果然后来世博园。"
)
writer.add_document(
title="document4",
path="/c",
content="工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作"
)
writer.add_document(
title="document4",
path="/c",
content="咱俩交换一下吧。"
)
writer.commit()
searcher = ix.searcher()
parser = QueryParser("content", schema=ix.schema)
for keyword in ("水果世博园","","first","中文","交换机","交换"):
print("result of ",keyword)
q = parser.parse(keyword)
results = searcher.search(q)
for hit in results:
print(hit.highlights("content"))
print("="*10)
for t in analyzer("我的好朋友是李明;我爱北京天安门;IBM和Microsoft; I have a dream. this is interesting and interested me a lot"):
print(t.text)
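
Outside the Whoosh index, the same ChineseAnalyzer can be called directly, as the last loop above does; it yields token objects whose .text holds each segmented word. A minimal sketch:

# Sketch: using the Whoosh-compatible analyzer on its own.
from jieba.analyse.analyzer import ChineseAnalyzer

analyzer = ChineseAnalyzer()
for token in analyzer("我的好朋友是李明,我爱北京天安门"):
    print(token.text)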

Some files were not shown because too many files have changed in this diff.