104 Commits

Author SHA1 Message Date
Dingyuan Wang
99d0fb1a8a use regex and fix encoding related issues in load_userdict 2015-11-09 20:54:50 +08:00
Dingyuan Wang
ceb5c26be4 fix self.FREQ in cut_for_search; make pair object iterable 2015-06-01 14:36:38 +08:00
Dingyuan Wang
3b76328f2a allow ignoring word frequency while providing pos tag 2015-05-23 21:51:00 +08:00
Dingyuan Wang
94840a734c wraps most globals in classes
API changes:
* class jieba.Tokenizer, jieba.posseg.POSTokenizer
* class jieba.analyse.TFIDF, jieba.analyse.TextRank
* global functions are mapped to jieba.(posseg.)dt, the default (POS)Tokenizer
* multiprocessing only works with jieba.(posseg.)dt
* new lcut, lcut_for_search functions that return a list
* jieba.analyse.textrank now returns 20 items by default

Tests:
* added test_lock.py to test multithread locking
* demo.py now contains most of the examples in README
2015-05-09 21:29:05 +08:00
Dingyuan Wang
4a552ca94f suggest word frequency, support passing str to add_word 2015-03-14 12:44:19 +08:00
Dingyuan Wang
872a7039f2 Merge branch 'master' of https://github.com/fxsjy/jieba 2015-02-12 10:33:56 +08:00
Dingyuan Wang
f808ea0ebb use only one dict to store words and prefixes 2015-02-12 10:31:52 +08:00
fxsjy
5bfa43a781 fix test scripts 2015-02-11 20:46:48 +08:00
Dingyuan Wang
f3a53dd2da fix print() in tests 2015-02-11 20:45:55 +08:00
fxsjy
8cbb26a7b6 fix test_file.py 2015-02-11 16:47:57 +08:00
Dingyuan Wang
22bcf8be7a Merge master and jieba3k, make the code Python 2/3 compatible 2015-02-10 20:54:55 +08:00
Dingyuan Wang
3dad899ec8 backport 2to3 scripts and changelog 2014-11-29 16:12:25 +08:00
Dingyuan Wang
c6b386f65b update jieba3k 2014-11-29 16:06:20 +08:00
Dingyuan Wang
a5ecf70f71 update to v0.35 2014-11-14 20:59:54 +08:00
Dingyuan Wang
4a6140081e fix problems in auto2to3 2014-11-07 23:47:57 +08:00
Dingyuan Wang
7a6caa0c3c port extract_tags, etc to jieba3k; add auto2to3 script 2014-11-07 23:33:31 +08:00
walkskyer
6772f0282e fix wrong call order in the output of the weighted test script 2014-11-06 22:24:43 +08:00
Dingyuan Wang
fd9f1f2c0e update README, textrank, etc. 2014-10-25 14:23:37 +08:00
fxsjy
f5ca87e088 merge change of @fukuball 2014-10-23 15:59:08 +08:00
Dingyuan Wang
bb1e6000c6 fix version; fix spaces at end of line 2014-10-19 10:57:46 +08:00
Dingyuan Wang
51df77831b use prefix dict instead of trie, add a command line interface, and a few small improvements 2014-10-18 22:23:26 +08:00
Dingyuan Wang
6fad5fbb2c update to v0.33 2014-09-06 23:28:47 +08:00
Fukuball Lin
b658ee69cb let jieba add its own stop words corpus
1. add an example stop words corpus
2. to let jieba switch stop words corpora, add the set_stop_words method and rewrite extract_tags
3. add the extract_tags_stop_words.py test example to test
2014-08-06 03:35:16 +08:00
Fukuball Lin
7198d562f1 let jieba switch idf corpora
1. add a Traditional Chinese idf corpus
2. to let jieba switch idf corpora, add the get_idf and set_idf_path methods and rewrite extract_tags
3. add extract_tags_idfpath to test
2014-08-05 22:55:13 +08:00
Dingyuan Wang
c04ccd0d12 Update to v0.32 according to the master branch. 2014-06-14 22:31:13 +08:00
fxsjy
18678d50c6 fix bug issue #132 2014-01-28 13:48:03 +08:00
gan
31d5845535 add better support for English, e.g. input 'this is interesting and interested me' --> output 'this interest interest', where 'interest' matches 'interesting' and 'interested' 2013-09-09 11:54:30 +08:00
Sun Junyi
7e7fcc1184 add an option to disable HMM 2013-09-05 17:09:27 +08:00
ZoeyYoung
d49542c06e fix bug 2013-08-21 19:31:12 +08:00
ZoeyYoung
dce353f88b merge from master 2013-08-21 15:32:46 +08:00
ZoeyYoung
2857ae45cc Merge branch 'master' into jieba3k
Conflicts:
	Changelog
	jieba/__init__.py
	jieba/finalseg/__init__.py
	jieba/posseg/__init__.py
	setup.py
	test/parallel/test_file.py
	test/test_file.py
2013-08-21 13:55:21 +08:00
Sun Junyi
81390a2d23 test_file.py: close the file object 2013-08-02 15:51:33 +08:00
fxsjy
b77645b3aa modify test_file.py; use less memory 2013-07-29 10:17:39 +08:00
Linker Lin
5d83855088 automatically detect the number of CPUs and start a suitable number of processes. 2013-07-28 00:12:00 +08:00
Linker Lin
2ceb981da0 automatically detect the number of CPUs and start a suitable number of processes. 2013-07-28 00:07:29 +08:00
Sun Junyi
6549deabbd merge change from master 2013-07-16 11:06:41 +08:00
Cheng wei
6035bb6320 fix invalid syntax for python3 2013-07-06 02:52:17 +08:00
Sun Junyi
9d0ea771a5 fix bug; decimals & digit-english mixed 2013-07-05 16:16:49 +08:00
Sun Junyi
ba5114dc95 update whoosh example 2013-07-04 09:31:09 +08:00
Sun Junyi
f424862222 clean the files in tmp 2013-07-03 17:55:01 +08:00
Sun Junyi
b18d56d2a3 Merge pull request #72 from linkerlin/master
add a tmp directory so that test_whoosh.py can run.
2013-07-03 02:52:46 -07:00
Sun Junyi
b9b1f1a418 fix conflict of merging 2013-07-03 17:47:45 +08:00
miao.lin
becd32b178 made test_whoosh.py happy.
add a tmp directory so that test_whoosh.py can run.
2013-07-03 17:32:35 +08:00
Sun Junyi
c01680c6a8 merge the new file 2013-07-03 17:29:33 +08:00
Sun Junyi
b62f052927 PEP8 2013-07-03 17:21:21 +08:00
Sun Junyi
45daf561c7 follow PEP8: change tab to 4 white spaces 2013-07-03 16:58:22 +08:00
Sun Junyi
dbec3ad9df add some comments 2013-07-01 11:20:56 +08:00
Sun Junyi
efc784312c add ChineseAnalyzer for whoosh search engine 2013-07-01 10:53:39 +08:00
Sun Junyi
f08690a2df add 'search mode' for jieba.tokenize 2013-06-28 12:04:16 +08:00
Sun Junyi
cb1b0499f7 unittest for jieba.tokenize 2013-06-24 16:20:04 +08:00