65 Commits

Author SHA1 Message Date
Dingyuan Wang
bb1e6000c6 fix version; fix spaces at end of line 2014-10-19 10:57:46 +08:00
Dingyuan Wang
51df77831b use prefix dict instead of trie, add a command line interface, and a few small improvements 2014-10-18 22:23:26 +08:00
Fukuball Lin
b658ee69cb Let jieba users add their own stop words corpus
1. Add a sample stop words corpus
2. Add a set_stop_words method and rewrite extract_tags so that jieba can switch stop words corpora
3. Add the extract_tags_stop_words.py test example under test
2014-08-06 03:35:16 +08:00
Fukuball Lin
7198d562f1 Let jieba switch idf corpora
1. Add a Traditional Chinese idf corpus
2. Add get_idf and set_idf_path methods and rewrite extract_tags so that jieba can switch idf corpora
3. Add extract_tags_idfpath under test
2014-08-05 22:55:13 +08:00
fxsjy
18678d50c6 fix bug issue #132 2014-01-28 13:48:03 +08:00
gan
31d5845535 add better support for English, e.g. input 'this is interesting and interested me' --> output 'this interest interest', where 'interest' matches both 'interesting' and 'interested' 2013-09-09 11:54:30 +08:00
Sun Junyi
7e7fcc1184 add an option to disable HMM 2013-09-05 17:09:27 +08:00
Sun Junyi
81390a2d23 test_file.py: close the file object 2013-08-02 15:51:33 +08:00
fxsjy
b77645b3aa modify test_file.py; use less memory 2013-07-29 10:17:39 +08:00
Linker Lin
5d83855088 Automatically detect the number of CPUs and start a suitable number of processes. 2013-07-28 00:12:00 +08:00
Linker Lin
2ceb981da0 Automatically detect the number of CPUs and start a suitable number of processes. 2013-07-28 00:07:29 +08:00
Sun Junyi
9d0ea771a5 fix bug; decimals & digit-english mixed 2013-07-05 16:16:49 +08:00
Sun Junyi
ba5114dc95 update whoosh example 2013-07-04 09:31:09 +08:00
Sun Junyi
f424862222 clean the files in tmp 2013-07-03 17:55:01 +08:00
Sun Junyi
b18d56d2a3 Merge pull request #72 from linkerlin/master
Add a tmp directory so that test_whoosh.py can run.
2013-07-03 02:52:46 -07:00
miao.lin
becd32b178 made test_whoosh.py happy.
Add a tmp directory so that test_whoosh.py can run.
2013-07-03 17:32:35 +08:00
Sun Junyi
45daf561c7 follow PEP8: change tab to 4 white spaces 2013-07-03 16:58:22 +08:00
Sun Junyi
dbec3ad9df add some comments 2013-07-01 11:20:56 +08:00
Sun Junyi
efc784312c add ChineseAnalyzer for whoosh search engine 2013-07-01 10:53:39 +08:00
Sun Junyi
f08690a2df add 'search mode' for jieba.tokenize 2013-06-28 12:04:16 +08:00
Sun Junyi
cb1b0499f7 unittest for jieba.tokenize 2013-06-24 16:20:04 +08:00
Sun Junyi
11a3b10755 new method: jieba.tokenize 2013-06-24 16:14:11 +08:00
Sun Junyi
c0816b9bb0 more mixed words 2013-06-18 18:09:55 +08:00
Sun Junyi
c9e8da9e63 add more mix words to dict.txt 2013-06-18 14:10:36 +08:00
fxsjy
0087a4e7e3 adjust prob_trans for better support of named entities; fix some bad cases 2013-06-07 13:59:36 +08:00
Sun Junyi
4300f79788 add an example of using sklearn+jieba 2013-05-17 09:35:12 +08:00
Sun Junyi
a8f902545c fix some bad cases 2013-05-15 18:21:08 +08:00
cloudaice
9ee20a5293 add generator test 2013-05-11 22:50:30 +02:00
cloudaice
0c050b5eb2 add jieba.posseg test case 2013-05-11 17:40:43 +02:00
cloudaice
b0f9e6721e Add cutall test case 2013-05-11 17:40:43 +02:00
cloudaice
a7ff398edc Add three test cases: cut, set_dictionary, cut_for_search 2013-05-11 17:40:43 +02:00
cloudaice
667203a9ae Replace tabs with spaces; use join instead of a loop 2013-05-11 17:40:43 +02:00
cloudaice
a2d2078465 Replace tabs with spaces; use is to check whether an object is None 2013-05-11 17:40:42 +02:00
cloudaice
e0434871eb Reformat the code in demo.py to conform to PEP 8 2013-05-11 17:40:42 +02:00
Sun Junyi
c1bf815343 update test case 2013-05-02 17:01:16 +08:00
Sun Junyi
94d455b079 hot fix of cut_all=True 2013-04-27 10:23:01 +08:00
Sun Junyi
59d5d3b811 fix bug and change version 2013-04-27 09:45:39 +08:00
fxsjy
8666428fb0 fix a bug of changing dictionary 2013-04-26 16:47:00 +08:00
fxsjy
9bebe6120b utf-8 output is more friendly to Linux 2013-04-26 16:19:00 +08:00
Sun Junyi
d3339633d5 in the speed test: initialize first to ignore the time of dict loading 2013-04-26 14:51:58 +08:00
fxsjy
bc049090a5 make lazy load thread safe 2013-04-26 12:54:05 +08:00
fxsjy
b46166f768 use CRLF as separator to make chunks in parallel mode 2013-04-20 18:46:04 +08:00
fxsjy
6b83593b5a rm stub.log 2013-04-20 14:13:10 +08:00
fxsjy
62cf22121f new feature: parallel segment with multiprocessing 2013-04-20 14:11:31 +08:00
Sun Junyi
8d89e8afda handle 的 2013-04-19 10:02:33 +08:00
fxsjy
45591bb9ab support flag '_'; ignore white space 2013-04-12 21:53:03 +08:00
Sun Junyi
94ad7e7035 support decimal point 2013-04-08 09:53:04 +08:00
Sun Junyi
a383f035ba support decimal point: example PI=3.14159 => PI / = / 3.14159 2013-04-08 09:38:49 +08:00
Sun Junyi
8e49199993 keep punctuation marks 2013-04-05 21:48:36 +08:00
Sun Junyi
58c363655c support user defined word tag 2013-03-25 17:28:37 +08:00