Compare commits

...

711 Commits
v2.1 ... master

Author SHA1 Message Date
Yanyi Wu
294755fab1 build: refine CMakeLists.txt by removing unnecessary conditions and options
- Eliminated the default installation prefix condition to streamline the configuration.
- Simplified the test build logic by ensuring tests are enabled only for top-level projects.
- Cleaned up redundant code for better readability and maintainability.
2025-05-03 07:43:25 +08:00
Yanyi Wu
714a297823 build: update CMakeLists.txt to include additional directories for test configuration
- Added include directories for the current binary and test directories to improve test file accessibility.
- Ensured proper configuration for test paths in the build process.
2025-05-02 23:47:37 +08:00
Yanyi Wu
c14131e3e2 refactor: clean up load_test.cpp by removing unused dependencies and tests
- Removed unused Jieba test and associated includes from load_test.cpp.
- Simplified main function to focus on essential operations.
- Ensured consistent exit handling by returning EXIT_SUCCESS.
2025-05-02 23:41:53 +08:00
Yanyi Wu
9cd64a1694 build: enhance test configuration and path management
- Added configuration for test paths in CMake to simplify file references.
- Updated load_test.cpp and various unit tests to use defined path macros for dictionary and test data files.
- Introduced test_paths.h.in to manage directory paths consistently across tests.
2025-05-02 23:33:18 +08:00
Yanyi Wu
aa410a69bb build: simplify test configuration in CMakeLists.txt
- Removed conditional check for MSVC when adding test commands.
- Ensured that test commands are always added regardless of the compiler.
2025-05-02 21:39:18 +08:00
Yanyi Wu
b5dc8e7a35 build: update .gitignore and CMakeLists for test configuration
- Added entries to .gitignore for temporary test files.
- Included a message to display MSVC value during build.
- Added UTF-8 compile option for MSVC in unittest CMakeLists.
2025-05-02 21:28:28 +08:00
Yanyi Wu
8141d8f434
Merge pull request #200 from yanyiwu/dev
fix: remove outdated entry from jieba dictionary
2025-05-02 17:31:29 +08:00
yanyiwu
9d8af2116e build: update CI workflow to include latest OS versions 2025-05-02 11:53:33 +08:00
yanyiwu
2185315643 fix: remove outdated entry from jieba dictionary 2025-05-02 11:38:31 +08:00
yanyiwu
340de007f9 docs: update README.md 2025-04-13 18:59:44 +08:00
yanyiwu
940ea02eb4 deps: upgrade limonp from v1.0.0 to v1.0.1 2025-04-12 17:54:01 +08:00
yanyiwu
3732abc0e5 docs: update CHANGELOG for v5.5.0 2025-04-12 10:07:40 +08:00
yanyiwu
9cda7f33e8 build: upgrade googletest from 1.11.0 to 1.12.1 2025-04-12 10:02:10 +08:00
Yanyi Wu
338603b676
Merge pull request #196 from ahmadov/ahmadov/fix-ns-2
avoid implicit namespaces
2025-04-11 08:59:41 +08:00
Elmi Ahmadov
d93dda397c avoid implicit namespaces
This PR fixes the ambiguous `partial_sort` in KeywordExtractor.hpp.
We also have a definition for it, so the compiler is confused about which
implementation should be used. To fix it, we can use the `std` namespace
explicitly.

Also, use the `std` namespace for the other data structures and include
their headers.
2025-04-10 19:10:05 +02:00
Yanyi Wu
7730deee52
Merge pull request #195 from ahmadov/ahmadov/fix-ns
fix missing includes and make namespaces explicit
2025-04-10 23:01:18 +08:00
Elmi Ahmadov
588860b5b6 fix missing includes and make namespaces explicit 2025-04-10 16:11:20 +02:00
Yanyi Wu
0523949aa8
Update stale-issues.yml 2025-04-05 17:26:58 +08:00
Yanyi Wu
b11fd29697
Update README.md 2025-03-08 17:33:48 +08:00
yanyiwu
15b8086a2a Add CMake workflow for Windows ARM64 builds
This commit introduces a new GitHub Actions workflow for building and testing CMake projects on Windows ARM64. The workflow includes steps for checking out the repository, configuring CMake with multiple C++ standards, building the project, and running tests. This enhancement supports continuous integration for ARM64 architecture, improving the project's build versatility.
2025-01-18 20:58:17 +08:00
yanyiwu
1d74caf705 Update CMake minimum version requirement to 3.10 2025-01-18 20:47:06 +08:00
Yanyi Wu
0c7c5228d0
Update README.md 2025-01-17 23:47:09 +08:00
yanyiwu
016fc17575 Improve error logging for UTF-8 decoding failures across cppjieba components. Updated error messages in DictTrie, PosTagger, PreFilter, and SegmentBase to provide clearer context on the specific input causing the failure. This change enhances the debugging experience when handling UTF-8 encoded strings. 2024-12-08 17:26:28 +08:00
yanyiwu
39fc58f081 Remove macOS 12 from CI workflow in cmake.yml 2024-12-08 17:03:39 +08:00
yanyiwu
42a93a4b98 Refactor decoding functions to use UTF-8 compliant methods
Updated multiple files to replace instances of DecodeRunesInString with DecodeUTF8RunesInString, ensuring proper handling of UTF-8 encoded strings. This change enhances the robustness of string decoding across the cppjieba library, including updates in DictTrie, HMMModel, PosTagger, PreFilter, SegmentBase, and Unicode files. Additionally, corresponding unit tests have been modified to reflect these changes.
2024-12-08 16:46:24 +08:00
yanyiwu
5ee74d788e [stale-issues] Monthly on the 3rd day of the month at midnight 2024-11-03 17:22:28 +08:00
yanyiwu
9b45e084a3 v5.4.0 2024-09-22 10:02:53 +08:00
yanyiwu
aa1def5ddb class Jieba unittest: add default argument input 2024-09-22 09:43:04 +08:00
yanyiwu
732812cdfb class Jieba: support default dictpath 2024-09-22 09:38:31 +08:00
yanyiwu
6e167a30dd cmake: avoid testing when FetchContent by other project 2024-09-22 00:25:23 +08:00
yanyiwu
5ef74f335a Revert "cmake: enable windows/msvc test"
This reverts commit 63392627552b018ea018848c82965c263b0030fa.
2024-09-21 23:58:59 +08:00
yanyiwu
6339262755 cmake: enable windows/msvc test 2024-09-21 21:49:56 +08:00
yanyiwu
cc58d4f858 DictTrie: removed unused var 2024-09-21 21:29:55 +08:00
yanyiwu
dbebc7cacb cmake: enable windows/msvc test 2024-09-21 21:10:53 +08:00
yanyiwu
e5b98af199 v5.3.2 2024-09-21 20:45:46 +08:00
yanyiwu
e521f26456 removed test/demo.cpp and linked https://github.com/yanyiwu/cppjieba-demo 2024-09-21 17:26:19 +08:00
Yanyi Wu
30aaf7b9ad
Update Demo Link in README.md 2024-09-21 17:21:54 +08:00
yanyiwu
84bca4bc50 [github/actions] stale 1 year ago issues 2024-09-14 21:49:46 +08:00
yanyiwu
3c8663472b [github/actions] stale 3 years ago issues 2024-09-14 21:37:25 +08:00
yanyiwu
12341a2f21 [stale issues] Run weekly on Sunday at midnight 2024-09-11 21:41:15 +08:00
yanyiwu
165bee901c [github/actions] stale issues 2024-09-07 20:58:52 +08:00
yanyiwu
e691b631b2 limonp v0.9.0 -> v1.0.0 2024-09-07 17:21:59 +08:00
yanyiwu
31dfe0f9d0 v5.3.1 2024-08-17 17:21:44 +08:00
yanyiwu
a110ab10cc [cmake] fetch googletest 2024-08-16 10:13:07 +08:00
yanyiwu
fe88bd29ac [submodules] rm test/googletest 2024-08-16 10:08:36 +08:00
yanyiwu
00c8f8fa84 v5.3.0 2024-08-10 22:00:50 +08:00
yanyiwu
90174da597 [c++17,c++20] compatibility 2024-08-10 21:50:01 +08:00
yanyiwu
a7adc22a6e limonp version 0.6.7 -> 0.9.0 2024-08-10 21:47:27 +08:00
yanyiwu
30ab2c3860 v5.2.0 2024-07-28 23:25:58 +08:00
yanyiwu
3748b56928 [README] platform updated 2024-07-28 23:18:53 +08:00
yanyiwu
79a235223d [CI] windows-[2019,2022] 2024-07-28 23:16:11 +08:00
yanyiwu
c39fd30f93 [googletest] v1.6.0->v1.10.0 2024-07-28 22:41:20 +08:00
yanyiwu
f4c87c2ff4 [CI] ubuntu version from 20 to 22, macos version from 12 to 14 2024-07-28 22:32:46 +08:00
yanyiwu
f8d063101c [CMake] minimum_required 2.6->3.5 and fix CXX_VERSION variable passed from cmd 2024-07-27 19:24:57 +08:00
yanyiwu
732fec41e6 [CI] matrix and multi cpp version 2024-07-27 18:58:36 +08:00
yanyiwu
bc162dbd84 v5.1.3 2024-07-22 22:53:53 +08:00
yanyiwu
8aad517375 git submodule add googletest-1.6.0 2024-07-22 22:36:45 +08:00
yanyiwu
4ec3204280 [Changelog] v5.1.2 2024-07-16 07:27:24 +08:00
yanyiwu
c83c7111ab README fix typo 2024-07-15 22:56:01 +08:00
yanyiwu
e334fc2ce0 [submodule:deps/limonp] upgrade to v0.6.7 2024-07-14 22:58:39 +08:00
yanyiwu
bc90a8276e rm useless code 2024-07-14 09:47:34 +08:00
yanyiwu
4b2f257a6a README_EN.md is useless in the AI age 2024-07-14 00:09:49 +08:00
unknown
7cf4502e01 commit from vscode 2024-07-13 20:03:58 +08:00
Yanyi Wu
a4e2b67017
Update README.md for windows env 2024-07-13 17:15:12 +08:00
Yanyi Wu
c22b6843d3
Update ChangeLog.md 2024-06-07 17:19:23 +08:00
Yanyi Wu
f4145fd08e
Merge pull request #186 from appotry/master
Fix compile errors
2024-05-01 17:05:12 +08:00
夜法之书(appotry)
f91776997b
Fix compile errors 2024-05-01 16:06:53 +08:00
wuyanyi
391121d5db release v5.1.0 2022-10-16 13:25:50 +08:00
Yanyi Wu
d869831996
Merge pull request #172 from yanyiwu/wyy
feature: add RemoveWord api from gojieba/pull/99
2022-10-16 13:24:40 +08:00
wuyanyi
03cc7c39ff feature: add RemoveWord api from https://github.com/yanyiwu/gojieba/pull/99 2022-10-16 13:17:19 +08:00
wuyanyi
302a367338 release v5.0.5 2022-10-16 12:43:01 +08:00
Yanyi Wu
1bf8e11833
Merge pull request #171 from yanyiwu/wyy
[submodule] update limonp to v0.6.6
2022-10-16 12:26:00 +08:00
wuyanyi
fc6e3f4294 [submodule] update limonp to v0.6.6 2022-10-16 12:10:35 +08:00
wuyanyi
db9b4b6813 [release v5.0.4] 2022-10-16 11:37:16 +08:00
Yanyi Wu
99b496d871
Merge pull request #168 from playgithub/limonp-as-submodule
limonp as submodule
2022-07-31 22:21:03 +08:00
abc
3269637644 remove .travis.yml 2022-07-31 21:25:29 +08:00
abc
fe9901858c remove appveyor.yml 2022-07-31 21:17:06 +08:00
abc
6a1d49d99b Update github workflows to checkout code with submodules 2022-07-22 10:41:50 +08:00
abc
8c93e0978d Update CMakeLists.txt to use limonp as a submodule 2022-07-22 10:36:35 +08:00
abc
01aba1d85d add submodule limonp 2022-07-22 10:34:52 +08:00
Yanyi Wu
194c144d8b
Merge pull request #166 from playgithub/add-license
Add MIT license
2022-07-10 21:51:23 +08:00
abc
1da8e7cde7 Fix end of line sequence, use LF instead of CRLF 2022-07-10 16:38:27 +08:00
abc
23ce5c7050 Add MIT license 2022-07-10 16:28:09 +08:00
Yanyi Wu
ef2b8f8b1b
Update README.md 2022-01-31 16:44:50 +08:00
Yanyi Wu
e81930b7c2
Update README.md 2022-01-09 21:07:50 +08:00
Yanyi Wu
466419dda0
Create cmake.yml 2022-01-09 18:25:10 +08:00
Yanyi Wu
acb2ecc125
Merge pull request #151 from wangfenjin/patch-2
add simple: sqlite3 fts5 tokenizer
2021-02-21 11:35:57 +08:00
Wang Fenjin
677b471d9f
add simple: sqlite3 fts5 tokenizer 2021-02-20 18:48:23 +08:00
Yanyi Wu
7a61f14b6b
Update README.md 2020-11-21 22:22:36 +08:00
Yanyi Wu
b799936d2f
Update README.md 2020-06-23 09:34:28 +08:00
Yanyi Wu
f66c9d1184
Merge pull request #145 from yanyiwu/yanyiwu-patch-1-1
add sponsorship
2020-06-21 01:15:59 +08:00
Yanyi Wu
1d4ffd4b3d
add sponsorship 2020-06-21 00:56:57 +08:00
yanyiwu
32fd1ef010 [release] v5.0.3 2020-03-11 09:30:52 +08:00
yanyiwu
7c046e393f Upgrade [limonp](https://github.com/yanyiwu/limonp) -> v0.6.3 2020-03-11 09:23:22 +08:00
yanyiwu
b6a1f5f21c [release] v5.0.2 2020-01-13 21:33:53 +08:00
yanyiwu
dfb9c1f010 Upgrade [limonp](https://github.com/yanyiwu/limonp) -> v0.6.1 2020-01-13 21:29:53 +08:00
Yanyi Wu
79ffd00979
Update README.md 2019-10-24 23:05:33 +08:00
yanyiwu
caf4c43ad6 release tag v5.0.1 2019-09-21 12:01:15 +08:00
Yanyi Wu
9aa7537096
Update README.md 2019-09-21 11:57:15 +08:00
Yanyi Wu
28b37b9cba
Merge pull request #134 from shove70/patch-1
Explanation for adding D language bindings.
2019-09-21 09:51:00 +08:00
shove70
4f04293261 Explanation for adding D language bindings. 2019-09-21 09:28:06 +08:00
Yanyi Wu
be6e91e6c3
Update README.md 2019-09-15 18:14:02 +08:00
Yanyi Wu
8a258dfaf4
Merge pull request #127 from byronhe/patch-2
remove duplicate #include
2019-09-15 16:54:42 +08:00
Yanyi Wu
f39192e983
Merge pull request #133 from byronhe/patch-5
fix typo
2019-09-04 22:54:46 +08:00
byronhe
55a94b417c
fix typo 2019-09-04 20:50:11 +08:00
Yanyi Wu
866d0e83b0
Merge pull request #129 from byronhe/patch-4
fix compile warning
2019-04-29 12:27:41 +08:00
byronhe
6444f4b226
fix compile warning 2019-04-29 12:18:03 +08:00
Yanyi Wu
7fc865760b
Merge pull request #128 from byronhe/patch-3
It would cause other code containing the symbol `print` to fail to compile
2019-03-16 13:46:52 +08:00
byronhe
f55b591968
It would cause other code containing the symbol `print` to fail to compile
2019-03-15 22:02:04 +08:00
byronhe
798b7b81c9
remove duplicate #include
2019-03-15 15:48:09 +08:00
Yanyi Wu
8fca7300a4
Merge pull request #126 from maliubiao/master
Fix C++ version detection compatibility issue
2019-03-02 23:42:53 +08:00
maliubiao
07382b9cb1
Fix C++ version detection compatibility issue
With C++11 and above, this branch takes the wrong path and incorrectly uses `using std::tr1::unordered_map`, causing an undefined symbol.
2019-03-02 16:02:02 +08:00
Yanyi Wu
31eed03518
Update README.md 2018-10-05 11:33:48 +08:00
Yanyi Wu
3dfdc426f0
Update README.md 2018-10-05 11:27:35 +08:00
Yanyi Wu
7b2fdc41a2
Merge pull request #113 from bung87/exposes_InsertUserWord_and_Find
Exposes insert user word and find
2018-06-09 19:51:12 +08:00
zhoupeng
985ccd646c Merge branch 'master' of https://github.com/yanyiwu/cppjieba into HEAD 2018-06-09 16:23:49 +08:00
zhoupeng
111fb007cf exposes InsertUserWord Find 2018-06-09 16:21:13 +08:00
Yanyi Wu
e6fdd1c98b
Merge pull request #112 from bung87/master
Make the interface consistent and complete
2018-06-08 23:13:44 +08:00
zhoupeng
1e1e585194 LoadUserDict by set,vector 2018-06-08 14:23:01 +08:00
zhoupeng
1066bc085e fix input type ,expose to Jieba 2018-06-08 01:32:47 +08:00
zhoupeng
d56e5c0659 InsertUserWord with freq arg,expose InserUserDictNode with vector<string> arg 2018-06-08 00:44:33 +08:00
Yanyi Wu
36be7fb900
Merge pull request #111 from bung87/master
Add note about the cppjieba-py extension
2018-06-07 23:19:06 +08:00
Yanyi Wu
bd368bc04d
Merge pull request #105 from Silencezjl/patch-2
Update demo.cpp
2018-06-07 23:18:35 +08:00
zhoupeng
cb4011ac56 Add note about the cppjieba-py extension 2018-06-07 16:52:06 +08:00
张家麟
1089dcdcd3
Update demo.cpp
There was an extra semicolon; it didn't really matter, but it's removed now.
2018-01-29 10:12:38 +00:00
Yanyi Wu
6aff1f637c Merge pull request #96 from wangzhe258369/master
Reduce Visual Studio compiler warnings
2017-06-28 00:02:16 +08:00
Wangzhe
e7602afaac Reduce Visual Studio compiler warnings 2017-06-27 23:00:31 +08:00
yanyiwu
dabe502bb4 fix travis compiler 2017-04-03 23:14:59 +08:00
Yanyi Wu
d42602c12d Merge pull request #88 from stphnlyd/readme
mention the Perl 5 binding for CppJieba
2017-04-03 22:56:58 +08:00
Stephan Loyd
3d04caa1b1 mention Perl 5 binding for CppJieba 2017-04-03 22:37:04 +08:00
Yanyi Wu
472a584487 Merge pull request #87 from jonnywang/patch-2
Add link to the PHP extension
2017-03-31 10:38:40 +08:00
星期八
27dbfb8146 Add link to the PHP extension 2017-03-31 10:24:39 +08:00
Yanyi Wu
e5d9eb8816 Merge pull request #79 from royguo/master
Add Unicode offset/length support for `Word`
2016-10-18 23:02:01 +08:00
Roy Guo
f74d716570 Add Unicode offset/length support for Word 2016-10-16 13:05:56 +08:00
Roy Guo
a2f75a00d3 Add Unicode offset/length support for Word 2016-10-16 12:52:50 +08:00
yanyiwu
45809955f5 v5.0.0 2016-09-11 21:44:51 +08:00
yanyiwu
74c70c70cd create keyword_extract in Jieba 2016-09-11 21:42:53 +08:00
yanyiwu
4a755dff6a may be more friendly for compiler 2016-08-11 00:00:20 +08:00
yanyiwu
53bc279dea fix compiler warning 2016-07-23 20:49:27 +08:00
yanyiwu
91b7f9af63 v4.8.1 2016-07-23 00:11:02 +08:00
yanyiwu
0984c9ed3f update user dict loading method about word weight, and add unit tests 2016-07-22 23:53:49 +08:00
Yanyi Wu
e45ac012cb Merge pull request #74 from npes87184/master
fix second element parse error in dict
2016-07-22 13:40:55 +08:00
npes87184
0c3cf04b43 fix second element parse error in dict 2016-07-22 10:19:28 +08:00
Yanyi Wu
e3e5f93ca3 Merge pull request #73 from bigelephant29/user-dict-tag-bug-fix
fix user dict tag bug : wrong buf index assigned
2016-07-21 12:26:16 +08:00
bigelephant29
986106a553 change stoi to atoi 2016-07-21 10:54:08 +08:00
bigelephant29
2e1b6e0443 user dict support user weight and user tag 2016-07-21 10:38:46 +08:00
bigelephant29
b82acaf71e fix user dict tag bug : wrong buf index assigned 2016-07-21 10:06:24 +08:00
Yanyi Wu
8b75bf14a3 Merge pull request #72 from t-k-/master
Add a LookupTag function for tag lookup on a single token
2016-07-07 11:15:59 +08:00
t-k-
e40270ca86 Avoid using `initializer lists' from C++0x. 2016-07-06 13:48:18 -06:00
t-k-
5775a40bee Add LookupTag function for single token tag lookup. 2016-07-06 02:44:56 -06:00
Yanyi Wu
667acdeb7b Merge pull request #71 from jaiminpan/master
add tag capability for each segment
2016-07-03 20:10:49 +08:00
Jaimin Pan
ce8cafe54a add tag capability for each segment 2016-06-27 18:10:42 +08:00
yanyiwu
ec848581b2 fix issue #70 2016-06-10 21:49:31 +08:00
Yanyi Wu
0bf9341dd6 Merge pull request #69 from vsooda/master
fix unittest cmake macro bug
2016-06-08 11:01:47 +08:00
sooda
7d503e4b13 fix unittest cmake macro bug 2016-06-08 10:38:20 +08:00
yanyiwu
c0afac2598 update changelog 2016-05-09 22:52:42 +08:00
yanyiwu
c425bcc49f add Jieba::ResetSeparators api and unittest 2016-05-09 22:49:51 +08:00
yanyiwu
6e3ecec599 improve readability 2016-05-09 22:09:57 +08:00
yanyiwu
e4e1b4e953 update readme 2016-05-09 21:23:05 +08:00
Yanyi Wu
02df433f73 Merge pull request #65 from questionfish/master
Added TextRank keyword extraction
2016-05-04 20:02:07 +08:00
Yanyi Wu
00b2eb13c6 Merge pull request #2 from yanyiwu/patch-1
Patch 1
2016-05-04 19:33:37 +08:00
yanyiwu
b355e9f487 update unittest to pass 'make test' 2016-05-04 19:33:05 +08:00
yanyiwu
0a23d6b268 merge questionfish/master 2016-05-04 19:27:05 +08:00
mayunyun
d5a52a8e7b 1. remove stopword from span windows
2. update unittest
2016-05-04 17:52:30 +08:00
yanyiwu
5c739484ae merge the latest codes in master branch, and update unittest cases to pass ci 2016-05-03 23:20:03 +08:00
questionfish
04c176de08 Merge pull request #1 from yanyiwu/patch-1
Update TextRankExtractor.hpp: use yanyiwu's correction
2016-05-03 21:46:01 +08:00
yanyiwu
f253db0133 use map/set instead of unordered_map/unordered_set to make result stable 2016-05-03 21:24:40 +08:00
yanyiwu
39316114c5 correct unittest case 2016-05-03 20:49:47 +08:00
yanyiwu
a1ea1d0757 add textrank unittest into cmake 2016-05-03 20:01:44 +08:00
Yanyi Wu
6d105a864d Update TextRankExtractor.hpp
remove unused function which uses the C++11 keyword `auto`
2016-05-03 19:53:40 +08:00
mayunyun
0f66a923b3 1. Add unit tests
2. Add constructor overloads and extraction-function overloads
2016-05-03 18:06:14 +08:00
mayunyun
f2de41c15e code layout change: tab -> space 2016-05-03 09:03:16 +08:00
yanyiwu
a778d47046 v4.8.0 2016-05-02 17:15:38 +08:00
yanyiwu
5ac9e48eb0 rewrite QuerySegment, make Jieba::CutForSearch behave the same as the [jieba] cut_for_search api
remove Jieba::SetQuerySegmentThreshold
2016-05-02 16:18:36 +08:00
yanyiwu
3f0faec14b windows ci test 2016-04-27 20:22:05 +08:00
Yanyi Wu
4d8d793da5 Merge pull request #63 from qinwf/windows-appveyor
add Windows CI with MSVC
2016-04-27 19:13:49 +08:00
qinwf
c84594f620 add Windows CI with MSVC 2016-04-27 17:45:48 +08:00
yanyiwu
e6074eecb9 add cppjieba-server link 2016-04-27 16:24:13 +08:00
mayunyun
1aa0a32d90 code format check 2016-04-25 20:28:47 +08:00
mayunyun
669e971e3e new file: include/cppjieba/TextRankExtractor.hpp
Add TextRank Keyword Extractor to JiebaCpp
Added TextRank keyword extraction
2016-04-25 20:20:50 +08:00
yanyiwu
d9e8cdac36 v4.7.0 2016-04-21 14:28:02 +08:00
yanyiwu
9ebc906d3f update README 2016-04-19 16:04:44 +08:00
yanyiwu
3befc42697 update KeywordExtractor::Word's printing format to json format 2016-04-19 16:00:53 +08:00
yanyiwu
a9301facde upgrade limonp -> v0.6.1 2016-04-19 15:24:56 +08:00
yanyiwu
29e085904d add log and unittest 2016-04-18 14:55:42 +08:00
yanyiwu
63e9c94fb7 add unicode decoding unittest 2016-04-18 14:37:17 +08:00
yanyiwu
6fa843b527 override Cut functions, add location information into Word results; 2016-04-17 23:39:57 +08:00
yanyiwu
b6703aba90 use offset instead of str in RuneStr 2016-04-17 22:50:32 +08:00
yanyiwu
e7a45d2dde remove LevelSegment 2016-04-17 22:23:00 +08:00
yanyiwu
42a73eeb64 make compiler happy 2016-04-17 22:11:58 +08:00
yanyiwu
dcced8561e remove namespace unicode 2016-04-17 21:59:10 +08:00
yanyiwu
6ff6fe1430 WordRange construct 2016-04-17 21:57:36 +08:00
yanyiwu
339e3ca772 big change: add RuneStr for the position of word in string 2016-04-17 17:30:05 +08:00
yanyiwu
abcc0af034 update readme 2016-03-30 00:41:44 +08:00
Yanyi Wu
1fb5a7c66f Update README.md 2016-03-29 23:50:59 +08:00
Yanyi Wu
82feba693c Merge pull request #59 from bitdeli-chef/master
Add a Bitdeli Badge to README
2016-03-29 00:52:05 -05:00
Bitdeli Chef
627a514b7f Add a Bitdeli badge to README 2016-03-29 06:04:47 +00:00
yanyiwu
81cd435f2a prettify demo output 2016-03-28 01:22:24 +08:00
yanyiwu
500af453e1 add new case: sqljieba 2016-03-27 23:46:14 +08:00
yanyiwu
4b97c57bb2 v4.6.0 2016-03-26 23:34:40 +08:00
yanyiwu
c19736995c Add KeywordExtractor::Word and add more overloads of KeywordExtractor::Extract 2016-03-26 22:12:40 +08:00
yanyiwu
e6a2b47b87 Change the return value of KeywordExtractor::Extract from bool to void 2016-03-26 01:16:44 +08:00
yanyiwu
5102b8a5c3 Change Jieba::Locate to be static function. 2016-03-26 01:14:48 +08:00
yanyiwu
7db3f87b5f remove info log for dict loading 2016-03-22 10:45:20 +08:00
yanyiwu
5a8a0fae7a v4.5.3 2016-03-18 16:17:57 +08:00
yanyiwu
3ef005275a Upgrade limonp to v0.6.0 2016-03-18 16:14:48 +08:00
yanyiwu
81c35dde01 v4.5.2 2016-03-18 14:32:52 +08:00
yanyiwu
92fdf009cb Upgrade limonp to v0.5.6 to fix hidden trouble. 2016-03-18 14:05:18 +08:00
yanyiwu
643148edf5 platform 2016-02-26 22:33:51 +08:00
yanyiwu
f446ecf2ed v4.5.1 2016-02-19 16:29:59 +08:00
yanyiwu
3e28b4bcb1 adjust code for limonp v0.5.5 to solve macro name conflicts 2016-02-19 16:15:23 +08:00
yanyiwu
fc04cf750a upgrade limonp to v0.5.5 2016-02-19 16:14:33 +08:00
yanyiwu
9d7da4864a v4.5.0 2016-02-18 16:21:15 +08:00
yanyiwu
0a7b6e62f3 add Unicode32 cases for cut testing 2016-02-18 15:18:35 +08:00
yanyiwu
14e09290c2 change Rune type from uint16_t to uint32_t to support more Chinese characters 2016-02-18 14:54:03 +08:00
yanyiwu
8d66b1f1fa upgrade limonp to v0.5.4 2016-02-18 14:48:26 +08:00
yanyiwu
239d025cd8 delete HashMap, use unordered_map instead 2016-02-16 20:24:28 +08:00
yanyiwu
e6454fef77 use HashMap in Trie, and remove the base array of trie root node, see details in Changelog 2016-02-12 01:37:39 +08:00
yanyiwu
2d3c51dba7 upgrade limonp and use limonp::HashMap in Trie 2016-02-04 23:43:26 +08:00
yanyiwu
6f303ee843 v4.4.1 2016-01-29 10:14:40 +08:00
yanyiwu
8496f41e5d update changelog.md 2016-01-29 00:49:29 +08:00
yanyiwu
721b34f1bd fix bug, see details in ChangeLog.md 2016-01-29 00:30:38 +08:00
Yanyi Wu
8ca338d75a Update README.md 2016-01-22 21:38:53 +08:00
yanyiwu
446c21851d v4.4.0 2016-01-21 22:19:10 +08:00
Yanyi Wu
550ac2ab61 Merge pull request #52 from yanyiwu/remove_server
remove server, see details in ChangeLog.md
2016-01-21 19:12:54 +08:00
yanyiwu
34668aa379 remove server, see details in ChangeLog.md 2016-01-21 01:07:31 +08:00
yanyiwu
c1a6726bcc update readme.md 2016-01-20 20:47:21 +08:00
yanyiwu
963bf516a6 v4.3.3 2016-01-20 12:21:08 +08:00
yanyiwu
4493c604b9 Yet Another Incompatibility Problem Repair: Upgrade [limonp] to version v0.5.3, fix incompatibility problem in Windows 2016-01-20 12:06:56 +08:00
yanyiwu
c34c8f3082 v4.3.2 2016-01-16 01:44:28 +08:00
yanyiwu
0482ec2b6c [limonp] to version v0.5.2, fix incompatibility problem in Windows 2016-01-16 01:34:18 +08:00
yanyiwu
eb12813194 v4.3.1 2016-01-13 00:43:11 +08:00
yanyiwu
193e717d22 override constructor in KeywordExtractor 2016-01-13 00:40:46 +08:00
yanyiwu
a6c6e8df8c v4.3.0 2016-01-11 15:02:09 +08:00
yanyiwu
b41cb0e2ee fix compile error 2016-01-11 14:50:14 +08:00
yanyiwu
d92d3f194d upgrade husky to version v0.2.2 2016-01-11 14:33:43 +08:00
yanyiwu
3ab9a34909 upgrade limonp to version v0.5.1 2016-01-11 14:30:38 +08:00
yanyiwu
3c5ad24260 source code layout change:
1. src/ -> include/cppjieba/
2. src/limonp/ -> deps/limonp/
3. server/husky -> deps/husky/
4. test/unittest/gtest -> deps/gtest
2016-01-11 14:25:02 +08:00
yanyiwu
a07a22e9c4 update README 2016-01-10 19:58:47 +08:00
yanyiwu
a740fca866 add english readme 2015-12-24 21:35:06 +08:00
yanyiwu
29306c977f add badge 2015-12-24 21:09:10 +08:00
yanyiwu
fb5d989dc6 v4.2.1 2015-12-12 21:26:45 +08:00
yanyiwu
bcb112a4b1 upgrade basic functions 2015-12-12 21:25:57 +08:00
yanyiwu
8bf70127c2 upgrade limonp to version v0.4.1 2015-12-12 21:02:40 +08:00
yanyiwu
484ce39d36 update husky to version v0.2.0 2015-12-12 19:43:49 +08:00
yanyiwu
194550823f update limonp to version v0.4.0 2015-12-12 19:42:30 +08:00
yanyiwu
c38015d0ee v4.2.0 2015-12-09 00:24:27 +08:00
yanyiwu
1d33dcfdd7 add demo into 'make test' and update readme.md about dict path separator 2015-12-09 00:23:17 +08:00
yanyiwu
8482bef442 change multi user dicts separator from ':' to '|;' 2015-12-09 00:01:27 +08:00
yanyiwu
0989dcb2c9 gitbook-plugin-search-pro 2015-12-04 00:52:29 +08:00
yanyiwu
b3868cdf78 v4.1.2 2015-12-02 01:20:18 +08:00
yanyiwu
8dc01ae614 add Jieba::Locate function to get word locations in a cut sentence 2015-12-02 01:19:23 +08:00
yanyiwu
fb63e78ed2 update cases 2015-12-01 13:46:29 +08:00
Yanyi Wu
1bdedf84ec Merge pull request #49 from jaiminpan/bugfix
Avoid LogFatal
2015-11-28 20:51:40 +08:00
Jaimin Pan
8a956642c3 fix crash if there is a blank line in the dictionary 2015-11-28 20:30:40 +08:00
yanyiwu
a7df45df70 v4.1.1 2015-11-26 00:53:31 +08:00
yanyiwu
60ca5093a9 add Jieba::Tag 2015-11-26 00:47:16 +08:00
yanyiwu
c27d89c60d update contact in readme.md 2015-11-10 17:16:14 +08:00
yanyiwu
c6ae23ec1f update changelog.md, version v4.1.0 2015-10-29 15:31:17 +08:00
yanyiwu
8fe4de404e add SetQuerySegmentThreshold in Jieba 2015-10-29 15:28:10 +08:00
yanyiwu
c3fd357a6d [QuerySegment] add SetMaxWordLen,GetMaxWordLen, and filter the english sentence in secondary Cut 2015-10-29 14:23:01 +08:00
yanyiwu
087f3248f8 update changelog.md 2015-10-29 12:40:38 +08:00
yanyiwu
83cc67cb15 [code style] uppercase function name 2015-10-29 12:39:10 +08:00
yanyiwu
f17c2d10e2 [code style] uppercase function name 2015-10-29 12:30:47 +08:00
yanyiwu
1a9a37aa64 update changelog 2015-10-29 12:27:37 +08:00
yanyiwu
6f51373280 support optional user word freq weight 2015-10-09 11:20:06 +08:00
yanyiwu
ecacf118e6 [code style] lower case namespace 2015-10-08 21:13:11 +08:00
yanyiwu
16b69e35c1 delete Application.hpp, use Jieba.hpp instead 2015-10-08 21:03:09 +08:00
yanyiwu
4d56be920b support optional user word freq weight 2015-10-08 20:05:27 +08:00
yanyiwu
98345d6aed add SetStaticWordWeights UserWordWeightOption 2015-10-08 17:36:52 +08:00
yanyiwu
b28d6db574 code style 2015-10-08 17:08:57 +08:00
yanyiwu
9b60537b40 update changelog.md 2015-09-25 16:25:11 +08:00
yanyiwu
9de513f1d5 new feature: loading multi user dict, path is split by : 2015-09-25 16:20:06 +08:00
yanyiwu
e55d0bf95c update limonp 2015-09-25 16:11:27 +08:00
yanyiwu
5bf7454ad2 add multi user dict unittest 2015-09-25 16:07:01 +08:00
yanyiwu
9f359f3783 v3.2.1 2015-09-24 12:03:04 +08:00
yanyiwu
c70dcdd2a9 fix bug about header file including protection 2015-09-24 11:48:50 +08:00
yanyiwu
ea4d81cde7 add segment cut case 2015-09-18 14:28:34 +08:00
yanyiwu
fbd9f51b0a updated about make install 2015-09-16 11:19:19 +08:00
yanyiwu
b68afb0db2 v3.2.0 2015-09-14 12:44:49 +08:00
yanyiwu
ec6a12a021 add gojieba into README.md 2015-09-14 12:03:18 +08:00
yanyiwu
eb6f47b6b0 refactor unittest 2015-09-13 18:09:56 +08:00
yanyiwu
8eef9a13a8 fix bug about optional argument hmm 2015-09-13 18:06:44 +08:00
yanyiwu
f517601c29 changelog 2015-09-13 17:38:14 +08:00
yanyiwu
f98e94869c add optional argument: hmm 2015-09-13 17:28:49 +08:00
yanyiwu
14974d51b4 abandon ISegment 2015-09-13 17:02:04 +08:00
yanyiwu
6d69363145 refactor, simplify SegmentBase 2015-09-13 16:29:35 +08:00
yanyiwu
e9241d9025 fixed the bug in the last commit 2015-09-13 16:18:48 +08:00
yanyiwu
28bcb3bf57 use PreFilter in SegmentBase 2015-09-13 16:05:17 +08:00
yanyiwu
0542dd1cfd add PreFilter 2015-09-13 15:10:10 +08:00
yanyiwu
710ddacd38 add Jieba.hpp 2015-09-13 00:28:40 +08:00
yanyiwu
63ca914176 update before_install for mac 2015-09-11 18:08:21 +08:00
yanyiwu
0ffc0f8079 make test 2015-09-11 18:06:58 +08:00
yanyiwu
19bb124b3e [enhancement issue]: https://github.com/yanyiwu/nodejieba/issues/39 2015-09-11 17:30:23 +08:00
yanyiwu
1babe57ebc Fine-grained segmentation feature 2015-08-30 16:35:21 +08:00
yanyiwu
3c60c35906 Fix bug where FullSegment produced no output for some single characters 2015-08-30 13:09:37 +08:00
yanyiwu
001a69d8c6 Add fine-grained segmentation support to MPSegment. 2015-08-30 01:04:30 +08:00
yanyiwu
fae951a95d Unify the naming style of private functions 2015-08-28 11:17:38 +08:00
yanyiwu
0e0318f6ad Integrate LevelSegment into Application 2015-08-11 11:57:58 +08:00
yanyiwu
0a6b01c374 update changelog.md 2015-08-11 00:53:43 +08:00
yanyiwu
41e4300c9a LevelSegment 2015-08-11 00:53:06 +08:00
yanyiwu
efd029c20b namespace husky; namespace limonp; 2015-08-08 12:30:14 +08:00
yanyiwu
8a3ced2b27 Remove some unnecessary return-value checks to simplify the code 2015-07-24 14:39:03 +08:00
yanyiwu
0f79fa6c24 Handle all Unicode/string conversion in one place, SegmentBase 2015-07-24 13:42:24 +08:00
yanyiwu
4d86abb001 Add findByLimit function 2015-07-23 21:10:56 +08:00
yanyiwu
78e41e5fd0 Standardize Unicode-related naming; use Rune to represent one Chinese character 2015-07-21 14:54:50 +08:00
yanyiwu
0e16e000ea Resolve some legacy issues 2015-07-21 14:32:05 +08:00
yanyiwu
620d276887 Tidy up common low-level structures 2015-07-21 12:11:43 +08:00
yanyiwu
83222918cc Update ChangeLog 2015-07-21 11:26:33 +08:00
Yanyi Wu
5296a83823 Merge pull request #44 from aholic/master
Improve Trie efficiency
2015-07-21 11:15:26 +08:00
aholic
f5e74a3f46 replace old trie 2015-07-21 00:29:49 +08:00
aholic
f5d824043c Merge branch 'master' of https://github.com/aholic/cppjieba 2015-07-21 00:17:02 +08:00
aholic
791ee25295 pull upstream 2015-07-21 00:16:49 +08:00
xuangong
cf9cc45c19 astyle 2015-07-21 00:11:13 +08:00
xuangong
931db7d1e5 astyle 2015-07-20 23:54:20 +08:00
yanyiwu
6e723c2c58 v3.1.0 2015-06-27 13:19:26 +08:00
yanyiwu
2ae6eba3a7 Update the insertUserWord example program 2015-06-27 13:16:25 +08:00
yanyiwu
d33c09d74a Add unit tests 2015-06-27 12:34:27 +08:00
yanyiwu
64d073d194 Support the insertUserWord interface 2015-06-27 11:39:43 +08:00
yanyiwu
c5f7d4d670 Check in before refactoring the trie 2015-06-26 14:29:44 +08:00
yanyiwu
e0db070529 Expose the insertUserWord interface; add a default argument to cut, with Mix as the default segmentation algorithm 2015-06-26 12:22:11 +08:00
yanyiwu
1d27559209 refactor DictTrie, and expose function: insertUserWord 2015-06-26 11:49:35 +08:00
yanyiwu
ee255baf56 v3.0.1 Improve compatibility; fix compile errors in certain environments. 2015-06-24 16:01:41 +08:00
yanyiwu
9284fe1872 Performance benchmark 2015-06-14 12:21:09 +08:00
yanyiwu
389914ae1b Fix code that failed to compile on Windows; improve compatibility. 2015-06-09 15:31:43 +08:00
yanyiwu
e3c57c0ba1 Improve compatibility; fix compile errors in certain environments. 2015-06-08 15:01:59 +08:00
yanyiwu
67cc5941be update demo 2015-06-07 11:13:33 +08:00
yanyiwu
acd01bda99 v3.0.0 2015-06-06 11:47:04 +08:00
yanyiwu
3528b6296a Modify the cjserver service so different segmentation algorithms can be selected via an HTTP parameter.
Change the make install directory so everything installs into a single directory, /usr/local/cppjieba
2015-06-05 21:59:16 +08:00
yanyiwu
8ce2af9706 Update the demo example; the demo now only needs a single Application instance. 2015-06-05 18:12:27 +08:00
yanyiwu
e5d1ac7bc8 Move dict/{extra_dict,gbk_dict} into test/testdata 2015-06-05 16:31:43 +08:00
yanyiwu
a3d9b40c2a Change the parameter order of the QuerySegment constructor 2015-06-05 16:23:51 +08:00
yanyiwu
45588b75cc Add the Application class, which bundles all CppJieba functionality; users only need this one class from now on. 2015-06-05 16:00:32 +08:00
yanyiwu
d56bf2cc68 Refactor: add constructors to each segmenter class, in preparation for bigger changes later. 2015-06-04 22:38:55 +08:00
yanyiwu
b99d0698f0 Split the model-file data out of HMMSegment into a separate HMMModel 2015-06-04 17:52:18 +08:00
yanyiwu
d3b34b73c6 Update how the segmentation algorithm can be changed in the segmentation service. 2015-06-04 14:40:34 +08:00
yanyiwu
d34ed79b03 more flexible 2015-06-04 14:39:40 +08:00
yanyiwu
9218ccb9c9 set default argument in QuerySegment: size_t maxWordLen = 4 2015-06-04 14:37:09 +08:00
yanyiwu
aed1c8f4a6 Remove some unnecessary error checks 2015-05-21 16:04:41 +08:00
yanyiwu
954100dc3d use LogFatal for more human-readable error messages 2015-05-20 16:50:12 +08:00
yanyiwu
6e3bb7d057 use reverse_iterator 2015-05-18 23:57:13 +08:00
yanyiwu
c04b2dd0d4 Add more detailed error logging; use LogFatal appropriately during initialization. 2015-05-07 20:03:19 +08:00
yanyiwu
31400cee17 update changelog 2015-05-06 23:02:57 +08:00
yanyiwu
2b18a582fc code style 2015-05-06 23:02:03 +08:00
yanyiwu
bb32234654 astyle --style=google --indent=spaces=2 2015-05-06 17:53:20 +08:00
yanyiwu
b70875f412 update LogFatal, print more readable error message when errors happened 2015-05-06 17:20:15 +08:00
yanyiwu
56c524f7a8 yanyiwu.mit-license.org 2015-04-25 12:19:24 +08:00
aholic
d1a112c0c4 improve efficiency for trie tree in ugly way 2015-04-19 21:44:50 +08:00
aholic
ea0d464519 Merge https://github.com/yanyiwu/cppjieba 2015-03-19 22:57:04 +08:00
yanyiwu
5121bf675e __APPLE__ 2015-02-28 12:49:07 +08:00
yanyiwu
b3d928a450 rename aszxqw -> yanyiwu 2015-02-11 17:11:37 +08:00
Yanyi Wu
8fe97fc898 Merge pull request #39 from qinwf/patch-test
Add segmentation rule for English letters + digits qinwf/jiebaR#7
2015-02-06 10:59:43 +08:00
qinwf
c0bdef74fb Add segmentation rule for English letters + digits qinwf/jiebaR#7 2015-02-06 10:19:43 +08:00
yanyiwu
10e9b32258 little adjustment 2015-01-31 12:58:49 +08:00
yanyiwu
00f738a617 update husky for server 2015-01-31 10:14:16 +08:00
yanyiwu
660cd9d93e update limonp for Colors.hpp and use ColorPrintln in load_test.cpp 2015-01-28 21:27:46 +08:00
yanyiwu
8c23da4332 remove debug log in hmm 2015-01-28 20:29:38 +08:00
yanyiwu
2488738b55 update unittest 2015-01-24 15:51:24 +08:00
yanyiwu
4e72d4a06f KeywordExtractor: support a custom dictionary (optional parameter). 2015-01-24 15:34:34 +08:00
yanyiwu
269bc0fd0d make QuerySegment support user.dict.utf8 2015-01-23 01:10:12 +08:00
yanyiwu
a406c0f8cc 2.4.4 2015-01-06 15:29:21 +08:00
yanyiwu
51e4583fd1 update email 2015-01-06 15:28:10 +08:00
yanyiwu
7304ccb854 add iosjieba into readme.md 2014-12-24 22:57:33 +08:00
yanyiwu
dc41c9eeb9 update jieba_rb 2014-12-24 19:13:09 +08:00
wyy
5858fe29a2 update changelog.md 2014-12-16 12:45:29 +08:00
wyy
e0e0a6b976 Fix typename compatibility issue across compiler versions 2014-12-16 12:44:48 +08:00
yanyiwu
0edb2b13cc cjieba 2014-12-16 01:30:14 +08:00
wyy
e84d57426d fix warnings 2014-11-30 01:13:25 +08:00
wyy
a63fe809b1 rm unused file 2014-11-30 00:34:17 +08:00
Yanyi Wu
de962ec97b Merge pull request #37 from qinwf/master
Remove the duplicate header include in MPSegment.hpp and add a UBSAN test
2014-11-30 00:11:42 +08:00
Qin Wenfeng
2b522b20ff Use uint8_t to pass the UBSAN test 2014-11-29 19:41:12 +08:00
Qin Wenfeng
61f2031e4b Remove the duplicate header include in MPSegment.hpp 2014-11-29 19:36:55 +08:00
wyy
e9cbec02c2 Add two POS-tagging rules for consecutive English letters and digits. 2014-11-29 12:45:11 +08:00
aholic
7791290473 Merge https://github.com/aszxqw/cppjieba 2014-11-14 13:20:04 +08:00
wyy
9d5359fc34 update changelog.md 2014-11-13 01:32:38 +08:00
wyy
7868f7cdff Remove some template code 2014-11-13 01:16:38 +08:00
wyy
c119dc0a93 use localvector in dag 2014-11-12 21:18:30 +08:00
wyy
99c3405e13 move flag 2014-11-12 20:03:32 +08:00
wyy
75367a20c9 little modification 2014-11-12 19:45:20 +08:00
wyy
3ced451212 use automation 2014-11-12 18:55:17 +08:00
wyy
b9736ee132 update trie and dag , make cut faster . see details in changelog.md 2014-11-05 15:31:09 +08:00
wyy
11b041ed52 make load_test test time longer 2014-11-05 14:57:34 +08:00
aholic
283c65db0a fetch ahead 2014-11-05 11:13:00 +08:00
aholic
c2125b5371 Merge https://github.com/aszxqw/cppjieba 2014-11-05 11:12:33 +08:00
Yanyi Wu
a3671ab252 Merge pull request #36 from qinwf/master
Add jiebaR to the README
2014-11-04 12:11:06 +08:00
Qin Wenfeng
7bf2bceee4 Add jiebaR to the README
Added jiebaR, the R-language wrapper of CppJieba.
2014-11-04 12:07:33 +08:00
wyy
471a68e08e Add tests 2014-11-03 11:30:45 +08:00
wyy
107638f7d8 Modify test data, etc. 2014-11-03 11:19:00 +08:00
wyy
fbae0f6075 Add two segmentation rules 2014-11-03 10:54:53 +08:00
wyy
b68a76e63a Improve some tests 2014-10-26 12:21:10 +08:00
aholic
e85a3ef8d3 fix bug for map.erase 2014-10-25 18:29:04 +08:00
wyy
11de561332 Support Docker 2014-10-25 14:47:20 +08:00
wyy
22f5e06715 docker 2014-10-25 11:21:27 +08:00
wyy
6ac7a8c85c add dockerfile 2014-10-25 00:58:31 +08:00
Yanyi Wu
82d8a23ab9 Update README.md
Update the wiki URL
2014-10-22 23:01:27 +08:00
wyy
ad02d2d43e Better support for Mac OS X 2014-10-16 00:08:21 +08:00
wyy
b572597777 Share dictionaries 2014-10-15 21:22:46 +08:00
wyy
0fd68846af update travis-ci for operating system osx 2014-10-12 16:22:56 +08:00
wyy
020aeaeeb0 update tagging_demo.cpp 2014-09-28 14:13:02 +08:00
wyy
ef5766904a Change the custom POS format to: word tag 2014-09-28 13:43:30 +08:00
wyy
6a8ebae344 Support custom POS tags 2014-09-28 13:22:37 +08:00
wyy
28246fba5d Remove some temporarily unused parameters from the PosTagger constructor and add PosTagger unit tests. 2014-09-28 11:59:30 +08:00
wyy
da1b9e0c1c update limonp 2014-09-18 00:05:43 +08:00
wyy
23aee266c3 update changelog.md 2014-09-16 23:41:43 +08:00
wyy
49e3a1760f interrupt socket receive when header is too long. 2014-09-16 21:53:33 +08:00
wyy
198c483c66 update husky to make it more stable 2014-09-15 23:03:58 +08:00
wyy
eb113acfbe update test/servertest 2014-09-15 22:21:37 +08:00
wyy
38af4a5fb6 update receive 2014-09-15 19:01:04 +08:00
wyy
fbbcfbdec7 update limonp and husky for threadpool usage 2014-09-15 17:52:33 +08:00
wyy
e25828e0a9 update readme.md 2014-09-12 23:40:35 +08:00
wyy
698bde3c85 add ngx_http_cppjieba_module in readme.md 2014-09-06 20:52:23 +08:00
wyy
12befefe4e update changelog.md 2014-08-16 00:14:20 +08:00
wyy
269fee6f2c v2.4.2 2014-08-16 00:10:16 +08:00
wyy
4d686edb7f update unittest for compiling ok in mac 2014-08-15 22:30:52 +08:00
wyy
e317f25d94 update changelog.md 2014-08-15 22:12:02 +08:00
wyy
40eb40288d compatiable with -std=c++0x 2014-08-15 22:09:21 +08:00
wyy
9571a4d0d5 remove InitOnOff to make code lighter 2014-08-12 00:34:37 +08:00
wyy
5bfd3d0c49 update fullsegment for reducing memory cost 2014-08-11 23:34:29 +08:00
wyy
f6762e07ae update testing in readme.md 2014-07-28 20:39:01 +08:00
wyy
2113df1344 update readme.md 2014-07-23 00:36:49 +08:00
wyy
d6f114cd73 update changelog.md 2014-07-08 23:39:02 -07:00
wyy
8df0a1c89e fix max probability segmenter's bug: result is incomplete when a special symbol is in the sentence 2014-07-08 23:38:06 -07:00
wyy
5b0ac64bc2 add unittest 2014-07-08 23:07:27 -07:00
wyy
007649494d avoid warning in cmake about Loggger.hpp 2014-07-05 19:18:39 +08:00
wyy
3c95ee686a update changelog.md 2014-06-13 00:36:34 +08:00
wyy
fc621ce856 add user_dict_path for server 2014-06-13 00:26:37 +08:00
wyy
c9c1ff5ac6 update readme.md 2014-06-12 23:58:55 +08:00
wyy
0ee13c8c06 fix bug about space in httpstr 2014-06-12 23:58:47 +08:00
wyy
8f5d08b7ae update readme.md 2014-06-11 19:49:14 +08:00
wyy
4a8f63fcd2 make segments NonCopyable 2014-06-11 16:18:09 +08:00
wyy
12d3741562 avoid warning in g++ 2014-06-05 19:29:57 +08:00
wyy
16e6ac0819 update changelog.md 2014-06-05 18:36:41 +08:00
wyy
a8f83dd6f0 update localvector 2014-06-05 18:30:08 +08:00
wyy
189b2725a0 add localvector 2014-06-05 01:00:17 +08:00
wyy
76dd93051e add localvector 2014-06-05 00:48:49 +08:00
wyy
014bea02ba update readme.md 2014-05-31 18:08:04 +08:00
wyy
c46980c17c minor change 2014-05-30 00:21:11 +08:00
wyy
e96885c38e update limonp/codeconverter.hpp 2014-05-29 23:57:32 +08:00
wyy
059f05c25d update limonp : add CodeConverter and delete some unused files 2014-05-29 22:39:22 +08:00
wyy
fb608627c9 update limonp 2014-05-26 17:15:52 +08:00
wyy
51ae3ffb87 update changelog.md 2014-05-24 16:14:19 +08:00
wyy
75581495b4 use vector's reserve 2014-05-24 16:09:00 +08:00
wyy
bc6ed2368d use vector's reserve 2014-05-24 15:37:31 +08:00
wyy
1a314d4b4c use vector's reserve 2014-05-24 13:44:55 +08:00
wyy
7eb896529f update .travis.yml 2014-05-24 13:36:12 +08:00
wyy
28cdc2e86b finished v2.4.1 2014-05-24 13:28:42 +08:00
wyy
ac49986592 little modification in readme.md 2014-05-22 15:18:20 +08:00
wyy
5a7f8fea95 Merge branch 'master' of github.com:aszxqw/cppjieba 2014-05-20 19:32:44 +08:00
wyy
0869568f4a modify blog url in readme.md 2014-05-20 19:31:06 +08:00
wyy
dd2e08f1e5 update EpollServer for cjserver 2014-05-17 21:21:28 -05:00
wyy
f0a0731b74 add server.conf into testdata for testing 2014-05-17 21:20:09 -05:00
wyy
f7108ce693 modify changelog.md 2014-05-17 16:28:59 +08:00
wyy
5b654f66db make sure a single-character Chinese word in the user dict is not ignored in mixsegment.hpp 2014-05-17 16:22:54 +08:00
wyy
5174ac098a corrected word spelling in script 2014-05-15 12:15:32 +08:00
wyy
fb25d4640c add some notes about gbk 2014-05-08 18:04:35 +08:00
wyy
2479bb1927 modify readme.md 2014-04-27 18:44:03 +08:00
wyy
932bcc96db use travis 2014-04-26 18:27:03 +08:00
wyy
af01164c7f modify .travis.yml 2014-04-26 18:12:08 +08:00
wyy
4819d307e9 added .travis.yml 2014-04-26 18:04:34 +08:00
wyy
376750a518 modify cmakelists.txt for mac 2014-04-25 22:57:04 +08:00
wyy
ac6207635f modify changelog and readme 2014-04-25 22:34:22 +08:00
wyy
57ef504d9b modify test/segment_demo.cpp 2014-04-25 22:09:55 +08:00
wyy
f8487fd9cf remove src/segment and mv server.cpp into server/server.cpp and modify readme.md 2014-04-25 21:48:29 +08:00
wyy
94ae4bdd6f rm unused server in test 2014-04-25 21:21:05 +08:00
wyy
3e0aaf73a5 adding user dict interface and test ok 2014-04-25 19:30:26 +08:00
wyy
566187a49c add userdict.utf8 2014-04-25 19:22:32 +08:00
wyy
2937985243 adding user dict interface 2014-04-25 18:47:22 +08:00
wyy
dc96bb3795 add userdict loader 2014-04-25 17:29:42 +08:00
wyy
2f314ffdb1 mv *.gbk to gbk_dict 2014-04-25 17:13:14 +08:00
wyy
bea6174316 modify changelog.md 2014-04-20 00:24:30 +08:00
wyy
be3773920a modify keyword_demo 2014-04-20 00:23:42 +08:00
wyy
ae3e0a1b6a make keywordextractor faster 2014-04-20 00:20:25 +08:00
wyy
2645a4e837 add keyword extractor into load_test 2014-04-19 23:56:43 +08:00
wyy
cbe9642972 ci readme.md 2014-04-19 13:07:58 +08:00
wyy
884aa89009 add test case 2014-04-19 13:01:31 +08:00
wyy
9f100121f8 ci changelogmd 2014-04-19 12:45:43 +08:00
wyy
d6bf7cd10c modify test demo 2014-04-19 12:41:09 +08:00
wyy
e225c8c722 and modify some test case 2014-04-19 12:35:19 +08:00
wyy
a585471e76 rewrite cut for chinese special symbol 2014-04-19 11:25:13 +08:00
wyy
3d6bade24f Merge branch 'master' into for_09az 2014-04-16 20:46:15 +08:00
wyy
084bd91093 modify readme 2014-04-16 20:41:47 +08:00
wyy
d61d694ac7 do some rename 2014-04-16 19:12:24 +08:00
wyy
76d640b26e use filterSpecialChars in segmentbase.hpp 2014-04-14 22:21:09 +08:00
wyy
59dae88689 modify changelog.md 2014-04-11 16:02:11 +08:00
wyy
d9c7efdf4d add PATH in cjserver.start/stop 2014-04-11 15:41:37 -05:00
wyy
bb6c3f9e78 add shrink for vector in DictTrie.hpp 2014-04-11 15:25:03 +08:00
wyy
0ca598b747 modify changelog.md 2014-04-11 15:13:43 +08:00
wyy
0af9ae3de3 rewrite server script for cjserver 2014-04-11 15:08:24 +08:00
wyy
0d9008df7c ci changelog.md 2014-04-11 13:14:36 +08:00
wyy
cae1503725 split Trie.hpp into (Trie.hpp & DictTrie.hpp) test ok 2014-04-11 12:08:46 +08:00
wyy
24120c92b1 compile ok 2014-04-10 09:16:35 -07:00
wyy
776191b375 ci 2014-04-10 22:32:39 +08:00
wyy
abd23a4d79 rename Trie -> DictTrie 2014-04-10 21:07:11 +08:00
wyy
f70b654b66 split Trie.hpp into (Trie.hpp & DictTrie.hpp) 2014-04-10 21:05:01 +08:00
wyy
e6fde86be5 Merge branch 'dev' of github.com:aszxqw/cppjieba into dev 2014-04-10 02:59:40 -07:00
wyy
92787a3c72 fix potential bug in Trie.hpp 2014-04-10 02:58:58 -07:00
wyy
c04ab76afb fix potential bug in Trie.hpp 2014-04-10 02:58:04 -07:00
wyy
d3dc0ff240 remove isLeaf flag 2014-04-10 13:24:53 +08:00
wyy
61f542a6b1 little modify MPSegment 2014-04-08 09:05:09 -07:00
wyy
45a7cac784 change MPSegment's cut(..., vector<TrieNodeInfo>) -> cut(..., vector<Unicode>) 2014-04-08 08:43:32 -07:00
wyy
1536a9e2e3 modify _instertNode 2014-04-08 20:39:43 +08:00
wyy
a3e0db22e8 change trie.find args 2014-04-08 19:59:02 +08:00
wyy
bfbd63f3e8 remove trie.find(xx,xx, vector) 2014-04-08 19:51:49 +08:00
wyy
f254691e53 ci MPSegment.hpp 2014-04-07 23:05:09 -07:00
wyy
687ebfc19b improve viterbi 2014-04-07 22:54:01 -07:00
wyy
278b93e851 increment load_test 2014-04-07 22:52:37 -07:00
wyy
f9af003440 ci readme.md 2014-04-05 17:45:07 +08:00
wyy
440b7b6b6b add post for server 2014-04-03 18:46:51 -07:00
wyy
b681382205 add post for server 2014-04-03 18:41:03 -07:00
wyy
ee8ec6955a add post for server 2014-04-03 10:00:40 -07:00
wyy
d0cfd042a4 ci load_test 2014-04-03 07:04:05 -07:00
wyy
d791859027 add format=simple in server 2014-04-03 04:53:00 -07:00
wyy
467fcf8434 refactor trie.hpp's loading and building 2014-04-02 09:08:45 -07:00
wyy
86de722888 little modification 2014-03-31 10:49:25 +08:00
wyy
8382828e48 little modification 2014-03-30 23:19:10 +08:00
wyy
8eb2a9bc09 fix bug in 2014-03-27 21:03:13 +08:00
wyy
abe4be255f modify some stuff to adapt to lower-version cmake & g++ 2014-03-27 01:41:05 -07:00
wyy
9a7ba0d685 update limonp && husky to adapt to lower-version g++ 2014-03-27 11:44:41 +08:00
wyy
2b72d1cf91 calc time consumed in load_test 2014-03-21 11:55:40 +08:00
wyy
192bcf879a see changelog.md 2014-03-21 11:25:26 +08:00
wyy
d2d6868b75 merge some testfile into one testfile to reduce compiler cost 2014-03-21 11:18:34 +08:00
wyy
498bc431f4 rm try..catch in HMMSegment 2014-03-19 23:47:46 -07:00
wyy
ed39e558bc adapt_to_gxx44 2014-03-18 11:51:04 -05:00
wyy
52b6c61326 Merge branch 'dev' 2014-03-16 23:29:37 +08:00
wyy
ae99225880 merge v2.3.3 2014-03-16 23:29:19 +08:00
wyy
223b35f308 merge v2.3.3 2014-03-16 23:28:12 +08:00
wyy
c08154925f add -g compile argv in cmake 2014-03-16 23:23:17 +08:00
wyy
89c955c1d6 prettify Trie.hpp ing 2014-03-16 21:00:51 +08:00
wyy
762495f5f4 prettify Trie.hpp ing 2014-03-16 20:42:20 +08:00
wyy
fe7e3ff807 prettify Trie.hpp ing 2014-03-16 20:20:37 +08:00
wyy
582d61e3e8 add stopword in KeywordExtractor 2014-03-15 23:38:00 +08:00
wyy
6de292a56d add stopword in KeywordExtractor 2014-03-15 23:31:59 +08:00
wyy
4a559e7858 update Limonp 2014-03-15 23:13:32 +08:00
wyy
752ae03b34 add stop_words.utf8 2014-03-15 23:11:22 +08:00
wyy
d96c37d372 add stop_words.utf8 2014-03-15 23:10:08 +08:00
wyy
8b0bffecdb rm TrieManager.hpp 2014-03-15 23:00:32 +08:00
wyy
0829a6ae67 rm TrieManager.hpp 2014-03-15 22:59:49 +08:00
wyy
a4b0a6c762 rm TrieManager.hpp 2014-03-15 22:48:29 +08:00
wyy
ddaa5589f1 rm TrieManager.hpp 2014-03-15 22:02:48 +08:00
wyy
f9857a9ad0 ci changelog.md 2014-03-11 12:45:54 +08:00
wyy
e3b58d6ddc use map in extract to fix an ordering bug across different environments; as a side effect it improves speed by about 1/6 2014-03-11 10:43:06 +08:00
wyy
90d2280002 use map as DagType to fix an ordering bug across different environments; as a side effect it improves speed by about 1/6 2014-03-11 10:28:10 +08:00
wyy
485383c669 static const char * -> const char* const 2014-03-09 20:02:03 -07:00
wyy
df9f47eb47 add ut case 2014-03-09 04:49:47 -07:00
wyy
7e09667004 fix bug and add simple pos_tagger 2014-03-07 04:36:07 -08:00
wyy
a7f4e18027 ci TKeywordExtractor.cpp to fix a bug where test results on x64 and x86 differ 2014-03-07 18:35:14 +08:00
wyy
c74ca2b458 Merge branch 'master' into dev 2014-03-06 23:36:34 -08:00
wyy
6fcaad6514 fix bug in BLACK_LIST definition 2014-03-06 23:35:32 -08:00
wyy
220066d159 fix bug in postagger.hpp 2014-02-27 13:17:00 +08:00
wyy
664ded109a modify cmakelist 2014-02-27 12:13:08 +08:00
Yanyi Wu
2159798685 Merge pull request #21 from aholic/dev
add part of speech
2014-02-27 12:05:18 +08:00
wyy
ce472622de ci 2014-02-25 20:08:46 -08:00
aholic
275a3779e5 add Part of Speech without viterbi.... 2014-02-25 21:20:48 +08:00
aholic
31e3d4fc12 Merge https://github.com/aszxqw/cppjieba into dev 2014-02-25 18:50:24 +08:00
wyy
497a3957d9 modify readme 2014-02-22 23:52:26 +08:00
wyy
15d4eb4531 ci readme.md 2014-02-13 23:19:09 +08:00
wyy
e15aa735fd ci readme.md 2014-02-13 23:17:31 +08:00
wyy
9d6ffc4de4 add new tag v2.3.1 in changelog.md 2014-02-11 14:10:54 +08:00
wyy
7ec17a2cd6 fix bug for install server to linux (using start-stop-daemon) 2014-02-11 14:03:53 +08:00
wyy
b54afeb5a3 modify readme : move some content into wiki 2014-02-11 13:08:24 +08:00
wyy
bb1e7e717b modify changelog.md 2014-02-10 11:33:56 +08:00
wyy
b7c93f196c add keyword_demo into readme.md 2014-02-10 11:30:38 +08:00
wyy
eff8d45267 fix bug: cmp function pair<string, uint> -> pair<string, double> 2014-02-10 11:16:24 +08:00
wyy
31bcaeb11e fix bug: cmp function pair<string, uint> -> pair<string, double> 2014-02-10 11:08:26 +08:00
wyy
5cf310f445 modify test for keywordextractor 2014-02-10 00:38:38 +08:00
wyy
0cfe54df3a add test/keyword_demo.cpp 2014-02-10 00:26:41 +08:00
wyy
8804f193a3 modify readme.md 2014-02-10 00:08:58 +08:00
wyy
2596cc5708 add changelog.md 2014-02-10 00:08:10 +08:00
aholic
e6ce8e23f0 Merge https://github.com/aszxqw/cppjieba into dev 2014-02-08 17:56:02 +08:00
wyy
5f96dcf09a add filtering of single-character words in keywordextractor. 2014-02-07 17:51:08 +08:00
wyy
440b168d8b ci 2014-02-02 13:53:58 +08:00
wyy
18f73f1c30 add dict/readme.md 2014-02-02 13:14:14 +08:00
wyy
f64c11c57e add blacklist 2014-01-31 17:37:40 +08:00
wyy
41a33747f4 use InitOnOff 2014-01-30 01:06:32 +08:00
wyy
d5bb4e48ec use InitOnOff 2014-01-29 20:37:26 +08:00
wyy
259b296b71 int -> uint for avoid warning 2014-01-29 20:20:24 +08:00
wyy
f1093d6cbc use mit license 2014-01-29 20:13:26 +08:00
aholic
8e2c726a8c Merge branch 'dev' of https://github.com/aszxqw/cppjieba into dev 2014-01-27 01:54:01 +08:00
aholic
e23a3f555b add hmm model files for pos tagging 2014-01-27 01:00:06 +08:00
wyy
453d4a143f add dependencies section in readme 2014-01-26 12:37:01 +08:00
Yanyi Wu
69a82cdbc6 Merge pull request #20 from aholic/dev
fix bug for md5File
2014-01-25 20:27:41 -08:00
aholic
ca3eddfb1e Merge https://github.com/aszxqw/cppjieba into dev 2014-01-24 00:01:52 +08:00
aholic
abb016f16a fix bug for md5File 2014-01-23 23:58:58 +08:00
Yanyi Wu
2f8821f699 Merge pull request #17 from aholic/dev
Dev
2014-01-22 04:32:06 -08:00
Yanyi Wu
fba34e1ace Merge pull request #16 from dlackty/tweak-cmake
Tweak main CMakeLists
2014-01-17 18:00:29 -08:00
Richard Lee
8ab4daf669 Tweak main CMakeLists 2014-01-18 05:58:14 +08:00
Yanyi Wu
18c205f596 Merge pull request #15 from dlackty/fix-osx-compiling
Fix OS X 10.9 compiling issues
2014-01-17 03:33:46 -08:00
Yanyi Wu
0b99454f73 Merge pull request #14 from dlackty/build-gitignore
Put build folder to gitignore
2014-01-17 03:33:21 -08:00
Richard Lee
af7fedd3ef Fix OS X 10.9 compiling issues 2014-01-17 19:11:39 +08:00
Richard Lee
1587cd1f56 Put build folder to gitignore 2014-01-17 19:08:53 +08:00
aholic
680399efdc merge upstream 2014-01-12 18:12:22 +08:00
wyy
bca6e7717f add logerror 2014-01-04 18:11:36 +08:00
wyy
80f9d2ea4c add DEFINE MIN MACRO 2014-01-04 17:48:13 +08:00
wyy
14aa9168d3 remove some warning of compiler 2013-12-24 19:33:56 -08:00
wyy
229fcd715f add another extract function in keywordextractor.hpp and ut ok 2013-12-24 19:03:52 -08:00
wyy
62b83a36a0 using idf.utf8 in keywordExtractor 2013-12-24 06:55:27 -08:00
wyy
9229fec6ca fix bug in test 2013-12-24 03:11:31 -08:00
wyy
8271320412 fix bug in cmake 2013-12-24 03:05:58 -08:00
wyy
5236c634b2 Merge remote-tracking branch 'origin/dev' 2013-12-24 02:59:19 -08:00
wyy
1db13168ff add servertest 2013-12-24 02:58:34 -08:00
wyy
cdd8517cd0 rename scripts -> script 2013-12-24 02:35:23 -08:00
wyy
418b18db55 rename dicts -> dict 2013-12-24 02:32:00 -08:00
wyy
0db2dfa6b8 finished KeywordExtractor and its ut 2013-12-24 01:22:02 -08:00
wyy
0f7947d1e3 update husky and limonp 2013-12-23 23:59:52 -08:00
wyy
3eb0470c2f update husky and limonp 2013-12-23 23:58:54 -08:00
wyy
24a15cd128 rename and finishing KeywordExtractor.hpp 2013-12-23 19:22:59 -08:00
wyy
657aee0fda mv filterAscii from ChineseFilter.hpp into SegmentBase.hpp 2013-12-21 21:58:15 -08:00
wyy
679179859e add some log debug & info 2013-12-21 21:47:01 -08:00
wyy
1b801c28a1 add load_test into cmake 2013-12-21 20:14:24 -08:00
wyy
5bd4930d41 Merge branch 'dev' of https://github.com/aszxqw/cppjieba into dev 2013-12-21 20:08:55 -08:00
Wu Yanyi
c87804758e Merge pull request #12 from aholic/dev
add unit test | fix bug in QuerySegment | make TrieManager look better
2013-12-21 20:09:23 -08:00
wyy
bbaa8b684d modify load_test 2013-12-21 20:08:40 -08:00
wyy
cac77cdedf change copyright 2013-12-21 19:14:48 -08:00
wyy
fa75f0f319 modify construction and init for segments 2013-12-21 09:37:12 -08:00
wyy
f89cf00552 init TfIdfKeyWord.hpp 2013-12-20 08:57:10 -08:00
wyy
670c7e4a13 finished TTrie.hpp 2013-12-19 09:07:00 -08:00
wyy
3395b57227 add ttrie.cpp 2013-12-19 08:22:09 -08:00
wyy
202e4670f1 modify README.md 2013-12-19 06:12:12 -08:00
wyy
335a7eff47 add THMMSegment.cpp and TMPSegment.cpp to fix a small error in using hmmsegment and mpsegment. 2013-12-18 22:42:46 -08:00
wyy
9f35b82dd1 add TMixSegment.cpp for testing 2013-12-18 22:24:39 -08:00
wyy
2e2036bb73 Merge branch 'master' into dev 2013-12-18 22:17:27 -08:00
Wu Yanyi
8c9907b27a Merge pull request #10 from aholic/dev
change QuerySegment algorithm | add TrieManager | add md5 for file
2013-12-18 22:07:01 -08:00
wyy
b669cf5db1 modify test/ && ci for lunch 2013-12-18 20:21:40 -08:00
wyy
24d5da946d modify test 2013-12-18 04:13:40 -08:00
Wu Yanyi
7cace45a2b Update README.md 2013-12-17 21:12:20 +08:00
Wu Yanyi
be97fbc78a Update README.md 2013-12-17 21:11:14 +08:00
aholic
14480a079a add unit test for md5File() 2013-12-17 04:46:33 +08:00
aholic
d9880feb03 someone forgot to assign a value to maxWordLen in QuerySegment 2013-12-17 02:30:27 +08:00
aholic
496e593d53 fix bug in QuerySegment | changes caused by init() 2013-12-17 01:27:53 +08:00
aholic
9af21d9658 merge head 2013-12-17 01:19:04 +08:00
aholic
072045979f fix a little bug in QuerySegment.hpp 2013-12-17 01:07:10 +08:00
aholic
218480aac1 add unit test for TrieManager 2013-12-17 01:06:44 +08:00
aholic
4f21617180 add test data for TrieManager 2013-12-17 01:05:46 +08:00
aholic
bdd1381810 clear tmp result to fix bug in QuerySegment.hpp 2013-12-17 00:03:25 +08:00
aholic
d8e00f7d62 make TrieManager.hpp look better 2013-12-17 00:02:48 +08:00
aholic
12a4eea111 add unit test for FullSegment and QuerySegment 2013-12-17 00:00:38 +08:00
aholic
17cd0bd899 add unit test for FullSegment and QuerySegment 2013-12-16 23:59:37 +08:00
aholic
3160aac468 update md5.hpp in limonp 2013-12-16 16:32:24 +08:00
aholic
82424cc7f5 add FullSegment QuerySegment TrieManager to README.md 2013-12-16 14:42:53 +08:00
aholic
7add684a8a change algorithm for QuerySegment(now is mix+full) | use TrieManager to get a trie for all Segment 2013-12-16 14:18:44 +08:00
aholic
a0f588a8af update md5.hpp in limonp | change map type in TrieManager.hpp 2013-12-16 07:01:50 +08:00
aholic
0bc5f6f00b Merge branch 'dev' of https://github.com/aszxqw/cppjieba into dev 2013-12-16 06:06:41 +08:00
aholic
7c7d5e29bc update Limonp, add TrieManager to manage tries 2013-12-16 06:03:04 +08:00
wyy
86b78e723d add unittest 2013-12-14 23:46:17 -08:00
wyy
3545eef281 modify test 2013-12-14 22:18:39 -08:00
wyy
72ba32dd0a add unittest using gtest 2013-12-14 22:08:03 -08:00
wyy
7744e7c36c add weicheng.utf8 2013-12-14 19:37:17 -08:00
Wu Yanyi
d47900d65a Merge pull request #9 from aholic/master
remove NO_CODING_LOG | make MixSegment look better
2013-12-14 06:27:25 -08:00
aholic
d54de9cd6b Merge branch 'dev' of https://github.com/aszxqw/cppjieba into dev 2013-12-14 13:58:42 +08:00
wyy
1b1ed6e3aa Merge branch 'dev' 2013-12-12 23:26:13 -08:00
wyy
e8f72692c1 modify test 2013-12-12 23:24:04 -08:00
wyy
f3e0df12f7 modify test 2013-12-12 23:21:27 -08:00
wyy
1e29d25855 use assert for getinitflag 2013-12-11 04:52:33 -08:00
wyy
acb4150e3c remove some unused code 2013-12-08 03:29:28 -08:00
wyy
313e05da1b ci for lunch 2013-12-07 20:25:28 -08:00
wyy
bcc2329a0e modify README.md 2013-12-07 08:11:43 -08:00
wyy
1169521c42 modify calcDAG to speed up 2013-12-07 08:05:05 -08:00
wyy
15685d5cf2 modify test 2013-12-07 07:46:16 -08:00
wyy
81c2d3caf1 modify calcDAG try to speed up 2013-12-07 07:45:06 -08:00
wyy
32bafd78f0 ci 2013-12-07 06:53:38 -08:00
wyy
5a82b61e02 modify test 2013-12-07 06:24:44 -08:00
Wu Yanyi
e982f730af Merge pull request #7 from aholic/master
add QuerySegment
2013-12-07 05:32:14 -08:00
wyy
18de12a21c add NO_FILTER macro 2013-12-07 17:32:11 +08:00
wyy
45a5df5856 fix bugs in SegmentBase when using filterAscii 2013-12-07 15:22:00 +08:00
wyy
8e8a68352b modify some gbk encoding handling to be more robust 2013-12-06 22:52:59 -08:00
wyy
7106a4475f Merge branch 'master' into dev 2013-12-06 06:50:48 -08:00
Wu Yanyi
5661a7ae3c Merge pull request #6 from aholic/master
add FullSegment
2013-12-06 06:50:37 -08:00
wyy
cde97bf9b8 remove chinesefilter 2013-12-06 06:19:54 -08:00
wyy
5692220756 add filterAscii 2013-12-06 06:01:45 -08:00
wyy
1bdce8904f update enc 2013-12-06 04:57:19 -08:00
wyy
1576d15b2f modify ChineseFilter.hpp to identify ansi char to speed up 2013-12-06 04:20:57 -08:00
wyy
0e61f02afe modify test 2013-12-06 04:15:53 -08:00
wyy
ceeff56cdd add test 2013-12-06 02:45:25 -08:00
wyy
d82ac83a8a modify cmakelists.txt for gbk encoding 2013-12-06 00:21:18 -08:00
wyy
b1a71f0495 add init/dispose into ISegment.hpp 2013-12-05 23:38:02 -08:00
wyy
2728d8311e ci for save 2013-12-05 22:13:22 -08:00
wyy
f79ce99a55 ci for dinner 2013-12-05 02:54:29 -08:00
wyy
bcff58ff36 modify segment.cpp for example 2013-12-05 01:04:16 -08:00
wyy
45a85ed845 Merge branch 'master' into dev 2013-12-05 00:59:26 -08:00
wyy
99bb3dd3b1 add testlines.gbk 2013-12-04 16:58:52 -08:00
wyy
2641b1d0a3 modify segment.cpp 2013-12-04 08:39:47 -08:00
wyy
fd7ff031d0 add gbk 2013-12-04 08:00:27 -08:00
wyy
35ba8f058e mv unicode <=> utf8 from transcode.hpp into Limonp/str_functs.hpp 2013-12-04 07:13:31 -08:00
Wu Yanyi
47ba6b60ee Merge pull request #5 from aholic/master
gg=G for MPSegment.hpp
2013-12-02 06:01:54 -08:00
wyy
fc55fb4ccc replace tab with space in Trie.hpp 2013-12-02 05:38:34 -08:00
wyy
455ae66ab6 update readme.md 2013-11-30 07:29:21 -08:00
wyy
5b8345539e update scripts/cjserver 2013-11-30 07:18:15 -08:00
wyy
751da14611 rm start.sh stop.sh 2013-11-30 05:21:01 -08:00
wyy
7dfb38f599 mv daemon out server.cpp 2013-11-30 05:18:15 -08:00
wyy
2e2fc0ff15 update husky to be hpp 2013-11-30 05:15:51 -08:00
wyy
bdb645ce69 Merge remote-tracking branch 'origin/hpp_ing' into dev 2013-11-30 05:12:18 -08:00
wyy
5799d6d487 update husky to be hpp 2013-11-30 05:12:10 -08:00
wyy
0ba5522b42 add cjserver mocked from redis-server 2013-11-30 05:08:44 -08:00
wyy
bbdd041ee5 rm globals.h 2013-11-30 13:06:50 +08:00
wyy
e8116cd07a rm globals.h 2013-11-30 13:05:16 +08:00
wyy
ccaeeb5bb0 delete structs.h 2013-11-30 12:52:17 +08:00
wyy
58e69783cc merge MixSegment.h/cpp into hpp 2013-11-30 12:41:31 +08:00
wyy
55c64e9893 merge HMMSegment.h/cpp into hpp 2013-11-30 12:34:57 +08:00
wyy
6484342c8f merge MPsegment.h/cpp into hpp 2013-11-30 12:26:57 +08:00
wyy
abfc3b4b6c merge trie.h/cpp into trie.hpp 2013-11-30 12:17:36 +08:00
aholic
599c130bd9 make MixSegment looks better 2013-11-28 10:49:40 +08:00
aholic
12328a3a7e remove macro NO_CODING_LOG 2013-11-28 09:17:29 +08:00
aholic
27bc6d0eb1 Merge https://github.com/aszxqw/cppjieba 2013-11-27 22:52:08 +08:00
aholic
dc5eeb5531 Merge branch 'master' of https://github.com/aholic/cppjieba 2013-11-27 21:07:39 +08:00
aholic
03f5144a3e add QuerySegment 2013-11-27 21:05:35 +08:00
aholic
a25007f032 add QuerySegment 2013-11-27 20:59:04 +08:00
aholic
eda4acceb5 add QuerySegment 2013-11-27 20:49:00 +08:00
aholic
f396047e49 add QuerySegment 2013-11-27 20:46:45 +08:00
aholic
ef8954f1fe merge upstream 2013-11-27 16:32:54 +08:00
aholic
26af60d867 add fullSegment 2013-11-27 16:16:10 +08:00
aholic
7f57443829 gg=G for MPSegment.hpp 2013-11-27 05:08:06 +08:00
115 changed files with 387221 additions and 4173 deletions

40
.github/workflows/cmake-arm64.yml vendored Normal file
View File

@ -0,0 +1,40 @@
name: CMake Windows ARM64
on:
push:
pull_request:
workflow_dispatch:
env:
BUILD_TYPE: Release
jobs:
build-windows-arm64:
runs-on: windows-2022
strategy:
matrix:
cpp_version: [11, 14, 17, 20]
steps:
- name: Check out repository code
uses: actions/checkout@v2
with:
submodules: recursive
- name: Configure CMake
# Configure CMake in a 'build' subdirectory. `CMAKE_BUILD_TYPE` is only required if you are using a single-configuration generator such as make.
# See https://cmake.org/cmake/help/latest/variable/CMAKE_BUILD_TYPE.html?highlight=cmake_build_type
# run: cmake -B ${{github.workspace}}/build -DCMAKE_BUILD_TYPE=${{env.BUILD_TYPE}}
run: cmake -B ${{github.workspace}}/build -DBUILD_TESTING=ON -DCMAKE_CXX_STANDARD=${{matrix.cpp_version}} -DCMAKE_BUILD_TYPE=${{env.BUILD_TYPE}}
- name: Build
# Build your program with the given configuration
# run: cmake --build ${{github.workspace}}/build --config ${{env.BUILD_TYPE}}
run: cmake --build ${{github.workspace}}/build --config ${{env.BUILD_TYPE}}
- name: Test
working-directory: ${{github.workspace}}/build
# Execute tests defined by the CMake configuration.
# See https://cmake.org/cmake/help/latest/manual/ctest.1.html for more detail
run: ctest -C ${{env.BUILD_TYPE}} --verbose

53
.github/workflows/cmake.yml vendored Normal file
View File

@ -0,0 +1,53 @@
name: CMake
on:
push:
pull_request:
env:
# Customize the CMake build type here (Release, Debug, RelWithDebInfo, etc.)
BUILD_TYPE: Release
jobs:
build:
# The CMake configure and build commands are platform agnostic and should work equally well on Windows or Mac.
# You can convert this to a matrix build if you need cross-platform coverage.
# See: https://docs.github.com/en/free-pro-team@latest/actions/learn-github-actions/managing-complex-workflows#using-a-build-matrix
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [
ubuntu-22.04,
ubuntu-latest,
macos-13,
macos-14,
macos-latest,
windows-2019,
windows-2022,
windows-latest,
]
cpp_version: [11, 14, 17, 20]
steps:
- name: Check out repository code
uses: actions/checkout@v2
with:
submodules: recursive
- name: Configure CMake
# Configure CMake in a 'build' subdirectory. `CMAKE_BUILD_TYPE` is only required if you are using a single-configuration generator such as make.
# See https://cmake.org/cmake/help/latest/variable/CMAKE_BUILD_TYPE.html?highlight=cmake_build_type
# run: cmake -B ${{github.workspace}}/build -DCMAKE_BUILD_TYPE=${{env.BUILD_TYPE}}
run: cmake -B ${{github.workspace}}/build -DBUILD_TESTING=ON -DCMAKE_CXX_STANDARD=${{matrix.cpp_version}} -DCMAKE_BUILD_TYPE=${{env.BUILD_TYPE}}
- name: Build
# Build your program with the given configuration
# run: cmake --build ${{github.workspace}}/build --config ${{env.BUILD_TYPE}}
run: cmake --build ${{github.workspace}}/build --config ${{env.BUILD_TYPE}}
- name: Test
working-directory: ${{github.workspace}}/build
# Execute tests defined by the CMake configuration.
# See https://cmake.org/cmake/help/latest/manual/ctest.1.html for more detail
run: ctest -C ${{env.BUILD_TYPE}} --verbose

25
.github/workflows/stale-issues.yml vendored Normal file
View File

@ -0,0 +1,25 @@
name: Close Stale Issues
on:
schedule:
- cron: '0 0 3 */3 *' # Every three months on the 3rd day at midnight
jobs:
stale:
runs-on: ubuntu-latest
permissions:
issues: write
pull-requests: write
steps:
- uses: actions/stale@v5
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
stale-issue-message: 'This issue has not been updated for over 1 year and will be marked as stale. If the issue still exists, please comment or update the issue, otherwise it will be closed after 7 days.'
close-issue-message: 'This issue has been automatically closed due to inactivity. If the issue still exists, please reopen it.'
days-before-issue-stale: 365
days-before-issue-close: 7
stale-issue-label: 'Stale'
exempt-issue-labels: 'pinned,security'
operations-per-run: 100

3
.gitignore vendored
View File

@ -14,3 +14,6 @@ prior.gbk
tmp
t.*
*.pid
build
Testing/Temporary/CTestCostData.txt
Testing/Temporary/LastTest.log

3
.gitmodules vendored Normal file
View File

@ -0,0 +1,3 @@
[submodule "deps/limonp"]
path = deps/limonp
url = https://github.com/yanyiwu/limonp.git

313
CHANGELOG.md Normal file
View File

@ -0,0 +1,313 @@
# CHANGELOG
## v5.5.0
+ feat: add Windows ARM64 build support
+ build: upgrade googletest from 1.11.0 to 1.12.1
+ build: update CMake minimum version requirement to 3.10
+ fix: make namespaces explicit and fix missing includes
+ ci: update stale-issues workflow configuration
## v5.4.0
+ unittest: class Jieba add default argument input
+ class Jieba: support default dictpath
+ cmake: avoid running tests when fetched via FetchContent by another project
+ class DictTrie: removed unused var
## v5.3.2
+ removed test/demo.cpp and linked https://github.com/yanyiwu/cppjieba-demo
+ Update Demo Link in README.md
+ [github/actions] mark issues stale after 1 year
+ limonp v0.9.0 -> v1.0.0
## v5.3.1
+ [cmake] fetch googletest
+ [submodules] rm test/googletest
## v5.3.0
+ [c++17,c++20] compatibility
+ limonp version 0.6.7 -> 0.9.0
## v5.2.0
+ [CI] windows-[2019,2022]
+ [googletest] v1.6.0->v1.10.0
+ [CI] ubuntu version from 20 to 22, macos version from 12 to 14
+ [CMake] minimum_required 2.6->3.5 and fix CXX_VERSION variable passed from the command line
+ [CI] matrix and multi cpp version [11, 14]
## v5.1.3
+ [googletest] git submodule add googletest-1.6.0
## v5.1.2
+ [submodule:deps/limonp] upgrade to v0.6.7
## v5.1.1
+ Merged [pr-186](https://github.com/yanyiwu/cppjieba/pull/186)
## v5.1.0
+ Merged [feature: add RemoveWord api from gojieba/pull/99 #172](https://github.com/yanyiwu/cppjieba/pull/172)
## v5.0.5
+ Merged [pr-171 submodule update limonp to v0.6.6 #171](https://github.com/yanyiwu/cppjieba/pull/171)
## v5.0.4
+ Merged [pr-168 limonp as submodule #168](https://github.com/yanyiwu/cppjieba/pull/168)
## v5.0.3
+ Upgrade [limonp](https://github.com/yanyiwu/limonp) -> v0.6.3
## v5.0.2
+ Upgrade [limonp](https://github.com/yanyiwu/limonp) -> v0.6.1
## v5.0.1
+ Make Compiler Happier.
+ Add PHP, DLang Links.
## v5.0.0
+ Notice (**API changed**): the Jieba class constructor goes from 3 arguments to 5, and KeywordExtractor is now used inside Jieba
## v4.8.1
+ add TextRankExtractor by [@questionfish] in [pull request 65](https://github.com/yanyiwu/cppjieba/pull/65)
+ add Jieba::ResetSeparators api for some special situation, for example in [issue67](https://github.com/yanyiwu/cppjieba/issues/67)
+ fix [issue70](https://github.com/yanyiwu/cppjieba/issues/70)
+ support (word, freq, tag) format in user_dict, see details in [pr74](https://github.com/yanyiwu/cppjieba/pull/74)
## v4.8.0
+ rewrite QuerySegment, make `Jieba::CutForSearch` behave the same as the [jieba] `cut_for_search` API
+ remove Jieba::SetQuerySegmentThreshold
## v4.7.0
api changes:
+ override Cut functions, add location information into Word results;
+ remove LevelSegment;
+ remove Jieba::Locate;
upgrade:
+ limonp -> v0.6.1
## v4.6.0
+ Change Jieba::Locate (deprecated) to be a static function.
+ Change the return value of KeywordExtractor::Extract from bool to void.
+ Add KeywordExtractor::Word and more overridden KeywordExtractor::Extract overloads
## v4.5.3
+ Upgrade limonp to v0.6.0
## v4.5.2
+ Upgrade limonp to v0.5.6 to fix hidden trouble.
## v4.5.1
+ Upgrade limonp to v0.5.5 to solve macro name conflicts in some special cases.
## v4.5.0
+ Removed the previous poorly designed uint16 optimization in Trie that replaced the map with an array;
its main problem was the assumption that every Unicode character fits in a uint16, which prevented broader support for characters from all languages.
+ Changed the Rune type from 16 bits to 32 bits to support more Unicode characters, including some rare Chinese characters.
## v4.4.1
+ Used valgrind to check for memory leaks and located a bug in HMM model initialization that caused a leak. The leak is not fatal,
because it only happens when the dictionary is loaded, which normally runs only once, so it does not cause serious problems.
+ Thanks to [qinwf] for helping me find this bug.
## v4.4.0
+ Adding code is easy, removing it is hard; after much thought, the Server functionality has been split out of this project.
+ This returns [cppjieba] to its original simplicity: the non-essential server code is gone, so the project can travel light and focus on the core segmentation code.
+ By the way, if you really need the old server-related code, look for it in the new repository [cppjieba-server](https://github.com/yanyiwu/cppjieba-server).
## v4.3.3
+ Yet another incompatibility repair: upgrade [limonp] to version v0.5.3 to fix an incompatibility problem on Windows
## v4.3.2
+ Upgrade [limonp] to version v0.5.2 to fix an incompatibility problem on Windows
## v4.3.1
+ Overloaded the KeywordExtractor constructor so that a Jieba instance can be passed in to supply the dictionary and model.
## v4.3.0
Source tree layout changes:
1. src/ -> include/cppjieba/
2. src/limonp/ -> deps/limonp/
3. server/husky -> deps/husky/
4. test/unittest/gtest -> deps/gtest
Dependency upgrades:
1. [limonp] to version v0.5.1
2. [husky] to version v0.2.0
## v4.2.1
1. Upgrade [limonp] to version v0.4.1, [husky] to version v0.2.0
## v4.2.0
1. Fixed the multi-dictionary separator issue on Windows mentioned in [issue50]: the separator changed from ':' to '|' or ';'.
## v4.1.2
1. Added the Jieba::Locate interface for computing the position of each word in the segmentation result; useful in scenarios such as highlighting search hits.
## v4.1.1
1. Added the part-of-speech tagging interface Jieba::Tag to class Jieba.
## v4.1.0
1. QuerySegment adds a check during segmentation: when a long word satisfies IsAllAscii (for example an English word), it is not split again at a finer granularity.
2. QuerySegment adds the SetMaxWordLen and GetMaxWordLen interfaces to set the word-length threshold that triggers the second segmentation pass.
3. Jieba adds SetQuerySegmentThreshold to set the word-length threshold used by CutForSearch.
## v4.0.0
1. Support loading multiple user dictionaries; multiple dictionary paths are separated by a colon (:), as a nod to the PATH environment variable.
2. User dictionaries carry no weights; new user words previously defaulted to the maximum frequency weight, which is now configurable and defaults to the median value.
3. [Compatibility warning] Some code-style changes, e.g. the namespace was lowercased from CppJieba to cppjieba.
4. [Compatibility warning] Application.hpp is deprecated in favor of Jieba.hpp; the interface was also heavily revised for a more uniform function style, closer to the Python version of Jieba.
## v3.2.1
1. Fixed a bug caused by an incorrect include guard in Jieba.hpp.
## v3.2.0
1. Adopted a somewhat tricky engineering optimization of the trie and dropped the previous `Aho-Corasick-Automation` implementation; the result is more readable and faster.
2. Added a hierarchical segmenter: LevelSegment.
3. Added fine-grained segmentation to MPSegment.
4. Added class Jieba, providing a more readable interface.
5. Dropped the unified ISegment interface, because it limited the flexibility of segmentation strategies and blocked some new features.
6. Added an optional hmm parameter (enabled by default) so that both MixSegment and QuerySegment can toggle new-word discovery.
## v3.1.0
1. Added an API for dynamically adding words to the dictionary: insertUserWord.
2. The cut function gained a default parameter and uses the Mix segmentation algorithm by default; see README.md for details on the algorithms.
## v3.0.1
1. Improved compatibility and fixed compilation errors in certain environments.
## v3.0.0
1. QuerySegment now supports a custom dictionary (optional parameter).
2. KeywordExtractor now supports a custom dictionary (optional parameter).
3. Changed the code style to follow the Google code style.
4. Added more detailed error logging, with sensible use of LogFatal during initialization.
5. Added the Application class, which bundles all of CppJieba's functionality so that users only need this one class.
6. Changed the cjserver service so that different segmentation algorithms can be selected via HTTP parameters.
7. Changed the make install destination so that everything installs into a single directory, /usr/local/cppjieba.
## v2.4.4
1. Refined two special filtering rules so that consecutive digits (including floating-point numbers) and consecutive letters are segmented separately instead of being mixed together.
2. Reworked the DAG data structure used by the dynamic programming step of the maximum-probability method (and the Trie's DAG lookup function), improving segmentation speed by 8%.
3. Used the `Aho-Corasick-Automation` algorithm to speed up trie lookup, among other optimizations, improving performance.
4. Added two special rules for part-of-speech tagging.
## v2.4.3
1. Updated the [husky] service code; the new [husky] is a simple thread-pool-based server framework. Also fixed possible data loss when an HTTP POST request body is too long.
2. Changed the PosTagger parameter structure, removing parameters that are currently unused, and added a parameter for a custom dictionary, i.e. support for **custom POS tags**.
3. Better support for `mac osx` (forgive the author for only buying a `mac` this late).
4. Added `Docker` support; see the `Dockerfile` for details.
## v2.4.2
1. On top of more careful use of `vector`, switched the `Unicode` type to `limonp/LocalVector.hpp`, among other optimizations, improving performance by roughly `30%`.
2. `cjserver` now supports user dictionaries, configured via `user_dict_path` in `conf/server.conf`.
3. Fixed incomplete `MPSegment` results when the sentence contains special characters.
4. Modified `FullSegment` to use less memory.
5. Fixed compilation failures with `-std=c++0x` or `-std=c++11`.
## v2.4.1
1. Improved segmentation of some special characters and letter sequences.
2. Sped up keyword extraction.
3. Provided an interface for user-defined dictionaries.
4. Moved the server-related code into its own `server/` directory.
5. Fixed single-character words in the user dictionary being ignored by MixSegment's new-word discovery. In other words, the user dictionary now has the highest priority, followed by the built-in dictionary, then newly discovered words.
## v2.4.0
1. Support older versions of `g++` and `cmake`, tested with `g++ 4.1.2` and `cmake 2.6`.
2. Adjusted some test files to reduce compile time during testing.
3. Fixed issues related to `make install`.
4. Added a POST request interface to the HTTP service.
5. Split `Trie.hpp` into `DictTrie.hpp` and `Trie.hpp` to abstract out the trie data structure, fixed a latent bug in the Trie class, and improved the unit tests.
6. Rewrote cjserver start and stop; see README.md for the new start/stop procedure.
## v2.3.4
1. Fixed a design problem by removing the `TrieManager` class to avoid potential hidden issues.
2. Added the `stop_words.utf8` dictionary and changed the `KeywordExtractor` initialization to use it.
3. Improved the structure of the `Trie`-related code.
## v2.3.3
1. Fixed results differing across machines due to the use of unordered_map.
2. Switched some data from unordered_map to map, improving segmentation speed by roughly 1/6 (unordered_map is fast for lookups but slower for range iteration).
## v2.3.2
1. Fixed unit tests whose results differed between x86 and x64.
2. Merged in a simple version of part-of-speech tagging.
## v2.3.1
1. Fixed a service startup problem during installation; the segmentation service is only an optional extra on Linux and does not affect the core code.
## v2.3.0
1. Added `KeywordExtractor.hpp` for keyword extraction.
2. Used `gtest` for unit testing.
## v2.2.0
1. Performance optimization: segmentation is roughly 6x faster.
2. Other changes I can no longer recall.
## v2.1.1 (everything before v2.1.1 is lumped together here)
1. Implemented the __maximum probability segmentation algorithm__ and the __HMM segmentation algorithm__, and combined them into `MixSegment`, which gives the best results.
2. Extensive refactoring: the main functional code now lives in hpp files.
3. Used `cmake` to manage the project.
4. Used [limonp] as the utility library for common functions such as logging and string operations.
5. Used [husky] as the server framework for a simple segmentation service.
[limonp]:http://github.com/yanyiwu/limonp.git
[husky]:http://github.com/yanyiwu/husky.git
[issue50]:https://github.com/yanyiwu/cppjieba/issues/50
[qinwf]:https://github.com/yanyiwu/cppjieba/pull/53#issuecomment-176264929
[jieba]:https://github.com/fxsjy/jieba
[@questionfish]:https://github.com/questionfish

View File

@ -1,7 +1,31 @@
CMAKE_MINIMUM_REQUIRED (VERSION 3.10)
PROJECT(CPPJIEBA)
SET(CMAKE_INSTALL_PREFIX /usr)
ADD_DEFINITIONS(-std=c++0x -O3)
ADD_SUBDIRECTORY(src)
ADD_SUBDIRECTORY(dicts)
ADD_SUBDIRECTORY(scripts)
ADD_SUBDIRECTORY(conf)
INCLUDE_DIRECTORIES(${PROJECT_SOURCE_DIR}/deps/limonp/include
${PROJECT_SOURCE_DIR}/include)
if(NOT DEFINED CMAKE_CXX_STANDARD)
set(CMAKE_CXX_STANDARD 11)
endif()
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)
ADD_DEFINITIONS(-O3 -g)
# Define a variable to check if this is the top-level project
if(NOT DEFINED CPPJIEBA_TOP_LEVEL_PROJECT)
if(CMAKE_CURRENT_SOURCE_DIR STREQUAL CMAKE_SOURCE_DIR)
set(CPPJIEBA_TOP_LEVEL_PROJECT ON)
else()
set(CPPJIEBA_TOP_LEVEL_PROJECT OFF)
endif()
endif()
if(CPPJIEBA_TOP_LEVEL_PROJECT)
ENABLE_TESTING()
message(STATUS "MSVC value: ${MSVC}")
ADD_SUBDIRECTORY(test)
ADD_TEST(NAME ./test/test.run COMMAND ./test/test.run)
ADD_TEST(NAME ./load_test COMMAND ./load_test)
endif()

View File

@ -1,6 +1,6 @@
The MIT License (MIT)
Copyright (c) 2013 Wu Yanyi
Copyright (c) 2013
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in

285
README.md
View File

@ -1,61 +1,85 @@
# CppJieba is the C++ version of the "Jieba" Chinese word segmentation library
# CppJieba
## Chinese Encodings
[![CMake](https://github.com/yanyiwu/cppjieba/actions/workflows/cmake.yml/badge.svg)](https://github.com/yanyiwu/cppjieba/actions/workflows/cmake.yml)
[![Author](https://img.shields.io/badge/author-@yanyiwu-blue.svg?style=flat)](http://yanyiwu.com/)
[![Platform](https://img.shields.io/badge/platform-Linux,macOS,Windows-green.svg?style=flat)](https://github.com/yanyiwu/cppjieba)
[![Performance](https://img.shields.io/badge/performance-excellent-brightgreen.svg?style=flat)](http://yanyiwu.com/work/2015/06/14/jieba-series-performance-test.html)
[![Tag](https://img.shields.io/github/v/tag/yanyiwu/cppjieba.svg)](https://github.com/yanyiwu/cppjieba/releases)
Segmentation is currently supported for both UTF-8 and GBK encoded text.
## Introduction
- the `master` branch supports `utf8` encoding
- the `gbk` branch supports `gbk` encoding
CppJieba is the C++ version of the "Jieba" Chinese word segmentation library.
## Installation and Usage
### Key Features
### Download and Install
- 🚀 High performance: stability and performance proven in production environments
- 📦 Easy to integrate: the source is shipped as header files (`include/cppjieba/*.hpp`); just include them to use it
- 🔍 Multiple segmentation modes: precise mode, full mode, search-engine mode, and more
- 📚 Custom dictionaries: user dictionaries are supported, including multiple dictionary paths (separated by '|' or ';')
- 💻 Cross-platform: Linux, macOS, and Windows are supported
- 🌈 UTF-8 encoding: native support for processing Chinese text in UTF-8
## Quick Start
### Requirements
- C++ compiler:
- g++ (4.1 or later recommended)
- or clang++
- cmake (2.6 or later recommended)
### Installation Steps
```sh
wget https://github.com/aszxqw/cppjieba/archive/master.zip -O cppjieba-master.zip
unzip cppjieba-master.zip
cd cppjieba-master
git clone https://github.com/yanyiwu/cppjieba.git
cd cppjieba
git submodule init
git submodule update
mkdir build
cd build
cmake ..
make
sudo make install
make test
```
#### Verify
```sh
/usr/bin/cjseg.sh ../test/testlines.utf8
```
### Start the Service
## Usage Example
```
#start
/etc/init.d/CppJieba/start.sh
#stop
/etc/init.d/CppJieba/stop.sh
./demo
```
#### Verify the Service
Sample output:
Then open `http://127.0.0.1:11200/?key=南京市长江大桥` in the Chrome browser
(Chrome is suggested because its default encoding is UTF-8)
Or use the command `curl "http://127.0.0.1:11200/?key=南京市长江大桥"` (on Ubuntu, install curl with `sudo apt-get install curl`)
### Uninstall
```sh
cd build/
cat install_manifest.txt | sudo xargs rm -rf
```
[demo] Cut With HMM
他/来到/了/网易/杭研/大厦
[demo] Cut Without HMM
他/来到/了/网易/杭/研/大厦
我来到北京清华大学
[demo] CutAll
我/来到/北京/清华/清华大学/华大/大学
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
[demo] CutForSearch
小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所//后/在/日本/京都/大学/日本京都大学/深造
[demo] Insert User Word
男默/女泪
男默女泪
[demo] CutForSearch Word With Offset
[{"word": "小明", "offset": 0}, {"word": "硕士", "offset": 6}, {"word": "毕业", "offset": 12}, {"word": "于", "offset": 18}, {"word": "中国", "offset": 21}, {"word": "科学", "offset": 27}, {"word": "学院", "offset": 30}, {"word": "科学院", "offset": 27}, {"word": "中国科学院", "offset": 21}, {"word": "计算", "offset": 36}, {"word": "计算所", "offset": 36}, {"word": "", "offset": 45}, {"word": "后", "offset": 48}, {"word": "在", "offset": 51}, {"word": "日本", "offset": 54}, {"word": "京都", "offset": 60}, {"word": "大学", "offset": 66}, {"word": "日本京都大学", "offset": 54}, {"word": "深造", "offset": 72}]
[demo] Tagging
我是拖拉机学院手扶拖拉机专业的。不用多久我就会升职加薪当上CEO走上人生巅峰。
[我:r, 是:v, 拖拉机:n, 学院:n, 手扶拖拉机:n, 专业:n, 的:uj, 。:x, 不用:v, 多久:m, :x, 我:r, 就:d, 会:v, 升职:v, 加薪:nr, :x, 当上:t, CEO:eng, :x, 走上:v, 人生:n, 巅峰:n, 。:x]
[demo] Keyword Extraction
我是拖拉机学院手扶拖拉机专业的。不用多久我就会升职加薪当上CEO走上人生巅峰。
[{"word": "CEO", "offset": [93], "weight": 11.7392}, {"word": "升职", "offset": [72], "weight": 10.8562}, {"word": "加薪", "offset": [78], "weight": 10.6426}, {"word": "手扶拖拉机", "offset": [21], "weight": 10.0089}, {"word": "巅峰", "offset": [111], "weight": 9.49396}]
```
For more details, please see [demo](https://github.com/yanyiwu/cppjieba-demo).
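The output above is produced by the bundled demo program; below is a minimal sketch of the corresponding calls, assuming the dictionary files sit under `dict/` as in this repository and using the `cppjieba::Jieba` API shown later in this diff. The `Join` helper is only for display.
```cpp
#include <iostream>
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

// Joins segmented words with '/' for display, mirroring the demo output format.
static std::string Join(const std::vector<std::string>& words) {
  std::string out;
  for (size_t i = 0; i < words.size(); i++) {
    if (i) out += "/";
    out += words[i];
  }
  return out;
}

int main() {
  // Dictionary paths are assumptions for illustration; adjust them to your checkout layout.
  cppjieba::Jieba jieba("dict/jieba.dict.utf8",
                        "dict/hmm_model.utf8",
                        "dict/user.dict.utf8",
                        "dict/idf.utf8",
                        "dict/stop_words.utf8");

  std::vector<std::string> words;

  jieba.Cut("他来到了网易杭研大厦", words, true);              // Mix mode (MP + HMM)
  std::cout << "[demo] Cut With HMM\n" << Join(words) << "\n";

  jieba.CutAll("我来到北京清华大学", words);                   // full mode
  std::cout << "[demo] CutAll\n" << Join(words) << "\n";

  jieba.CutForSearch("小明硕士毕业于中国科学院计算所", words);  // search-engine mode
  std::cout << "[demo] CutForSearch\n" << Join(words) << "\n";
  return 0;
}
```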
### Segmentation Result Examples
## Segmentation Results
### MPSegment's demo
**MPSegment**
Output:
```
@ -68,103 +92,154 @@ Output:
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小/明/硕士/毕业/于/中国科学院/计算所//后/在/日本京都大学/深造
我来自北京邮电大学。。。学号091111xx。。。
我/来自/北京邮电大学/。。。/学/号/091111xx/。。。
```
### HMMSegment's demo
**HMMSegment**
Output:
```
我来到北京清华大学
我来/到/北京/清华大学
他来到了网易杭研大厦
他来/到/了/网易/杭/研大厦
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业于/中国/科学院/计算所//后/在/日/本/京/都/大/学/深/造
我来自北京邮电大学。。。学号091111xx。。。
我来/自北京/邮电大学/。。。/学号/091111xx/。。。
```
### MixSegment's demo
**MixSegment**
Output:
```
我来到北京清华大学
我/来到/北京/清华大学
他来到了网易杭研大厦
他/来到/了/网易/杭研/大厦
杭研
杭研
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业/于/中国科学院/计算所//后/在/日本京都大学/深造
我来自北京邮电大学。。。学号091111xx。。。
我/来自/北京邮电大学/。。。/学号/091111xx/。。。
```
### Analysis of the Results
**FullSegment**
```
我来到北京清华大学
我/来到/北京/清华/清华大学/华大/大学
他来到了网易杭研大厦
他/来到/了/网易/杭/研/大厦
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小/明/硕士/毕业/于/中国/中国科学院/科学/科学院/学院/计算/计算所//后/在/日本/日本京都大学/京都/京都大学/大学/深造
```
**QuerySegment**
```
我来到北京清华大学
我/来到/北京/清华/清华大学/华大/大学
他来到了网易杭研大厦
他/来到/了/网易/杭研/大厦
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业/于/中国/中国科学院/科学/科学院/学院/计算所//后/在/中国/中国科学院/科学/科学院/学院/日本/日本京都大学/京都/京都大学/大学/深造
```
The above shows the results of the MP, HMM, and Mix methods, in that order.
As you can see, Mix, the algorithm that combines MP and HMM, gives the best results: it accurately segments words that are already in the dictionary and can also recognize out-of-vocabulary words such as "杭研".
The Full method outputs every word found in the dictionary.
The Query method first segments with Mix, then applies the Full method to the longer words in that result.
### Custom User Dictionary
See `dict/user.dict.utf8` for an example user dictionary.
Result without the custom user dictionary:
```
令狐冲/是/云/计算/行业/的/专家
```
Result with the custom user dictionary:
```
令狐冲/是/云计算/行业/的/专家
```
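The same effect is available at runtime without editing the dictionary file. A small sketch, assuming a `jieba` instance that was constructed without a user dictionary (so that 云计算 is initially unknown):
```cpp
// Runtime user-dictionary updates; "jieba" is assumed to be a cppjieba::Jieba
// instance built without a user dictionary.
std::vector<std::string> words;

jieba.Cut("令狐冲是云计算行业的专家", words, true);
// -> 令狐冲/是/云/计算/行业/的/专家

jieba.InsertUserWord("云计算");                 // default weight and tag
// jieba.InsertUserWord("区块链", 10, "nz");    // or with an explicit frequency and tag

jieba.Cut("令狐冲是云计算行业的专家", words, true);
// -> 令狐冲/是/云计算/行业/的/专家
```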
### Keyword Extraction
```
我是拖拉机学院手扶拖拉机专业的。不用多久我就会升职加薪当上CEO走上人生巅峰。
["CEO:11.7392", "升职:10.8562", "加薪:10.6426", "手扶拖拉机:10.0089", "巅峰:9.49396"]
```
For more details, please see [demo](https://github.com/yanyiwu/cppjieba-demo).
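These weights come from `Jieba::extractor`, a `KeywordExtractor` (shown later in this diff). A short sketch, continuing the `jieba` instance and includes from the earlier example:
```cpp
// Top-N keyword extraction via the public "extractor" member of cppjieba::Jieba.
std::string doc = "我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。";
std::vector<cppjieba::KeywordExtractor::Word> keywords;
jieba.extractor.Extract(doc, keywords, 5);      // top 5 by TF-IDF weight
for (size_t i = 0; i < keywords.size(); i++) {
  std::cout << keywords[i] << std::endl;        // {"word": ..., "offset": [...], "weight": ...}
}
```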
### Part-of-Speech Tagging
```
我是蓝翔技工拖拉机学院手扶拖拉机专业的。不用多久我就会升职加薪当上总经理出任CEO迎娶白富美走上人生巅峰。
["我:r", "是:v", "拖拉机:n", "学院:n", "手扶拖拉机:n", "专业:n", "的:uj", "。:x", "不用:v", "多久:m", ":x", "我:r", "就:d", "会:v", "升职:v", "加薪:nr", ":x", "当上:t", "CEO:eng", ":x", "走上:v", "人生:n", "巅峰:n", "。:x"]
```
For more details, please see [demo](https://github.com/yanyiwu/cppjieba-demo).
Custom POS tags are supported.
For example, add the following line to `dict/user.dict.utf8`:
```
蓝翔 nz
```
The result is then:
```
["我:r", "是:v", "蓝翔:nz", "技工:n", "拖拉机:n", "学院:n", "手扶拖拉机:n", "专业:n", "的:uj", "。:x", "不用:v", "多久:m", ":x", "我:r", "就:d", "会:v", "升职:v", "加薪:nr", ":x", "当:t", "上:f", "总经理:n", ":x", "出任:v", "CEO:eng", ":x", "迎娶:v", "白富美:x", ":x", "走上:v", "人生:n", "巅峰:n", "。:x"]
```
## Other Shared Dictionary Resources
+ [dict.367W.utf8] iLife(562193561 at qq.com)
## Ecosystem
CppJieba is widely used as the basis for segmentation implementations in many other languages:
- [GoJieba](https://github.com/yanyiwu/gojieba) - Go version
- [NodeJieba](https://github.com/yanyiwu/nodejieba) - Node.js version
- [CJieba](https://github.com/yanyiwu/cjieba) - C version
- [jiebaR](https://github.com/qinwf/jiebaR) - R version
- [exjieba](https://github.com/falood/exjieba) - Erlang version
- [jieba_rb](https://github.com/altkatz/jieba_rb) - Ruby version
- [iosjieba](https://github.com/yanyiwu/iosjieba) - iOS version
- [phpjieba](https://github.com/jonnywang/phpjieba) - PHP version
- [perl5-jieba](https://metacpan.org/pod/distribution/Lingua-ZH-Jieba/lib/Lingua/ZH/Jieba.pod) - Perl version
### Projects Using CppJieba
- [simhash](https://github.com/yanyiwu/simhash) - Chinese document similarity computation
- [pg_jieba](https://github.com/jaiminpan/pg_jieba) - PostgreSQL segmentation extension
- [gitbook-plugin-search-pro](https://plugins.gitbook.com/plugin/search-pro) - Gitbook Chinese search plugin
- [ngx_http_cppjieba_module](https://github.com/yanyiwu/ngx_http_cppjieba_module) - Nginx segmentation module
## Contributing
Contributions of all kinds are welcome, including but not limited to:
- Reporting issues and suggestions
- Improving the documentation
- Submitting bug fixes
- Adding new features
## Module Details
The project is mainly organized into the following directories:
### src
The core directory containing the main source code.
#### Trie
Trie.cpp/Trie.h loads the dictionary into a trie, used mainly by the Segment modules.
#### Segment Modules
MPSegment.cpp/MPSegment.h
Maximum Probability (MP) method: builds a directed acyclic graph (DAG) from the trie and runs dynamic programming over it; this is the core of the segmentation algorithm.
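A sketch of the recurrence behind that dynamic program (the notation here is an assumption for illustration, not taken from the source): let $w(i,j)$ be the log-weight of a dictionary word covering characters $i..j$ and $\mathrm{score}(i)$ the best score for the suffix starting at $i$; then

$$
\mathrm{score}(i) = \max_{(i,j)\in \mathrm{DAG}} \bigl[\, w(i,j) + \mathrm{score}(j+1) \,\bigr], \qquad \mathrm{score}(n) = 0,
$$

and the segmentation is recovered by following the maximizing edges from position 0. This is what `CalcDP` and `CutByDag` in `MPSegment.hpp` (shown later in this diff) implement, filling the table from right to left.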
HMMSegment.cpp/HMMSegment.h
Segments with a hidden Markov model; the main idea is to represent each character's hidden state with one of four states (B, E, M, S).
The HMM model is provided by `hmm_model.utf8` under dicts/.
The segmentation algorithm is the Viterbi algorithm.
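To make the (B, E, M, S) idea concrete: each character gets one label, and a word ends at every E (end of a multi-character word) or S (single-character word). Below is a tiny illustrative helper, not part of the library, that applies this rule the same way `HMMSegment::InternalCut` does later in this diff:
```cpp
#include <string>
#include <vector>

// Converts one label per character into words: a word ends at every E or S.
// Labels and characters are assumed to be aligned one-to-one.
// Example: the runes of 我来到北京 with labels "SBEBE" -> 我 / 来到 / 北京.
std::vector<std::vector<char32_t> > CutByLabels(const std::vector<char32_t>& runes,
                                                const std::string& labels) {
  std::vector<std::vector<char32_t> > words;
  std::vector<char32_t> cur;
  for (size_t i = 0; i < runes.size() && i < labels.size(); i++) {
    cur.push_back(runes[i]);
    if (labels[i] == 'E' || labels[i] == 'S') {  // word boundary
      words.push_back(cur);
      cur.clear();
    }
  }
  return words;
}
```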
#### TransCode Module
TransCode.cpp/TransCode.h handles encoding conversion: it converts UTF-8 and GBK into `uint16_t` values and back.
### src/Husky
Framework code for serving;
see https://github.com/aszxqw/husky for details.
### src/Limonp
Mainly utility functions such as string operations.
Just include the headers to use them.
See https://github.com/aszxqw/limonp for details.
## Segmentation Speed
### MixSegment
Segmentation speed is roughly 62M / 54sec = 1.15M/sec
Test environment: `Intel(R) Xeon(R) CPU E5506 @ 2.13GHz`
## Contact
For problems running the library or any other questions, feel free to contact: wuyanyi09@gmail.com
## Acknowledgements
Author of the original "Jieba" Chinese segmentation: SunJunyi
https://github.com/fxsjy/jieba
As the name suggests, CppJieba was written with SunJunyi's Python Jieba segmenter as its reference, so, remembering where it all came from, thanks once again to SunJunyi.
If you find CppJieba helpful, a star ⭐️ to support the project is appreciated!

View File

@ -1 +0,0 @@
INSTALL(FILES server.conf DESTINATION /etc/CppJieba)

View File

@ -1,16 +0,0 @@
# config
#socket listen port
port=11200
#number of thread
thread_num=4
#demon pid filepath
pid_file=/tmp/cppjieba_server.pid
#dict path
dict_path=/usr/share/CppJieba/dicts/jieba.dict.utf8
#model path
model_path=/usr/share/CppJieba/dicts/hmm_model.utf8

1
deps/limonp vendored Submodule

@ -0,0 +1 @@
Subproject commit 5c82a3f17e4e0adc6a5decfe245054b0ed533d1a

31
dict/README.md Normal file
View File

@ -0,0 +1,31 @@
# CppJieba Dictionaries
The file extension indicates the dictionary's encoding:
for example, filename.utf8 is UTF-8 encoded and filename.gbk is GBK encoded.
## Segmentation
### jieba.dict.utf8/gbk
The dictionary used by maximum-probability segmentation (MPSegment: Max Probability).
### hmm_model.utf8/gbk
The model file used by hidden-Markov-model segmentation (HMMSegment: Hidden Markov Model).
__MixSegment (which combines MPSegment and HMMSegment) uses both of the files above.__
## Keyword Extraction
### idf.utf8
IDF (Inverse Document Frequency)
KeywordExtractor uses the classic TF-IDF algorithm, so this dictionary is needed to supply the IDF values.
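In the extractor shown later in this diff, each candidate word's weight is its term frequency in the input multiplied by this IDF value, with the average IDF used as a fallback for words missing from the file:

$$
\mathrm{weight}(w) = \mathrm{tf}(w) \times \mathrm{IDF}(w)
$$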
### stop_words.utf8
The stop-word dictionary.

258826
dict/idf.utf8 Normal file

File diff suppressed because it is too large Load Diff

View File

@ -312698,7 +312698,6 @@ T恤 4 n
部属 1126 n
部属工作 3 n
部属院校 3 n
部手机 33 n
部族 643 n
部标 4 n
部省级 2 n

File diff suppressed because it is too large Load Diff

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,259 @@
#Initial state probabilities
#Format:
#state:probability
B,a:-4.7623052146
B,ad:-6.68006603678
B,ag:-3.14e+100
B,an:-8.69708322302
B,b:-5.01837436211
B,bg:-3.14e+100
B,c:-3.42388018495
B,d:-3.97504752976
B,df:-8.88897423083
B,dg:-3.14e+100
B,e:-8.56355183039
B,en:-3.14e+100
B,f:-5.49163041848
B,g:-3.14e+100
B,h:-13.53336513
B,i:-6.11578472756
B,in:-3.14e+100
B,j:-5.05761912847
B,jn:-3.14e+100
B,k:-3.14e+100
B,l:-4.90588358466
B,ln:-3.14e+100
B,m:-3.6524299819
B,mg:-3.14e+100
B,mq:-6.7869530014
B,n:-1.69662577975
B,ng:-3.14e+100
B,nr:-2.23104959138
B,nrfg:-5.87372217541
B,nrt:-4.98564273352
B,ns:-2.8228438315
B,nt:-4.84609166818
B,nz:-3.94698846058
B,o:-8.43349870215
B,p:-4.20098413209
B,q:-6.99812385896
B,qe:-3.14e+100
B,qg:-3.14e+100
B,r:-3.40981877908
B,rg:-3.14e+100
B,rr:-12.4347528413
B,rz:-7.94611647157
B,s:-5.52267359084
B,t:-3.36474790945
B,tg:-3.14e+100
B,u:-9.1639172775
B,ud:-3.14e+100
B,ug:-3.14e+100
B,uj:-3.14e+100
B,ul:-3.14e+100
B,uv:-3.14e+100
B,uz:-3.14e+100
B,v:-2.67405848743
B,vd:-9.04472876024
B,vg:-3.14e+100
B,vi:-12.4347528413
B,vn:-4.33156108902
B,vq:-12.1470707689
B,w:-3.14e+100
B,x:-3.14e+100
B,y:-9.84448567586
B,yg:-3.14e+100
B,z:-7.04568111149
B,zg:-3.14e+100
E,a:-3.14e+100
E,ad:-3.14e+100
E,ag:-3.14e+100
E,an:-3.14e+100
E,b:-3.14e+100
E,bg:-3.14e+100
E,c:-3.14e+100
E,d:-3.14e+100
E,df:-3.14e+100
E,dg:-3.14e+100
E,e:-3.14e+100
E,en:-3.14e+100
E,f:-3.14e+100
E,g:-3.14e+100
E,h:-3.14e+100
E,i:-3.14e+100
E,in:-3.14e+100
E,j:-3.14e+100
E,jn:-3.14e+100
E,k:-3.14e+100
E,l:-3.14e+100
E,ln:-3.14e+100
E,m:-3.14e+100
E,mg:-3.14e+100
E,mq:-3.14e+100
E,n:-3.14e+100
E,ng:-3.14e+100
E,nr:-3.14e+100
E,nrfg:-3.14e+100
E,nrt:-3.14e+100
E,ns:-3.14e+100
E,nt:-3.14e+100
E,nz:-3.14e+100
E,o:-3.14e+100
E,p:-3.14e+100
E,q:-3.14e+100
E,qe:-3.14e+100
E,qg:-3.14e+100
E,r:-3.14e+100
E,rg:-3.14e+100
E,rr:-3.14e+100
E,rz:-3.14e+100
E,s:-3.14e+100
E,t:-3.14e+100
E,tg:-3.14e+100
E,u:-3.14e+100
E,ud:-3.14e+100
E,ug:-3.14e+100
E,uj:-3.14e+100
E,ul:-3.14e+100
E,uv:-3.14e+100
E,uz:-3.14e+100
E,v:-3.14e+100
E,vd:-3.14e+100
E,vg:-3.14e+100
E,vi:-3.14e+100
E,vn:-3.14e+100
E,vq:-3.14e+100
E,w:-3.14e+100
E,x:-3.14e+100
E,y:-3.14e+100
E,yg:-3.14e+100
E,z:-3.14e+100
E,zg:-3.14e+100
M,a:-3.14e+100
M,ad:-3.14e+100
M,ag:-3.14e+100
M,an:-3.14e+100
M,b:-3.14e+100
M,bg:-3.14e+100
M,c:-3.14e+100
M,d:-3.14e+100
M,df:-3.14e+100
M,dg:-3.14e+100
M,e:-3.14e+100
M,en:-3.14e+100
M,f:-3.14e+100
M,g:-3.14e+100
M,h:-3.14e+100
M,i:-3.14e+100
M,in:-3.14e+100
M,j:-3.14e+100
M,jn:-3.14e+100
M,k:-3.14e+100
M,l:-3.14e+100
M,ln:-3.14e+100
M,m:-3.14e+100
M,mg:-3.14e+100
M,mq:-3.14e+100
M,n:-3.14e+100
M,ng:-3.14e+100
M,nr:-3.14e+100
M,nrfg:-3.14e+100
M,nrt:-3.14e+100
M,ns:-3.14e+100
M,nt:-3.14e+100
M,nz:-3.14e+100
M,o:-3.14e+100
M,p:-3.14e+100
M,q:-3.14e+100
M,qe:-3.14e+100
M,qg:-3.14e+100
M,r:-3.14e+100
M,rg:-3.14e+100
M,rr:-3.14e+100
M,rz:-3.14e+100
M,s:-3.14e+100
M,t:-3.14e+100
M,tg:-3.14e+100
M,u:-3.14e+100
M,ud:-3.14e+100
M,ug:-3.14e+100
M,uj:-3.14e+100
M,ul:-3.14e+100
M,uv:-3.14e+100
M,uz:-3.14e+100
M,v:-3.14e+100
M,vd:-3.14e+100
M,vg:-3.14e+100
M,vi:-3.14e+100
M,vn:-3.14e+100
M,vq:-3.14e+100
M,w:-3.14e+100
M,x:-3.14e+100
M,y:-3.14e+100
M,yg:-3.14e+100
M,z:-3.14e+100
M,zg:-3.14e+100
S,a:-3.90253968313
S,ad:-11.0484584802
S,ag:-6.95411391796
S,an:-12.8402179494
S,b:-6.47288876397
S,bg:-3.14e+100
S,c:-4.78696679586
S,d:-3.90391976418
S,df:-3.14e+100
S,dg:-8.9483976513
S,e:-5.94251300628
S,en:-3.14e+100
S,f:-5.19482024998
S,g:-6.50782681533
S,h:-8.65056320738
S,i:-3.14e+100
S,in:-3.14e+100
S,j:-4.91199211964
S,jn:-3.14e+100
S,k:-6.94032059583
S,l:-3.14e+100
S,ln:-3.14e+100
S,m:-3.26920065212
S,mg:-10.8253149289
S,mq:-3.14e+100
S,n:-3.85514838976
S,ng:-4.9134348611
S,nr:-4.48366310396
S,nrfg:-3.14e+100
S,nrt:-3.14e+100
S,ns:-3.14e+100
S,nt:-12.1470707689
S,nz:-3.14e+100
S,o:-8.46446092775
S,p:-2.98684018136
S,q:-4.88865861826
S,qe:-3.14e+100
S,qg:-3.14e+100
S,r:-2.76353367841
S,rg:-10.2752685919
S,rr:-3.14e+100
S,rz:-3.14e+100
S,s:-3.14e+100
S,t:-3.14e+100
S,tg:-6.27284253188
S,u:-6.94032059583
S,ud:-7.72823016105
S,ug:-7.53940370266
S,uj:-6.85251045118
S,ul:-8.41537131755
S,uv:-8.15808672229
S,uz:-9.29925862537
S,v:-3.05329230341
S,vd:-3.14e+100
S,vg:-5.94301818437
S,vi:-3.14e+100
S,vn:-11.4539235883
S,vq:-3.14e+100
S,w:-3.14e+100
S,x:-8.42741965607
S,y:-6.19707946995
S,yg:-13.53336513
S,z:-3.14e+100
S,zg:-3.14e+100

File diff suppressed because it is too large Load Diff

1534
dict/stop_words.utf8 Normal file

File diff suppressed because it is too large Load Diff

4
dict/user.dict.utf8 Normal file
View File

@ -0,0 +1,4 @@
云计算
韩玉鉴赏
蓝翔 nz
区块链 10 nz

View File

@ -1 +0,0 @@
INSTALL(FILES hmm_model.utf8 jieba.dict.utf8 DESTINATION share/CppJieba/dicts)

View File

@ -0,0 +1,280 @@
#ifndef CPPJIEBA_DICT_TRIE_HPP
#define CPPJIEBA_DICT_TRIE_HPP
#include <algorithm>
#include <fstream>
#include <cstring>
#include <cstdlib>
#include <cmath>
#include <deque>
#include <set>
#include <string>
#include <unordered_set>
#include "limonp/StringUtil.hpp"
#include "limonp/Logging.hpp"
#include "Unicode.hpp"
#include "Trie.hpp"
namespace cppjieba {
const double MIN_DOUBLE = -3.14e+100;
const double MAX_DOUBLE = 3.14e+100;
const size_t DICT_COLUMN_NUM = 3;
const char* const UNKNOWN_TAG = "";
class DictTrie {
public:
enum UserWordWeightOption {
WordWeightMin,
WordWeightMedian,
WordWeightMax,
}; // enum UserWordWeightOption
DictTrie(const std::string& dict_path, const std::string& user_dict_paths = "", UserWordWeightOption user_word_weight_opt = WordWeightMedian) {
Init(dict_path, user_dict_paths, user_word_weight_opt);
}
~DictTrie() {
delete trie_;
}
bool InsertUserWord(const std::string& word, const std::string& tag = UNKNOWN_TAG) {
DictUnit node_info;
if (!MakeNodeInfo(node_info, word, user_word_default_weight_, tag)) {
return false;
}
active_node_infos_.push_back(node_info);
trie_->InsertNode(node_info.word, &active_node_infos_.back());
return true;
}
bool InsertUserWord(const std::string& word,int freq, const std::string& tag = UNKNOWN_TAG) {
DictUnit node_info;
double weight = freq ? log(1.0 * freq / freq_sum_) : user_word_default_weight_ ;
if (!MakeNodeInfo(node_info, word, weight , tag)) {
return false;
}
active_node_infos_.push_back(node_info);
trie_->InsertNode(node_info.word, &active_node_infos_.back());
return true;
}
bool DeleteUserWord(const std::string& word, const std::string& tag = UNKNOWN_TAG) {
DictUnit node_info;
if (!MakeNodeInfo(node_info, word, user_word_default_weight_, tag)) {
return false;
}
trie_->DeleteNode(node_info.word, &node_info);
return true;
}
const DictUnit* Find(RuneStrArray::const_iterator begin, RuneStrArray::const_iterator end) const {
return trie_->Find(begin, end);
}
void Find(RuneStrArray::const_iterator begin,
RuneStrArray::const_iterator end,
std::vector<struct Dag>&res,
size_t max_word_len = MAX_WORD_LENGTH) const {
trie_->Find(begin, end, res, max_word_len);
}
bool Find(const std::string& word)
{
const DictUnit *tmp = NULL;
RuneStrArray runes;
if (!DecodeUTF8RunesInString(word, runes))
{
XLOG(ERROR) << "Decode failed.";
}
tmp = Find(runes.begin(), runes.end());
if (tmp == NULL)
{
return false;
}
else
{
return true;
}
}
bool IsUserDictSingleChineseWord(const Rune& word) const {
return IsIn(user_dict_single_chinese_word_, word);
}
double GetMinWeight() const {
return min_weight_;
}
void InserUserDictNode(const std::string& line) {
std::vector<std::string> buf;
DictUnit node_info;
limonp::Split(line, buf, " ");
if(buf.size() == 1){
MakeNodeInfo(node_info,
buf[0],
user_word_default_weight_,
UNKNOWN_TAG);
} else if (buf.size() == 2) {
MakeNodeInfo(node_info,
buf[0],
user_word_default_weight_,
buf[1]);
} else if (buf.size() == 3) {
int freq = atoi(buf[1].c_str());
assert(freq_sum_ > 0.0);
double weight = log(1.0 * freq / freq_sum_);
MakeNodeInfo(node_info, buf[0], weight, buf[2]);
}
static_node_infos_.push_back(node_info);
if (node_info.word.size() == 1) {
user_dict_single_chinese_word_.insert(node_info.word[0]);
}
}
void LoadUserDict(const std::vector<std::string>& buf) {
for (size_t i = 0; i < buf.size(); i++) {
InserUserDictNode(buf[i]);
}
}
void LoadUserDict(const std::set<std::string>& buf) {
std::set<std::string>::const_iterator iter;
for (iter = buf.begin(); iter != buf.end(); iter++){
InserUserDictNode(*iter);
}
}
void LoadUserDict(const std::string& filePaths) {
std::vector<std::string> files = limonp::Split(filePaths, "|;");
for (size_t i = 0; i < files.size(); i++) {
std::ifstream ifs(files[i].c_str());
XCHECK(ifs.is_open()) << "open " << files[i] << " failed";
std::string line;
while(getline(ifs, line)) {
if (line.size() == 0) {
continue;
}
InserUserDictNode(line);
}
}
}
private:
void Init(const std::string& dict_path, const std::string& user_dict_paths, UserWordWeightOption user_word_weight_opt) {
LoadDict(dict_path);
freq_sum_ = CalcFreqSum(static_node_infos_);
CalculateWeight(static_node_infos_, freq_sum_);
SetStaticWordWeights(user_word_weight_opt);
if (user_dict_paths.size()) {
LoadUserDict(user_dict_paths);
}
Shrink(static_node_infos_);
CreateTrie(static_node_infos_);
}
void CreateTrie(const std::vector<DictUnit>& dictUnits) {
assert(dictUnits.size());
std::vector<Unicode> words;
std::vector<const DictUnit*> valuePointers;
for (size_t i = 0 ; i < dictUnits.size(); i ++) {
words.push_back(dictUnits[i].word);
valuePointers.push_back(&dictUnits[i]);
}
trie_ = new Trie(words, valuePointers);
}
bool MakeNodeInfo(DictUnit& node_info,
const std::string& word,
double weight,
const std::string& tag) {
if (!DecodeUTF8RunesInString(word, node_info.word)) {
XLOG(ERROR) << "UTF-8 decode failed for dict word: " << word;
return false;
}
node_info.weight = weight;
node_info.tag = tag;
return true;
}
void LoadDict(const std::string& filePath) {
std::ifstream ifs(filePath.c_str());
XCHECK(ifs.is_open()) << "open " << filePath << " failed.";
std::string line;
std::vector<std::string> buf;
DictUnit node_info;
while (getline(ifs, line)) {
limonp::Split(line, buf, " ");
XCHECK(buf.size() == DICT_COLUMN_NUM) << "split result illegal, line:" << line;
MakeNodeInfo(node_info,
buf[0],
atof(buf[1].c_str()),
buf[2]);
static_node_infos_.push_back(node_info);
}
}
static bool WeightCompare(const DictUnit& lhs, const DictUnit& rhs) {
return lhs.weight < rhs.weight;
}
void SetStaticWordWeights(UserWordWeightOption option) {
XCHECK(!static_node_infos_.empty());
std::vector<DictUnit> x = static_node_infos_;
std::sort(x.begin(), x.end(), WeightCompare);
min_weight_ = x[0].weight;
max_weight_ = x[x.size() - 1].weight;
median_weight_ = x[x.size() / 2].weight;
switch (option) {
case WordWeightMin:
user_word_default_weight_ = min_weight_;
break;
case WordWeightMedian:
user_word_default_weight_ = median_weight_;
break;
default:
user_word_default_weight_ = max_weight_;
break;
}
}
double CalcFreqSum(const std::vector<DictUnit>& node_infos) const {
double sum = 0.0;
for (size_t i = 0; i < node_infos.size(); i++) {
sum += node_infos[i].weight;
}
return sum;
}
void CalculateWeight(std::vector<DictUnit>& node_infos, double sum) const {
assert(sum > 0.0);
for (size_t i = 0; i < node_infos.size(); i++) {
DictUnit& node_info = node_infos[i];
assert(node_info.weight > 0.0);
node_info.weight = log(double(node_info.weight)/sum);
}
}
void Shrink(std::vector<DictUnit>& units) const {
std::vector<DictUnit>(units.begin(), units.end()).swap(units);
}
std::vector<DictUnit> static_node_infos_;
std::deque<DictUnit> active_node_infos_; // must not be std::vector
Trie * trie_;
double freq_sum_;
double min_weight_;
double max_weight_;
double median_weight_;
double user_word_default_weight_;
std::unordered_set<Rune> user_dict_single_chinese_word_;
};
}
#endif
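Most callers use this class indirectly through `Jieba`, which owns a `DictTrie`. A minimal direct-usage sketch follows; the dictionary paths are assumptions for illustration, and `WordWeightMedian` is spelled out even though it is the default.
```cpp
#include <iostream>
#include "cppjieba/DictTrie.hpp"

int main() {
  // Load the main dictionary plus a user dictionary; new user words get the
  // median weight of the static dictionary (WordWeightMedian).
  cppjieba::DictTrie trie("dict/jieba.dict.utf8",
                          "dict/user.dict.utf8",
                          cppjieba::DictTrie::WordWeightMedian);

  std::cout << std::boolalpha
            << trie.Find("清华大学") << "\n"   // true if the word is in the trie
            << trie.GetMinWeight() << "\n";    // smallest log-probability weight
  return 0;
}
```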

View File

@ -0,0 +1,93 @@
#ifndef CPPJIEBA_FULLSEGMENT_H
#define CPPJIEBA_FULLSEGMENT_H
#include <algorithm>
#include <set>
#include <cassert>
#include "limonp/Logging.hpp"
#include "DictTrie.hpp"
#include "SegmentBase.hpp"
#include "Unicode.hpp"
namespace cppjieba {
class FullSegment: public SegmentBase {
public:
FullSegment(const string& dictPath) {
dictTrie_ = new DictTrie(dictPath);
isNeedDestroy_ = true;
}
FullSegment(const DictTrie* dictTrie)
: dictTrie_(dictTrie), isNeedDestroy_(false) {
assert(dictTrie_);
}
~FullSegment() {
if (isNeedDestroy_) {
delete dictTrie_;
}
}
void Cut(const string& sentence,
vector<string>& words) const {
vector<Word> tmp;
Cut(sentence, tmp);
GetStringsFromWords(tmp, words);
}
void Cut(const string& sentence,
vector<Word>& words) const {
PreFilter pre_filter(symbols_, sentence);
PreFilter::Range range;
vector<WordRange> wrs;
wrs.reserve(sentence.size()/2);
while (pre_filter.HasNext()) {
range = pre_filter.Next();
Cut(range.begin, range.end, wrs);
}
words.clear();
words.reserve(wrs.size());
GetWordsFromWordRanges(sentence, wrs, words);
}
void Cut(RuneStrArray::const_iterator begin,
RuneStrArray::const_iterator end,
vector<WordRange>& res) const {
// result of searching in trie tree
LocalVector<pair<size_t, const DictUnit*> > tRes;
// max index of res's words
size_t maxIdx = 0;
// always equals to (uItr - begin)
size_t uIdx = 0;
// tmp variables
size_t wordLen = 0;
assert(dictTrie_);
vector<struct Dag> dags;
dictTrie_->Find(begin, end, dags);
for (size_t i = 0; i < dags.size(); i++) {
for (size_t j = 0; j < dags[i].nexts.size(); j++) {
size_t nextoffset = dags[i].nexts[j].first;
assert(nextoffset < dags.size());
const DictUnit* du = dags[i].nexts[j].second;
if (du == NULL) {
if (dags[i].nexts.size() == 1 && maxIdx <= uIdx) {
WordRange wr(begin + i, begin + nextoffset);
res.push_back(wr);
}
} else {
wordLen = du->word.size();
if (wordLen >= 2 || (dags[i].nexts.size() == 1 && maxIdx <= uIdx)) {
WordRange wr(begin + i, begin + nextoffset);
res.push_back(wr);
}
}
maxIdx = uIdx + wordLen > maxIdx ? uIdx + wordLen : maxIdx;
}
uIdx++;
}
}
private:
const DictTrie* dictTrie_;
bool isNeedDestroy_;
};
}
#endif

View File

@ -0,0 +1,129 @@
#ifndef CPPJIEBA_HMMMODEL_H
#define CPPJIEBA_HMMMODEL_H
#include "limonp/StringUtil.hpp"
#include "Trie.hpp"
namespace cppjieba {
using namespace limonp;
typedef unordered_map<Rune, double> EmitProbMap;
struct HMMModel {
/*
* STATUS:
* 0: HMMModel::B, 1: HMMModel::E, 2: HMMModel::M, 3:HMMModel::S
* */
enum {B = 0, E = 1, M = 2, S = 3, STATUS_SUM = 4};
HMMModel(const string& modelPath) {
memset(startProb, 0, sizeof(startProb));
memset(transProb, 0, sizeof(transProb));
statMap[0] = 'B';
statMap[1] = 'E';
statMap[2] = 'M';
statMap[3] = 'S';
emitProbVec.push_back(&emitProbB);
emitProbVec.push_back(&emitProbE);
emitProbVec.push_back(&emitProbM);
emitProbVec.push_back(&emitProbS);
LoadModel(modelPath);
}
~HMMModel() {
}
void LoadModel(const string& filePath) {
ifstream ifile(filePath.c_str());
XCHECK(ifile.is_open()) << "open " << filePath << " failed";
string line;
vector<string> tmp;
vector<string> tmp2;
//Load startProb
XCHECK(GetLine(ifile, line));
Split(line, tmp, " ");
XCHECK(tmp.size() == STATUS_SUM);
for (size_t j = 0; j< tmp.size(); j++) {
startProb[j] = atof(tmp[j].c_str());
}
//Load transProb
for (size_t i = 0; i < STATUS_SUM; i++) {
XCHECK(GetLine(ifile, line));
Split(line, tmp, " ");
XCHECK(tmp.size() == STATUS_SUM);
for (size_t j =0; j < STATUS_SUM; j++) {
transProb[i][j] = atof(tmp[j].c_str());
}
}
//Load emitProbB
XCHECK(GetLine(ifile, line));
XCHECK(LoadEmitProb(line, emitProbB));
//Load emitProbE
XCHECK(GetLine(ifile, line));
XCHECK(LoadEmitProb(line, emitProbE));
//Load emitProbM
XCHECK(GetLine(ifile, line));
XCHECK(LoadEmitProb(line, emitProbM));
//Load emitProbS
XCHECK(GetLine(ifile, line));
XCHECK(LoadEmitProb(line, emitProbS));
}
double GetEmitProb(const EmitProbMap* ptMp, Rune key,
double defVal)const {
EmitProbMap::const_iterator cit = ptMp->find(key);
if (cit == ptMp->end()) {
return defVal;
}
return cit->second;
}
bool GetLine(ifstream& ifile, string& line) {
while (getline(ifile, line)) {
Trim(line);
if (line.empty()) {
continue;
}
if (StartsWith(line, "#")) {
continue;
}
return true;
}
return false;
}
bool LoadEmitProb(const string& line, EmitProbMap& mp) {
if (line.empty()) {
return false;
}
vector<string> tmp, tmp2;
Unicode unicode;
Split(line, tmp, ",");
for (size_t i = 0; i < tmp.size(); i++) {
Split(tmp[i], tmp2, ":");
if (2 != tmp2.size()) {
XLOG(ERROR) << "emitProb illegal.";
return false;
}
if (!DecodeUTF8RunesInString(tmp2[0], unicode) || unicode.size() != 1) {
XLOG(ERROR) << "TransCode failed.";
return false;
}
mp[unicode[0]] = atof(tmp2[1].c_str());
}
return true;
}
char statMap[STATUS_SUM];
double startProb[STATUS_SUM];
double transProb[STATUS_SUM][STATUS_SUM];
EmitProbMap emitProbB;
EmitProbMap emitProbE;
EmitProbMap emitProbM;
EmitProbMap emitProbS;
vector<EmitProbMap* > emitProbVec;
}; // struct HMMModel
} // namespace cppjieba
#endif

View File

@ -0,0 +1,190 @@
#ifndef CPPJIBEA_HMMSEGMENT_H
#define CPPJIBEA_HMMSEGMENT_H
#include <iostream>
#include <fstream>
#include <memory.h>
#include <cassert>
#include "HMMModel.hpp"
#include "SegmentBase.hpp"
namespace cppjieba {
class HMMSegment: public SegmentBase {
public:
HMMSegment(const string& filePath)
: model_(new HMMModel(filePath)), isNeedDestroy_(true) {
}
HMMSegment(const HMMModel* model)
: model_(model), isNeedDestroy_(false) {
}
~HMMSegment() {
if (isNeedDestroy_) {
delete model_;
}
}
void Cut(const string& sentence,
vector<string>& words) const {
vector<Word> tmp;
Cut(sentence, tmp);
GetStringsFromWords(tmp, words);
}
void Cut(const string& sentence,
vector<Word>& words) const {
PreFilter pre_filter(symbols_, sentence);
PreFilter::Range range;
vector<WordRange> wrs;
wrs.reserve(sentence.size()/2);
while (pre_filter.HasNext()) {
range = pre_filter.Next();
Cut(range.begin, range.end, wrs);
}
words.clear();
words.reserve(wrs.size());
GetWordsFromWordRanges(sentence, wrs, words);
}
void Cut(RuneStrArray::const_iterator begin, RuneStrArray::const_iterator end, vector<WordRange>& res) const {
RuneStrArray::const_iterator left = begin;
RuneStrArray::const_iterator right = begin;
while (right != end) {
if (right->rune < 0x80) {
if (left != right) {
InternalCut(left, right, res);
}
left = right;
do {
right = SequentialLetterRule(left, end);
if (right != left) {
break;
}
right = NumbersRule(left, end);
if (right != left) {
break;
}
right ++;
} while (false);
WordRange wr(left, right - 1);
res.push_back(wr);
left = right;
} else {
right++;
}
}
if (left != right) {
InternalCut(left, right, res);
}
}
private:
// sequential letters rule
RuneStrArray::const_iterator SequentialLetterRule(RuneStrArray::const_iterator begin, RuneStrArray::const_iterator end) const {
Rune x = begin->rune;
if (('a' <= x && x <= 'z') || ('A' <= x && x <= 'Z')) {
begin ++;
} else {
return begin;
}
while (begin != end) {
x = begin->rune;
if (('a' <= x && x <= 'z') || ('A' <= x && x <= 'Z') || ('0' <= x && x <= '9')) {
begin ++;
} else {
break;
}
}
return begin;
}
//
RuneStrArray::const_iterator NumbersRule(RuneStrArray::const_iterator begin, RuneStrArray::const_iterator end) const {
Rune x = begin->rune;
if ('0' <= x && x <= '9') {
begin ++;
} else {
return begin;
}
while (begin != end) {
x = begin->rune;
if ( ('0' <= x && x <= '9') || x == '.') {
begin++;
} else {
break;
}
}
return begin;
}
void InternalCut(RuneStrArray::const_iterator begin, RuneStrArray::const_iterator end, vector<WordRange>& res) const {
vector<size_t> status;
Viterbi(begin, end, status);
RuneStrArray::const_iterator left = begin;
RuneStrArray::const_iterator right;
for (size_t i = 0; i < status.size(); i++) {
if (status[i] % 2) { //if (HMMModel::E == status[i] || HMMModel::S == status[i])
right = begin + i + 1;
WordRange wr(left, right - 1);
res.push_back(wr);
left = right;
}
}
}
void Viterbi(RuneStrArray::const_iterator begin,
RuneStrArray::const_iterator end,
vector<size_t>& status) const {
size_t Y = HMMModel::STATUS_SUM;
size_t X = end - begin;
size_t XYSize = X * Y;
size_t now, old, stat;
double tmp, endE, endS;
vector<int> path(XYSize);
vector<double> weight(XYSize);
//start
for (size_t y = 0; y < Y; y++) {
weight[0 + y * X] = model_->startProb[y] + model_->GetEmitProb(model_->emitProbVec[y], begin->rune, MIN_DOUBLE);
path[0 + y * X] = -1;
}
double emitProb;
for (size_t x = 1; x < X; x++) {
for (size_t y = 0; y < Y; y++) {
now = x + y*X;
weight[now] = MIN_DOUBLE;
path[now] = HMMModel::E; // warning
emitProb = model_->GetEmitProb(model_->emitProbVec[y], (begin+x)->rune, MIN_DOUBLE);
for (size_t preY = 0; preY < Y; preY++) {
old = x - 1 + preY * X;
tmp = weight[old] + model_->transProb[preY][y] + emitProb;
if (tmp > weight[now]) {
weight[now] = tmp;
path[now] = preY;
}
}
}
}
endE = weight[X-1+HMMModel::E*X];
endS = weight[X-1+HMMModel::S*X];
stat = 0;
if (endE >= endS) {
stat = HMMModel::E;
} else {
stat = HMMModel::S;
}
status.resize(X);
for (int x = X -1 ; x >= 0; x--) {
status[x] = stat;
stat = path[x + stat*X];
}
}
const HMMModel* model_;
bool isNeedDestroy_;
}; // class HMMSegment
} // namespace cppjieba
#endif

169
include/cppjieba/Jieba.hpp Normal file
View File

@ -0,0 +1,169 @@
#ifndef CPPJIEAB_JIEBA_H
#define CPPJIEAB_JIEBA_H
#include "QuerySegment.hpp"
#include "KeywordExtractor.hpp"
namespace cppjieba {
class Jieba {
public:
Jieba(const string& dict_path = "",
const string& model_path = "",
const string& user_dict_path = "",
const string& idf_path = "",
const string& stop_word_path = "")
: dict_trie_(getPath(dict_path, "jieba.dict.utf8"), getPath(user_dict_path, "user.dict.utf8")),
model_(getPath(model_path, "hmm_model.utf8")),
mp_seg_(&dict_trie_),
hmm_seg_(&model_),
mix_seg_(&dict_trie_, &model_),
full_seg_(&dict_trie_),
query_seg_(&dict_trie_, &model_),
extractor(&dict_trie_, &model_,
getPath(idf_path, "idf.utf8"),
getPath(stop_word_path, "stop_words.utf8")) {
}
~Jieba() {
}
struct LocWord {
string word;
size_t begin;
size_t end;
}; // struct LocWord
void Cut(const string& sentence, vector<string>& words, bool hmm = true) const {
mix_seg_.Cut(sentence, words, hmm);
}
void Cut(const string& sentence, vector<Word>& words, bool hmm = true) const {
mix_seg_.Cut(sentence, words, hmm);
}
void CutAll(const string& sentence, vector<string>& words) const {
full_seg_.Cut(sentence, words);
}
void CutAll(const string& sentence, vector<Word>& words) const {
full_seg_.Cut(sentence, words);
}
void CutForSearch(const string& sentence, vector<string>& words, bool hmm = true) const {
query_seg_.Cut(sentence, words, hmm);
}
void CutForSearch(const string& sentence, vector<Word>& words, bool hmm = true) const {
query_seg_.Cut(sentence, words, hmm);
}
void CutHMM(const string& sentence, vector<string>& words) const {
hmm_seg_.Cut(sentence, words);
}
void CutHMM(const string& sentence, vector<Word>& words) const {
hmm_seg_.Cut(sentence, words);
}
void CutSmall(const string& sentence, vector<string>& words, size_t max_word_len) const {
mp_seg_.Cut(sentence, words, max_word_len);
}
void CutSmall(const string& sentence, vector<Word>& words, size_t max_word_len) const {
mp_seg_.Cut(sentence, words, max_word_len);
}
void Tag(const string& sentence, vector<pair<string, string> >& words) const {
mix_seg_.Tag(sentence, words);
}
string LookupTag(const string &str) const {
return mix_seg_.LookupTag(str);
}
bool InsertUserWord(const string& word, const string& tag = UNKNOWN_TAG) {
return dict_trie_.InsertUserWord(word, tag);
}
bool InsertUserWord(const string& word,int freq, const string& tag = UNKNOWN_TAG) {
return dict_trie_.InsertUserWord(word,freq, tag);
}
bool DeleteUserWord(const string& word, const string& tag = UNKNOWN_TAG) {
return dict_trie_.DeleteUserWord(word, tag);
}
bool Find(const string& word)
{
return dict_trie_.Find(word);
}
void ResetSeparators(const string& s) {
//TODO
mp_seg_.ResetSeparators(s);
hmm_seg_.ResetSeparators(s);
mix_seg_.ResetSeparators(s);
full_seg_.ResetSeparators(s);
query_seg_.ResetSeparators(s);
}
const DictTrie* GetDictTrie() const {
return &dict_trie_;
}
const HMMModel* GetHMMModel() const {
return &model_;
}
void LoadUserDict(const vector<string>& buf) {
dict_trie_.LoadUserDict(buf);
}
void LoadUserDict(const set<string>& buf) {
dict_trie_.LoadUserDict(buf);
}
void LoadUserDict(const string& path) {
dict_trie_.LoadUserDict(path);
}
private:
static string pathJoin(const string& dir, const string& filename) {
if (dir.empty()) {
return filename;
}
char last_char = dir[dir.length() - 1];
if (last_char == '/' || last_char == '\\') {
return dir + filename;
} else {
#ifdef _WIN32
return dir + '\\' + filename;
#else
return dir + '/' + filename;
#endif
}
}
static string getCurrentDirectory() {
string path(__FILE__);
size_t pos = path.find_last_of("/\\");
return (pos == string::npos) ? "" : path.substr(0, pos);
}
static string getPath(const string& path, const string& default_file) {
if (path.empty()) {
string current_dir = getCurrentDirectory();
string parent_dir = current_dir.substr(0, current_dir.find_last_of("/\\"));
string grandparent_dir = parent_dir.substr(0, parent_dir.find_last_of("/\\"));
return pathJoin(pathJoin(grandparent_dir, "dict"), default_file);
}
return path;
}
DictTrie dict_trie_;
HMMModel model_;
// They share the same dict trie and model
MPSegment mp_seg_;
HMMSegment hmm_seg_;
MixSegment mix_seg_;
FullSegment full_seg_;
QuerySegment query_seg_;
public:
KeywordExtractor extractor;
}; // class Jieba
} // namespace cppjieba
#endif // CPPJIEAB_JIEBA_H
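Per the v5.4.0 changelog entry earlier in this diff ("class Jieba: support default dictpath"), the `getPath`/`pathJoin` helpers above let every constructor argument be omitted. A minimal sketch relying on that default resolution, which looks for `dict/` two directories above this header:
```cpp
#include <iostream>
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
  // With no arguments, dictionary paths are resolved relative to Jieba.hpp
  // via getPath(), i.e. <repo or install root>/dict/*.utf8.
  cppjieba::Jieba jieba;

  std::vector<std::string> words;
  jieba.Cut("他来到了网易杭研大厦", words);  // HMM enabled by default
  for (size_t i = 0; i < words.size(); i++) {
    std::cout << words[i] << "/";
  }
  std::cout << std::endl;
  return 0;
}
```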

View File

@ -0,0 +1,149 @@
#ifndef CPPJIEBA_KEYWORD_EXTRACTOR_H
#define CPPJIEBA_KEYWORD_EXTRACTOR_H
#include <algorithm>
#include <unordered_map>
#include <unordered_set>
#include "MixSegment.hpp"
namespace cppjieba {
/*utf8*/
class KeywordExtractor {
public:
struct Word {
std::string word;
std::vector<size_t> offsets;
double weight;
}; // struct Word
KeywordExtractor(const std::string& dictPath,
const std::string& hmmFilePath,
const std::string& idfPath,
const std::string& stopWordPath,
const std::string& userDict = "")
: segment_(dictPath, hmmFilePath, userDict) {
LoadIdfDict(idfPath);
LoadStopWordDict(stopWordPath);
}
KeywordExtractor(const DictTrie* dictTrie,
const HMMModel* model,
const std::string& idfPath,
const std::string& stopWordPath)
: segment_(dictTrie, model) {
LoadIdfDict(idfPath);
LoadStopWordDict(stopWordPath);
}
~KeywordExtractor() {
}
void Extract(const std::string& sentence, std::vector<std::string>& keywords, size_t topN) const {
std::vector<Word> topWords;
Extract(sentence, topWords, topN);
for (size_t i = 0; i < topWords.size(); i++) {
keywords.push_back(topWords[i].word);
}
}
void Extract(const std::string& sentence, std::vector<pair<std::string, double> >& keywords, size_t topN) const {
std::vector<Word> topWords;
Extract(sentence, topWords, topN);
for (size_t i = 0; i < topWords.size(); i++) {
keywords.push_back(pair<std::string, double>(topWords[i].word, topWords[i].weight));
}
}
void Extract(const std::string& sentence, std::vector<Word>& keywords, size_t topN) const {
std::vector<std::string> words;
segment_.Cut(sentence, words);
std::map<std::string, Word> wordmap;
size_t offset = 0;
for (size_t i = 0; i < words.size(); ++i) {
size_t t = offset;
offset += words[i].size();
if (IsSingleWord(words[i]) || stopWords_.find(words[i]) != stopWords_.end()) {
continue;
}
wordmap[words[i]].offsets.push_back(t);
wordmap[words[i]].weight += 1.0;
}
if (offset != sentence.size()) {
XLOG(ERROR) << "words illegal";
return;
}
keywords.clear();
keywords.reserve(wordmap.size());
for (std::map<std::string, Word>::iterator itr = wordmap.begin(); itr != wordmap.end(); ++itr) {
std::unordered_map<std::string, double>::const_iterator cit = idfMap_.find(itr->first);
if (cit != idfMap_.end()) {
itr->second.weight *= cit->second;
} else {
itr->second.weight *= idfAverage_;
}
itr->second.word = itr->first;
keywords.push_back(itr->second);
}
topN = min(topN, keywords.size());
std::partial_sort(keywords.begin(), keywords.begin() + topN, keywords.end(), Compare);
keywords.resize(topN);
}
private:
void LoadIdfDict(const std::string& idfPath) {
std::ifstream ifs(idfPath.c_str());
XCHECK(ifs.is_open()) << "open " << idfPath << " failed";
std::string line ;
std::vector<std::string> buf;
double idf = 0.0;
double idfSum = 0.0;
size_t lineno = 0;
for (; getline(ifs, line); lineno++) {
buf.clear();
if (line.empty()) {
XLOG(ERROR) << "lineno: " << lineno << " empty. skipped.";
continue;
}
limonp::Split(line, buf, " ");
if (buf.size() != 2) {
XLOG(ERROR) << "line: " << line << ", lineno: " << lineno << " empty. skipped.";
continue;
}
idf = atof(buf[1].c_str());
idfMap_[buf[0]] = idf;
idfSum += idf;
}
assert(lineno);
idfAverage_ = idfSum / lineno;
assert(idfAverage_ > 0.0);
}
void LoadStopWordDict(const std::string& filePath) {
std::ifstream ifs(filePath.c_str());
XCHECK(ifs.is_open()) << "open " << filePath << " failed";
std::string line ;
while (getline(ifs, line)) {
stopWords_.insert(line);
}
assert(stopWords_.size());
}
static bool Compare(const Word& lhs, const Word& rhs) {
return lhs.weight > rhs.weight;
}
MixSegment segment_;
std::unordered_map<std::string, double> idfMap_;
double idfAverage_;
std::unordered_set<std::string> stopWords_;
}; // class KeywordExtractor
inline std::ostream& operator << (std::ostream& os, const KeywordExtractor::Word& word) {
return os << "{\"word\": \"" << word.word << "\", \"offset\": " << word.offsets << ", \"weight\": " << word.weight << "}";
}
} // namespace cppjieba
#endif

View File

@ -0,0 +1,137 @@
#ifndef CPPJIEBA_MPSEGMENT_H
#define CPPJIEBA_MPSEGMENT_H
#include <algorithm>
#include <set>
#include <cassert>
#include "limonp/Logging.hpp"
#include "DictTrie.hpp"
#include "SegmentTagged.hpp"
#include "PosTagger.hpp"
namespace cppjieba {
class MPSegment: public SegmentTagged {
public:
MPSegment(const string& dictPath, const string& userDictPath = "")
: dictTrie_(new DictTrie(dictPath, userDictPath)), isNeedDestroy_(true) {
}
MPSegment(const DictTrie* dictTrie)
: dictTrie_(dictTrie), isNeedDestroy_(false) {
assert(dictTrie_);
}
~MPSegment() {
if (isNeedDestroy_) {
delete dictTrie_;
}
}
void Cut(const string& sentence, vector<string>& words) const {
Cut(sentence, words, MAX_WORD_LENGTH);
}
void Cut(const string& sentence,
vector<string>& words,
size_t max_word_len) const {
vector<Word> tmp;
Cut(sentence, tmp, max_word_len);
GetStringsFromWords(tmp, words);
}
void Cut(const string& sentence,
vector<Word>& words,
size_t max_word_len = MAX_WORD_LENGTH) const {
PreFilter pre_filter(symbols_, sentence);
PreFilter::Range range;
vector<WordRange> wrs;
wrs.reserve(sentence.size()/2);
while (pre_filter.HasNext()) {
range = pre_filter.Next();
Cut(range.begin, range.end, wrs, max_word_len);
}
words.clear();
words.reserve(wrs.size());
GetWordsFromWordRanges(sentence, wrs, words);
}
void Cut(RuneStrArray::const_iterator begin,
RuneStrArray::const_iterator end,
vector<WordRange>& words,
size_t max_word_len = MAX_WORD_LENGTH) const {
vector<Dag> dags;
dictTrie_->Find(begin,
end,
dags,
max_word_len);
CalcDP(dags);
CutByDag(begin, end, dags, words);
}
const DictTrie* GetDictTrie() const {
return dictTrie_;
}
bool Tag(const string& src, vector<pair<string, string> >& res) const {
return tagger_.Tag(src, res, *this);
}
bool IsUserDictSingleChineseWord(const Rune& value) const {
return dictTrie_->IsUserDictSingleChineseWord(value);
}
private:
void CalcDP(vector<Dag>& dags) const {
size_t nextPos;
const DictUnit* p;
double val;
for (vector<Dag>::reverse_iterator rit = dags.rbegin(); rit != dags.rend(); rit++) {
rit->pInfo = NULL;
rit->weight = MIN_DOUBLE;
assert(!rit->nexts.empty());
for (LocalVector<pair<size_t, const DictUnit*> >::const_iterator it = rit->nexts.begin(); it != rit->nexts.end(); it++) {
nextPos = it->first;
p = it->second;
val = 0.0;
if (nextPos + 1 < dags.size()) {
val += dags[nextPos + 1].weight;
}
if (p) {
val += p->weight;
} else {
val += dictTrie_->GetMinWeight();
}
if (val > rit->weight) {
rit->pInfo = p;
rit->weight = val;
}
}
}
}
void CutByDag(RuneStrArray::const_iterator begin,
RuneStrArray::const_iterator end,
const vector<Dag>& dags,
vector<WordRange>& words) const {
size_t i = 0;
while (i < dags.size()) {
const DictUnit* p = dags[i].pInfo;
if (p) {
assert(p->word.size() >= 1);
WordRange wr(begin + i, begin + i + p->word.size() - 1);
words.push_back(wr);
i += p->word.size();
} else { //single chinese word
WordRange wr(begin + i, begin + i);
words.push_back(wr);
i++;
}
}
}
const DictTrie* dictTrie_;
bool isNeedDestroy_;
PosTagger tagger_;
}; // class MPSegment
} // namespace cppjieba
#endif

View File

@ -0,0 +1,109 @@
#ifndef CPPJIEBA_MIXSEGMENT_H
#define CPPJIEBA_MIXSEGMENT_H
#include <cassert>
#include "MPSegment.hpp"
#include "HMMSegment.hpp"
#include "limonp/StringUtil.hpp"
#include "PosTagger.hpp"
namespace cppjieba {
class MixSegment: public SegmentTagged {
public:
MixSegment(const string& mpSegDict, const string& hmmSegDict,
const string& userDict = "")
: mpSeg_(mpSegDict, userDict),
hmmSeg_(hmmSegDict) {
}
MixSegment(const DictTrie* dictTrie, const HMMModel* model)
: mpSeg_(dictTrie), hmmSeg_(model) {
}
~MixSegment() {
}
void Cut(const string& sentence, vector<string>& words) const {
Cut(sentence, words, true);
}
void Cut(const string& sentence, vector<string>& words, bool hmm) const {
vector<Word> tmp;
Cut(sentence, tmp, hmm);
GetStringsFromWords(tmp, words);
}
void Cut(const string& sentence, vector<Word>& words, bool hmm = true) const {
PreFilter pre_filter(symbols_, sentence);
PreFilter::Range range;
vector<WordRange> wrs;
wrs.reserve(sentence.size() / 2);
while (pre_filter.HasNext()) {
range = pre_filter.Next();
Cut(range.begin, range.end, wrs, hmm);
}
words.clear();
words.reserve(wrs.size());
GetWordsFromWordRanges(sentence, wrs, words);
}
void Cut(RuneStrArray::const_iterator begin, RuneStrArray::const_iterator end, vector<WordRange>& res, bool hmm) const {
if (!hmm) {
mpSeg_.Cut(begin, end, res);
return;
}
vector<WordRange> words;
assert(end >= begin);
words.reserve(end - begin);
mpSeg_.Cut(begin, end, words);
vector<WordRange> hmmRes;
hmmRes.reserve(end - begin);
for (size_t i = 0; i < words.size(); i++) {
// if MP produced a multi-rune word, or a single rune that is in the user dictionary, keep it as-is
if (words[i].left != words[i].right || (words[i].left == words[i].right && mpSeg_.IsUserDictSingleChineseWord(words[i].left->rune))) {
res.push_back(words[i]);
continue;
}
// if MP produced a single rune that is not in the user dictionary, collect the whole run of such runes
size_t j = i;
while (j < words.size() && words[j].left == words[j].right && !mpSeg_.IsUserDictSingleChineseWord(words[j].left->rune)) {
j++;
}
// Cut the sequence with hmm
assert(j - 1 >= i);
// TODO
hmmSeg_.Cut(words[i].left, words[j - 1].left + 1, hmmRes);
// append the HMM result to the output
for (size_t k = 0; k < hmmRes.size(); k++) {
res.push_back(hmmRes[k]);
}
// clear the temporary buffer for the next run
hmmRes.clear();
// let i skip over the run just handled
i = j - 1;
}
}
const DictTrie* GetDictTrie() const {
return mpSeg_.GetDictTrie();
}
bool Tag(const string& src, vector<pair<string, string> >& res) const {
return tagger_.Tag(src, res, *this);
}
string LookupTag(const string &str) const {
return tagger_.LookupTag(str, *this);
}
private:
MPSegment mpSeg_;
HMMSegment hmmSeg_;
PosTagger tagger_;
}; // class MixSegment
} // namespace cppjieba
#endif
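A minimal usage sketch for MixSegment (illustrative, not from this changeset), grounded in the constructor and Cut overloads above; the dictionary paths are assumptions:

#include "cppjieba/MixSegment.hpp"
#include <iostream>
#include <string>
#include <vector>

int main() {
  cppjieba::MixSegment seg("dict/jieba.dict.utf8", "dict/hmm_model.utf8"); // paths assumed
  std::vector<std::string> words;
  seg.Cut("他来到了网易杭研大厦", words);        // MP first, then HMM over runs of leftover single runes
  seg.Cut("他来到了网易杭研大厦", words, false); // MP only; words is overwritten
  for (size_t i = 0; i < words.size(); i++) {
    std::cout << words[i] << "/";
  }
  std::cout << std::endl;
  return 0;
}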

View File

@ -0,0 +1,77 @@
#ifndef CPPJIEBA_POS_TAGGING_H
#define CPPJIEBA_POS_TAGGING_H
#include "limonp/StringUtil.hpp"
#include "SegmentTagged.hpp"
#include "DictTrie.hpp"
namespace cppjieba {
using namespace limonp;
static const char* const POS_M = "m";
static const char* const POS_ENG = "eng";
static const char* const POS_X = "x";
class PosTagger {
public:
PosTagger() {
}
~PosTagger() {
}
bool Tag(const string& src, vector<pair<string, string> >& res, const SegmentTagged& segment) const {
vector<string> CutRes;
segment.Cut(src, CutRes);
for (vector<string>::iterator itr = CutRes.begin(); itr != CutRes.end(); ++itr) {
res.push_back(make_pair(*itr, LookupTag(*itr, segment)));
}
return !res.empty();
}
string LookupTag(const string &str, const SegmentTagged& segment) const {
const DictUnit *tmp = NULL;
RuneStrArray runes;
const DictTrie * dict = segment.GetDictTrie();
assert(dict != NULL);
if (!DecodeUTF8RunesInString(str, runes)) {
XLOG(ERROR) << "UTF-8 decode failed for word: " << str;
return POS_X;
}
tmp = dict->Find(runes.begin(), runes.end());
if (tmp == NULL || tmp->tag.empty()) {
return SpecialRule(runes);
} else {
return tmp->tag;
}
}
private:
const char* SpecialRule(const RuneStrArray& unicode) const {
size_t m = 0;
size_t eng = 0;
for (size_t i = 0; i < unicode.size() && eng < unicode.size() / 2; i++) {
if (unicode[i].rune < 0x80) {
eng ++;
if ('0' <= unicode[i].rune && unicode[i].rune <= '9') {
m++;
}
}
}
// no ASCII characters were found
if (eng == 0) {
return POS_X;
}
// all ASCII characters are digits
if (m == eng) {
return POS_M;
}
// the ASCII characters include English letters
return POS_ENG;
}
}; // class PosTagger
} // namespace cppjieba
#endif
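PosTagger is driven through a SegmentTagged implementation rather than used directly. A small sketch via MixSegment::Tag shown earlier (illustrative; dictionary paths are assumptions):

#include "cppjieba/MixSegment.hpp"
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
  cppjieba::MixSegment seg("dict/jieba.dict.utf8", "dict/hmm_model.utf8"); // paths assumed
  std::vector<std::pair<std::string, std::string> > res;
  seg.Tag("我爱北京天安门", res);   // each pair is (word, part-of-speech tag)
  for (size_t i = 0; i < res.size(); i++) {
    std::cout << res[i].first << "/" << res[i].second << " ";
  }
  std::cout << std::endl;
  return 0;
}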

View File

@ -0,0 +1,54 @@
#ifndef CPPJIEBA_PRE_FILTER_H
#define CPPJIEBA_PRE_FILTER_H
#include "Trie.hpp"
#include "limonp/Logging.hpp"
namespace cppjieba {
class PreFilter {
public:
//TODO use WordRange instead of Range
struct Range {
RuneStrArray::const_iterator begin;
RuneStrArray::const_iterator end;
}; // struct Range
PreFilter(const unordered_set<Rune>& symbols,
const string& sentence)
: symbols_(symbols) {
if (!DecodeUTF8RunesInString(sentence, sentence_)) {
XLOG(ERROR) << "UTF-8 decode failed for input sentence";
}
cursor_ = sentence_.begin();
}
~PreFilter() {
}
bool HasNext() const {
return cursor_ != sentence_.end();
}
Range Next() {
Range range;
range.begin = cursor_;
while (cursor_ != sentence_.end()) {
if (IsIn(symbols_, cursor_->rune)) {
if (range.begin == cursor_) {
cursor_ ++;
}
range.end = cursor_;
return range;
}
cursor_ ++;
}
range.end = sentence_.end();
return range;
}
private:
RuneStrArray::const_iterator cursor_;
RuneStrArray sentence_;
const unordered_set<Rune>& symbols_;
}; // class PreFilter
} // namespace cppjieba
#endif // CPPJIEBA_PRE_FILTER_H
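A sketch of how the segmenters above drive PreFilter (illustrative, not from this changeset); it assumes a C++11 build where unordered_set resolves to std::unordered_set:

#include "cppjieba/PreFilter.hpp"
#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set<cppjieba::Rune> symbols;
  cppjieba::RuneStrArray seps;
  cppjieba::DecodeUTF8RunesInString("，。", seps); // treat full-width comma and full stop as separators
  for (size_t i = 0; i < seps.size(); i++) {
    symbols.insert(seps[i].rune);
  }
  cppjieba::PreFilter filter(symbols, "你好，世界。");
  while (filter.HasNext()) {
    cppjieba::PreFilter::Range range = filter.Next();
    // each separator comes back as its own single-rune range; other ranges are the text between them
    std::cout << "range of " << (range.end - range.begin) << " runes" << std::endl;
  }
  return 0;
}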

View File

@ -0,0 +1,89 @@
#ifndef CPPJIEBA_QUERYSEGMENT_H
#define CPPJIEBA_QUERYSEGMENT_H
#include <algorithm>
#include <set>
#include <cassert>
#include "limonp/Logging.hpp"
#include "DictTrie.hpp"
#include "SegmentBase.hpp"
#include "FullSegment.hpp"
#include "MixSegment.hpp"
#include "Unicode.hpp"
namespace cppjieba {
class QuerySegment: public SegmentBase {
public:
QuerySegment(const string& dict, const string& model, const string& userDict = "")
: mixSeg_(dict, model, userDict),
trie_(mixSeg_.GetDictTrie()) {
}
QuerySegment(const DictTrie* dictTrie, const HMMModel* model)
: mixSeg_(dictTrie, model), trie_(dictTrie) {
}
~QuerySegment() {
}
void Cut(const string& sentence, vector<string>& words) const {
Cut(sentence, words, true);
}
void Cut(const string& sentence, vector<string>& words, bool hmm) const {
vector<Word> tmp;
Cut(sentence, tmp, hmm);
GetStringsFromWords(tmp, words);
}
void Cut(const string& sentence, vector<Word>& words, bool hmm = true) const {
PreFilter pre_filter(symbols_, sentence);
PreFilter::Range range;
vector<WordRange> wrs;
wrs.reserve(sentence.size()/2);
while (pre_filter.HasNext()) {
range = pre_filter.Next();
Cut(range.begin, range.end, wrs, hmm);
}
words.clear();
words.reserve(wrs.size());
GetWordsFromWordRanges(sentence, wrs, words);
}
void Cut(RuneStrArray::const_iterator begin, RuneStrArray::const_iterator end, vector<WordRange>& res, bool hmm) const {
// run the Mix segmentation first
vector<WordRange> mixRes;
mixSeg_.Cut(begin, end, mixRes, hmm);
vector<WordRange> fullRes;
for (vector<WordRange>::const_iterator mixResItr = mixRes.begin(); mixResItr != mixRes.end(); mixResItr++) {
if (mixResItr->Length() > 2) {
for (size_t i = 0; i + 1 < mixResItr->Length(); i++) {
WordRange wr(mixResItr->left + i, mixResItr->left + i + 1);
if (trie_->Find(wr.left, wr.right + 1) != NULL) {
res.push_back(wr);
}
}
}
if (mixResItr->Length() > 3) {
for (size_t i = 0; i + 2 < mixResItr->Length(); i++) {
WordRange wr(mixResItr->left + i, mixResItr->left + i + 2);
if (trie_->Find(wr.left, wr.right + 1) != NULL) {
res.push_back(wr);
}
}
}
res.push_back(*mixResItr);
}
}
private:
bool IsAllAscii(const Unicode& s) const {
for(size_t i = 0; i < s.size(); i++) {
if (s[i] >= 0x80) {
return false;
}
}
return true;
}
MixSegment mixSeg_;
const DictTrie* trie_;
}; // QuerySegment
} // namespace cppjieba
#endif
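A minimal usage sketch for QuerySegment (illustrative, not from this changeset); the constructor and Cut are the ones shown above, the dictionary paths are assumptions:

#include "cppjieba/QuerySegment.hpp"
#include <iostream>
#include <string>
#include <vector>

int main() {
  cppjieba::QuerySegment seg("dict/jieba.dict.utf8", "dict/hmm_model.utf8"); // paths assumed
  std::vector<std::string> words;
  // besides each Mix result, 2- and 3-rune sub-words found in the dictionary are emitted too,
  // which suits search-engine style indexing
  seg.Cut("中国科学技术大学", words);
  for (size_t i = 0; i < words.size(); i++) {
    std::cout << words[i] << "/";
  }
  std::cout << std::endl;
  return 0;
}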

View File

@ -0,0 +1,46 @@
#ifndef CPPJIEBA_SEGMENTBASE_H
#define CPPJIEBA_SEGMENTBASE_H
#include "limonp/Logging.hpp"
#include "PreFilter.hpp"
#include <cassert>
namespace cppjieba {
const char* const SPECIAL_SEPARATORS = " \t\n\xEF\xBC\x8C\xE3\x80\x82";
using namespace limonp;
class SegmentBase {
public:
SegmentBase() {
XCHECK(ResetSeparators(SPECIAL_SEPARATORS));
}
virtual ~SegmentBase() {
}
virtual void Cut(const string& sentence, vector<string>& words) const = 0;
bool ResetSeparators(const string& s) {
symbols_.clear();
RuneStrArray runes;
if (!DecodeUTF8RunesInString(s, runes)) {
XLOG(ERROR) << "UTF-8 decode failed for separators: " << s;
return false;
}
for (size_t i = 0; i < runes.size(); i++) {
if (!symbols_.insert(runes[i].rune).second) {
XLOG(ERROR) << s.substr(runes[i].offset, runes[i].len) << " already exists";
return false;
}
}
return true;
}
protected:
unordered_set<Rune> symbols_;
}; // class SegmentBase
} // cppjieba
#endif
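ResetSeparators replaces the default symbol set (space, tab, newline, the full-width comma and the full-width full stop encoded in SPECIAL_SEPARATORS). A small sketch, assuming a MixSegment built as in the earlier examples:

#include "cppjieba/MixSegment.hpp"
#include <string>
#include <vector>

int main() {
  cppjieba::MixSegment seg("dict/jieba.dict.utf8", "dict/hmm_model.utf8"); // paths assumed
  // keep only '\n' as a hard separator, so full-width punctuation reaches the segmenter
  bool ok = seg.ResetSeparators("\n");
  std::vector<std::string> words;
  if (ok) {
    seg.Cut("你好，世界", words); // "，" is now segmented instead of acting as a boundary
  }
  return ok ? 0 : 1;
}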

View File

@ -0,0 +1,23 @@
#ifndef CPPJIEBA_SEGMENTTAGGED_H
#define CPPJIEBA_SEGMENTTAGGED_H
#include "SegmentBase.hpp"
namespace cppjieba {
class SegmentTagged : public SegmentBase{
public:
SegmentTagged() {
}
virtual ~SegmentTagged() {
}
virtual bool Tag(const string& src, vector<pair<string, string> >& res) const = 0;
virtual const DictTrie* GetDictTrie() const = 0;
}; // class SegmentTagged
} // cppjieba
#endif

View File

@ -0,0 +1,190 @@
#ifndef CPPJIEBA_TEXTRANK_EXTRACTOR_H
#define CPPJIEBA_TEXTRANK_EXTRACTOR_H
#include <cmath>
#include "Jieba.hpp"
namespace cppjieba {
using namespace limonp;
using namespace std;
class TextRankExtractor {
public:
typedef struct _Word { string word; vector<size_t> offsets; double weight; } Word; // struct Word
private:
typedef std::map<string,Word> WordMap;
class WordGraph{
private:
typedef double Score;
typedef string Node;
typedef std::set<Node> NodeSet;
typedef std::map<Node,double> Edges;
typedef std::map<Node,Edges> Graph;
//typedef std::unordered_map<Node,double> Edges;
//typedef std::unordered_map<Node,Edges> Graph;
double d;
Graph graph;
NodeSet nodeSet;
public:
WordGraph(): d(0.85) {};
WordGraph(double in_d): d(in_d) {};
void addEdge(Node start, Node end, double weight) {
nodeSet.insert(start);
nodeSet.insert(end);
graph[start][end] += weight;
graph[end][start] += weight;
}
void rank(WordMap &ws,size_t rankTime=10){
WordMap outSum;
Score wsdef, min_rank, max_rank;
if( graph.size() == 0)
return;
wsdef = 1.0 / graph.size();
for(Graph::iterator edges=graph.begin();edges!=graph.end();++edges){
// edges->first is the start node; edge->first is the end node, edge->second the edge weight
ws[edges->first].word=edges->first;
ws[edges->first].weight=wsdef;
outSum[edges->first].weight=0;
for(Edges::iterator edge=edges->second.begin();edge!=edges->second.end();++edge){
outSum[edges->first].weight+=edge->second;
}
}
//sort(nodeSet.begin(),nodeSet.end()); // is sorting needed here? (NodeSet is a std::set, so it is already ordered)
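// The loop below is the TextRank update:
//   WS(Vi) = (1 - d) + d * sum over neighbours Vj of ( w_ji / sum_k w_jk ) * WS(Vj)
// with damping factor d (0.85 by default) and co-occurrence edge weights w.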
for( size_t i=0; i<rankTime; i++ ){
for(NodeSet::iterator node = nodeSet.begin(); node != nodeSet.end(); node++ ){
double s = 0;
for( Edges::iterator edge= graph[*node].begin(); edge != graph[*node].end(); edge++ )
// edge->first is the end node, edge->second is the edge weight
s += edge->second / outSum[edge->first].weight * ws[edge->first].weight;
ws[*node].weight = (1 - d) + d * s;
}
}
min_rank=max_rank=ws.begin()->second.weight;
for(WordMap::iterator i = ws.begin(); i != ws.end(); i ++){
if( i->second.weight < min_rank ){
min_rank = i->second.weight;
}
if( i->second.weight > max_rank ){
max_rank = i->second.weight;
}
}
for(WordMap::iterator i = ws.begin(); i != ws.end(); i ++){
ws[i->first].weight = (i->second.weight - min_rank / 10.0) / (max_rank - min_rank / 10.0);
}
}
};
public:
TextRankExtractor(const string& dictPath,
const string& hmmFilePath,
const string& stopWordPath,
const string& userDict = "")
: segment_(dictPath, hmmFilePath, userDict) {
LoadStopWordDict(stopWordPath);
}
TextRankExtractor(const DictTrie* dictTrie,
const HMMModel* model,
const string& stopWordPath)
: segment_(dictTrie, model) {
LoadStopWordDict(stopWordPath);
}
TextRankExtractor(const Jieba& jieba, const string& stopWordPath) : segment_(jieba.GetDictTrie(), jieba.GetHMMModel()) {
LoadStopWordDict(stopWordPath);
}
~TextRankExtractor() {
}
void Extract(const string& sentence, vector<string>& keywords, size_t topN) const {
vector<Word> topWords;
Extract(sentence, topWords, topN);
for (size_t i = 0; i < topWords.size(); i++) {
keywords.push_back(topWords[i].word);
}
}
void Extract(const string& sentence, vector<pair<string, double> >& keywords, size_t topN) const {
vector<Word> topWords;
Extract(sentence, topWords, topN);
for (size_t i = 0; i < topWords.size(); i++) {
keywords.push_back(pair<string, double>(topWords[i].word, topWords[i].weight));
}
}
void Extract(const string& sentence, vector<Word>& keywords, size_t topN, size_t span=5,size_t rankTime=10) const {
vector<string> words;
segment_.Cut(sentence, words);
TextRankExtractor::WordGraph graph;
WordMap wordmap;
size_t offset = 0;
for(size_t i=0; i < words.size(); i++){
size_t t = offset;
offset += words[i].size();
if (IsSingleWord(words[i]) || stopWords_.find(words[i]) != stopWords_.end()) {
continue;
}
for(size_t j=i+1,skip=0;j<i+span+skip && j<words.size();j++){
if (IsSingleWord(words[j]) || stopWords_.find(words[j]) != stopWords_.end()) {
skip++;
continue;
}
graph.addEdge(words[i],words[j],1);
}
wordmap[words[i]].offsets.push_back(t);
}
if (offset != sentence.size()) {
XLOG(ERROR) << "words illegal";
return;
}
graph.rank(wordmap,rankTime);
keywords.clear();
keywords.reserve(wordmap.size());
for (WordMap::iterator itr = wordmap.begin(); itr != wordmap.end(); ++itr) {
keywords.push_back(itr->second);
}
topN = min(topN, keywords.size());
partial_sort(keywords.begin(), keywords.begin() + topN, keywords.end(), Compare);
keywords.resize(topN);
}
private:
void LoadStopWordDict(const string& filePath) {
ifstream ifs(filePath.c_str());
XCHECK(ifs.is_open()) << "open " << filePath << " failed";
string line ;
while (getline(ifs, line)) {
stopWords_.insert(line);
}
assert(stopWords_.size());
}
static bool Compare(const Word &x,const Word &y){
return x.weight > y.weight;
}
MixSegment segment_;
unordered_set<string> stopWords_;
}; // class TextRankExtractor
inline ostream& operator << (ostream& os, const TextRankExtractor::Word& word) {
return os << "{\"word\": \"" << word.word << "\", \"offset\": " << word.offsets << ", \"weight\": " << word.weight << "}";
}
} // namespace cppjieba
#endif
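A minimal usage sketch for TextRankExtractor (illustrative, not from this changeset); the constructor and Extract overload are the ones above, the dict/ paths are assumptions:

#include "cppjieba/TextRankExtractor.hpp"
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
  cppjieba::TextRankExtractor extractor("dict/jieba.dict.utf8",
                                        "dict/hmm_model.utf8",
                                        "dict/stop_words.utf8"); // paths assumed
  std::vector<std::pair<std::string, double> > keywords;
  extractor.Extract("我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪。", keywords, 5);
  for (size_t i = 0; i < keywords.size(); i++) {
    std::cout << keywords[i].first << ": " << keywords[i].second << std::endl;
  }
  return 0;
}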

200
include/cppjieba/Trie.hpp Normal file
View File

@ -0,0 +1,200 @@
#ifndef CPPJIEBA_TRIE_HPP
#define CPPJIEBA_TRIE_HPP
#include <vector>
#include <queue>
#include "limonp/StdExtension.hpp"
#include "Unicode.hpp"
namespace cppjieba {
using namespace std;
const size_t MAX_WORD_LENGTH = 512;
struct DictUnit {
Unicode word;
double weight;
string tag;
}; // struct DictUnit
// for debugging
// inline ostream & operator << (ostream& os, const DictUnit& unit) {
// string s;
// s << unit.word;
// return os << StringFormat("%s %s %.3lf", s.c_str(), unit.tag.c_str(), unit.weight);
// }
struct Dag {
RuneStr runestr;
// [offset, nexts.first]
limonp::LocalVector<pair<size_t, const DictUnit*> > nexts;
const DictUnit * pInfo;
double weight;
size_t nextPos; // TODO
Dag():runestr(), pInfo(NULL), weight(0.0), nextPos(0) {
}
}; // struct Dag
typedef Rune TrieKey;
class TrieNode {
public :
TrieNode(): next(NULL), ptValue(NULL) {
}
public:
typedef unordered_map<TrieKey, TrieNode*> NextMap;
NextMap *next;
const DictUnit *ptValue;
};
class Trie {
public:
Trie(const vector<Unicode>& keys, const vector<const DictUnit*>& valuePointers)
: root_(new TrieNode) {
CreateTrie(keys, valuePointers);
}
~Trie() {
DeleteNode(root_);
}
const DictUnit* Find(RuneStrArray::const_iterator begin, RuneStrArray::const_iterator end) const {
if (begin == end) {
return NULL;
}
const TrieNode* ptNode = root_;
TrieNode::NextMap::const_iterator citer;
for (RuneStrArray::const_iterator it = begin; it != end; it++) {
if (NULL == ptNode->next) {
return NULL;
}
citer = ptNode->next->find(it->rune);
if (ptNode->next->end() == citer) {
return NULL;
}
ptNode = citer->second;
}
return ptNode->ptValue;
}
void Find(RuneStrArray::const_iterator begin,
RuneStrArray::const_iterator end,
vector<struct Dag>&res,
size_t max_word_len = MAX_WORD_LENGTH) const {
assert(root_ != NULL);
res.resize(end - begin);
const TrieNode *ptNode = NULL;
TrieNode::NextMap::const_iterator citer;
for (size_t i = 0; i < size_t(end - begin); i++) {
res[i].runestr = *(begin + i);
if (root_->next != NULL && root_->next->end() != (citer = root_->next->find(res[i].runestr.rune))) {
ptNode = citer->second;
} else {
ptNode = NULL;
}
if (ptNode != NULL) {
res[i].nexts.push_back(pair<size_t, const DictUnit*>(i, ptNode->ptValue));
} else {
res[i].nexts.push_back(pair<size_t, const DictUnit*>(i, static_cast<const DictUnit*>(NULL)));
}
for (size_t j = i + 1; j < size_t(end - begin) && (j - i + 1) <= max_word_len; j++) {
if (ptNode == NULL || ptNode->next == NULL) {
break;
}
citer = ptNode->next->find((begin + j)->rune);
if (ptNode->next->end() == citer) {
break;
}
ptNode = citer->second;
if (NULL != ptNode->ptValue) {
res[i].nexts.push_back(pair<size_t, const DictUnit*>(j, ptNode->ptValue));
}
}
}
}
void InsertNode(const Unicode& key, const DictUnit* ptValue) {
if (key.begin() == key.end()) {
return;
}
TrieNode::NextMap::const_iterator kmIter;
TrieNode *ptNode = root_;
for (Unicode::const_iterator citer = key.begin(); citer != key.end(); ++citer) {
if (NULL == ptNode->next) {
ptNode->next = new TrieNode::NextMap;
}
kmIter = ptNode->next->find(*citer);
if (ptNode->next->end() == kmIter) {
TrieNode *nextNode = new TrieNode;
ptNode->next->insert(make_pair(*citer, nextNode));
ptNode = nextNode;
} else {
ptNode = kmIter->second;
}
}
assert(ptNode != NULL);
ptNode->ptValue = ptValue;
}
void DeleteNode(const Unicode& key, const DictUnit* ptValue) {
if (key.begin() == key.end()) {
return;
}
// iterator into a node's NextMap
TrieNode::NextMap::const_iterator kmIter;
// walk down from the root
TrieNode *ptNode = root_;
for (Unicode::const_iterator citer = key.begin(); citer != key.end(); ++citer) {
// the current node has no children
if (NULL == ptNode->next) {
return;
}
kmIter = ptNode->next->find(*citer);
// the rune is not in the map, stop searching
if (ptNode->next->end() == kmIter) {
break;
}
// detach the child from the map, then release the whole subtree
// (grab the pointer before erase invalidates the iterator)
TrieNode* childNode = kmIter->second;
ptNode->next->erase(*citer);
DeleteNode(childNode);
break;
}
return;
}
private:
void CreateTrie(const vector<Unicode>& keys, const vector<const DictUnit*>& valuePointers) {
if (valuePointers.empty() || keys.empty()) {
return;
}
assert(keys.size() == valuePointers.size());
for (size_t i = 0; i < keys.size(); i++) {
InsertNode(keys[i], valuePointers[i]);
}
}
void DeleteNode(TrieNode* node) {
if (NULL == node) {
return;
}
if (NULL != node->next) {
for (TrieNode::NextMap::iterator it = node->next->begin(); it != node->next->end(); ++it) {
DeleteNode(it->second);
}
delete node->next;
}
delete node;
}
TrieNode* root_;
}; // class Trie
} // namespace cppjieba
#endif // CPPJIEBA_TRIE_HPP
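A standalone sketch of the Trie API above (illustrative, not from this changeset), building two hypothetical entries by hand the way DictTrie normally does from the dictionary file:

#include "cppjieba/Trie.hpp"
#include <iostream>
#include <vector>

int main() {
  using namespace cppjieba;
  DictUnit beijing;     // hypothetical entries
  beijing.word = DecodeUTF8RunesInString("北京");
  beijing.weight = -8.0;
  beijing.tag = "ns";
  DictUnit university;
  university.word = DecodeUTF8RunesInString("北京大学");
  university.weight = -9.0;
  university.tag = "nt";

  std::vector<Unicode> keys;
  std::vector<const DictUnit*> values;
  keys.push_back(beijing.word);    values.push_back(&beijing);
  keys.push_back(university.word); values.push_back(&university);
  Trie trie(keys, values);

  RuneStrArray runes;
  DecodeUTF8RunesInString("北京大学", runes);
  std::vector<Dag> dags;
  trie.Find(runes.begin(), runes.end(), dags);
  // dags[0].nexts ends up as {(0, NULL), (1, &beijing), (3, &university)}:
  // the index of the last rune of every dictionary word starting at rune 0,
  // plus the single-rune fallback entry with a NULL DictUnit.
  std::cout << dags[0].nexts.size() << std::endl;
  return 0;
}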

View File

@ -0,0 +1,227 @@
#ifndef CPPJIEBA_UNICODE_H
#define CPPJIEBA_UNICODE_H
#include <stdint.h>
#include <stdlib.h>
#include <string>
#include <vector>
#include <ostream>
#include "limonp/LocalVector.hpp"
namespace cppjieba {
using std::string;
using std::vector;
typedef uint32_t Rune;
struct Word {
string word;
uint32_t offset;
uint32_t unicode_offset;
uint32_t unicode_length;
Word(const string& w, uint32_t o)
: word(w), offset(o) {
}
Word(const string& w, uint32_t o, uint32_t unicode_offset, uint32_t unicode_length)
: word(w), offset(o), unicode_offset(unicode_offset), unicode_length(unicode_length) {
}
}; // struct Word
inline std::ostream& operator << (std::ostream& os, const Word& w) {
return os << "{\"word\": \"" << w.word << "\", \"offset\": " << w.offset << "}";
}
struct RuneStr {
Rune rune;
uint32_t offset;
uint32_t len;
uint32_t unicode_offset;
uint32_t unicode_length;
RuneStr(): rune(0), offset(0), len(0), unicode_offset(0), unicode_length(0) {
}
RuneStr(Rune r, uint32_t o, uint32_t l)
: rune(r), offset(o), len(l), unicode_offset(0), unicode_length(0) {
}
RuneStr(Rune r, uint32_t o, uint32_t l, uint32_t unicode_offset, uint32_t unicode_length)
: rune(r), offset(o), len(l), unicode_offset(unicode_offset), unicode_length(unicode_length) {
}
}; // struct RuneStr
inline std::ostream& operator << (std::ostream& os, const RuneStr& r) {
return os << "{\"rune\": \"" << r.rune << "\", \"offset\": " << r.offset << ", \"len\": " << r.len << "}";
}
typedef limonp::LocalVector<Rune> Unicode;
typedef limonp::LocalVector<struct RuneStr> RuneStrArray;
// [left, right]
struct WordRange {
RuneStrArray::const_iterator left;
RuneStrArray::const_iterator right;
WordRange(RuneStrArray::const_iterator l, RuneStrArray::const_iterator r)
: left(l), right(r) {
}
size_t Length() const {
return right - left + 1;
}
bool IsAllAscii() const {
for (RuneStrArray::const_iterator iter = left; iter <= right; ++iter) {
if (iter->rune >= 0x80) {
return false;
}
}
return true;
}
}; // struct WordRange
struct RuneStrLite {
uint32_t rune;
uint32_t len;
RuneStrLite(): rune(0), len(0) {
}
RuneStrLite(uint32_t r, uint32_t l): rune(r), len(l) {
}
}; // struct RuneStrLite
inline RuneStrLite DecodeUTF8ToRune(const char* str, size_t len) {
RuneStrLite rp(0, 0);
if (str == NULL || len == 0) {
return rp;
}
if (!(str[0] & 0x80)) { // 0xxxxxxx
// 7bit, total 7bit
rp.rune = (uint8_t)(str[0]) & 0x7f;
rp.len = 1;
} else if ((uint8_t)str[0] <= 0xdf && 1 < len) {
// 110xxxxx 10xxxxxx
// 5bit, total 5bit
rp.rune = (uint8_t)(str[0]) & 0x1f;
// 6bit, total 11bit
rp.rune <<= 6;
rp.rune |= (uint8_t)(str[1]) & 0x3f;
rp.len = 2;
} else if ((uint8_t)str[0] <= 0xef && 2 < len) { // 1110xxxx 10xxxxxx 10xxxxxx
// 4bit, total 4bit
rp.rune = (uint8_t)(str[0]) & 0x0f;
// 6bit, total 10bit
rp.rune <<= 6;
rp.rune |= (uint8_t)(str[1]) & 0x3f;
// 6bit, total 16bit
rp.rune <<= 6;
rp.rune |= (uint8_t)(str[2]) & 0x3f;
rp.len = 3;
} else if ((uint8_t)str[0] <= 0xf7 && 3 < len) { // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
// 3bit, total 3bit
rp.rune = (uint8_t)(str[0]) & 0x07;
// 6bit, total 9bit
rp.rune <<= 6;
rp.rune |= (uint8_t)(str[1]) & 0x3f;
// 6bit, total 15bit
rp.rune <<= 6;
rp.rune |= (uint8_t)(str[2]) & 0x3f;
// 6bit, total 21bit
rp.rune <<= 6;
rp.rune |= (uint8_t)(str[3]) & 0x3f;
rp.len = 4;
} else {
rp.rune = 0;
rp.len = 0;
}
return rp;
}
inline bool DecodeUTF8RunesInString(const char* s, size_t len, RuneStrArray& runes) {
runes.clear();
runes.reserve(len / 2);
for (uint32_t i = 0, j = 0; i < len;) {
RuneStrLite rp = DecodeUTF8ToRune(s + i, len - i);
if (rp.len == 0) {
runes.clear();
return false;
}
RuneStr x(rp.rune, i, rp.len, j, 1);
runes.push_back(x);
i += rp.len;
++j;
}
return true;
}
inline bool DecodeUTF8RunesInString(const string& s, RuneStrArray& runes) {
return DecodeUTF8RunesInString(s.c_str(), s.size(), runes);
}
inline bool DecodeUTF8RunesInString(const char* s, size_t len, Unicode& unicode) {
unicode.clear();
RuneStrArray runes;
if (!DecodeUTF8RunesInString(s, len, runes)) {
return false;
}
unicode.reserve(runes.size());
for (size_t i = 0; i < runes.size(); i++) {
unicode.push_back(runes[i].rune);
}
return true;
}
inline bool IsSingleWord(const string& str) {
RuneStrLite rp = DecodeUTF8ToRune(str.c_str(), str.size());
return rp.len == str.size();
}
inline bool DecodeUTF8RunesInString(const string& s, Unicode& unicode) {
return DecodeUTF8RunesInString(s.c_str(), s.size(), unicode);
}
inline Unicode DecodeUTF8RunesInString(const string& s) {
Unicode result;
DecodeUTF8RunesInString(s, result);
return result;
}
// [left, right]
inline Word GetWordFromRunes(const string& s, RuneStrArray::const_iterator left, RuneStrArray::const_iterator right) {
assert(right->offset >= left->offset);
uint32_t len = right->offset - left->offset + right->len;
uint32_t unicode_length = right->unicode_offset - left->unicode_offset + right->unicode_length;
return Word(s.substr(left->offset, len), left->offset, left->unicode_offset, unicode_length);
}
inline string GetStringFromRunes(const string& s, RuneStrArray::const_iterator left, RuneStrArray::const_iterator right) {
assert(right->offset >= left->offset);
uint32_t len = right->offset - left->offset + right->len;
return s.substr(left->offset, len);
}
inline void GetWordsFromWordRanges(const string& s, const vector<WordRange>& wrs, vector<Word>& words) {
for (size_t i = 0; i < wrs.size(); i++) {
words.push_back(GetWordFromRunes(s, wrs[i].left, wrs[i].right));
}
}
inline vector<Word> GetWordsFromWordRanges(const string& s, const vector<WordRange>& wrs) {
vector<Word> result;
GetWordsFromWordRanges(s, wrs, result);
return result;
}
inline void GetStringsFromWords(const vector<Word>& words, vector<string>& strs) {
strs.resize(words.size());
for (size_t i = 0; i < words.size(); ++i) {
strs[i] = words[i].word;
}
}
} // namespace cppjieba
#endif // CPPJIEBA_UNICODE_H
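A small sketch of the decode helpers above (illustrative, not from this changeset):

#include "cppjieba/Unicode.hpp"
#include <iostream>

int main() {
  using namespace cppjieba;
  RuneStrArray runes;
  if (!DecodeUTF8RunesInString("ab中文", runes)) {
    return 1; // invalid UTF-8
  }
  for (size_t i = 0; i < runes.size(); i++) {
    // each RuneStr carries the code point plus its byte offset and length in the original string
    std::cout << runes[i] << std::endl;
  }
  std::cout << IsSingleWord("中") << " " << IsSingleWord("中文") << std::endl; // prints "1 0"
  return 0;
}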

View File

@ -1,2 +0,0 @@
INSTALL(PROGRAMS start.sh stop.sh DESTINATION /etc/init.d/CppJieba)
INSTALL(PROGRAMS cjseg.sh DESTINATION bin)

View File

@ -1,5 +0,0 @@
if [ $# -lt 1 ]; then
echo "usage: $0 <file>"
exit 1
fi
cjsegment --dictpath /usr/share/CppJieba/dicts/jieba.dict.utf8 --modelpath /usr/share/CppJieba/dicts/hmm_model.utf8 $1

View File

@ -1,7 +0,0 @@
#!/bin/sh
cjserver -c /etc/CppJieba/server.conf -k start >> /dev/null 2>&1
if [ $? -ne 0 ]; then
echo "start failed."
exit 1
fi
echo "start ok."

View File

@ -1,7 +0,0 @@
#!/bin/sh
cjserver -c /etc/CppJieba/server.conf -k stop
if [ $? -ne 0 ]; then
echo "stop failed."
exit 1
fi
echo "stop ok."

View File

@ -1,23 +0,0 @@
SET(EXECUTABLE_OUTPUT_PATH ${PROJECT_BINARY_DIR}/bin)
SET(LIBRARY_OUTPUT_PATH ${PROJECT_BINARY_DIR}/lib)
SET(LIBCPPJIEBA_SRC HMMSegment.cpp MixSegment.cpp MPSegment.cpp Trie.cpp)
ADD_LIBRARY(cppjieba STATIC ${LIBCPPJIEBA_SRC})
ADD_EXECUTABLE(cjsegment segment.cpp)
ADD_EXECUTABLE(cjserver server.cpp)
LINK_DIRECTORIES(Husky)
TARGET_LINK_LIBRARIES(cjsegment cppjieba)
TARGET_LINK_LIBRARIES(cjserver cppjieba husky pthread)
SET_TARGET_PROPERTIES(cppjieba PROPERTIES VERSION 1.2 SOVERSION 1)
INSTALL(TARGETS cjsegment RUNTIME DESTINATION bin)
INSTALL(TARGETS cjserver RUNTIME DESTINATION bin)
INSTALL(TARGETS cppjieba ARCHIVE DESTINATION lib/CppJieba)
INSTALL(FILES ChineseFilter.hpp HMMSegment.h MPSegment.h structs.h Trie.h globals.h ISegment.hpp MixSegment.h SegmentBase.hpp TransCode.hpp DESTINATION include/CppJieba)
ADD_SUBDIRECTORY(Husky)
ADD_SUBDIRECTORY(Limonp)

View File

@ -1,107 +0,0 @@
#ifndef CPPJIEBA_CHINESEFILTER_H
#define CPPJIEBA_CHINESEFILTER_H
#include "globals.h"
#include "TransCode.hpp"
namespace CppJieba
{
class ChineseFilter;
class ChFilterIterator
{
public:
const Unicode * ptUnico;
UniConIter begin;
UniConIter end;
CHAR_TYPE charType;
ChFilterIterator& operator++()
{
return *this = _get(end);
}
ChFilterIterator operator++(int)
{
ChFilterIterator res = *this;
*this = _get(end);
return res;
}
bool operator==(const ChFilterIterator& iter)
{
return begin == iter.begin && end == iter.end;
}
bool operator!=(const ChFilterIterator& iter)
{
return !(*this == iter);
}
ChFilterIterator& operator=(const ChFilterIterator& iter)
{
ptUnico = iter.ptUnico;
begin = iter.begin;
end = iter.end;
charType = iter.charType;
return *this;
}
public:
ChFilterIterator(const Unicode * ptu, UniConIter be, UniConIter en, CHAR_TYPE is):ptUnico(ptu), begin(be), end(en), charType(is){};
ChFilterIterator(const Unicode * ptu):ptUnico(ptu){*this = _get(ptUnico->begin());};
private:
ChFilterIterator(){}
private:
CHAR_TYPE _charType(uint16_t x)const
{
if((0x0030 <= x && x<= 0x0039) || (0x0041 <= x && x <= 0x005a ) || (0x0061 <= x && x <= 0x007a))
{
return DIGIT_OR_LETTER;
}
if(x >= 0x4e00 && x <= 0x9fff)
{
return CHWORD;
}
return OTHERS;
}
ChFilterIterator _get(UniConIter iter)
{
UniConIter _begin = iter;
const UniConIter& _end = ptUnico->end();
if(iter == _end)
{
return ChFilterIterator(ptUnico, end, end, OTHERS);
}
CHAR_TYPE charType = _charType(*iter);
iter ++;
while(iter != _end &&charType == _charType(*iter))
{
iter++;
}
return ChFilterIterator(ptUnico, _begin, iter, charType);
}
};
class ChineseFilter
{
private:
Unicode _unico;
public:
typedef ChFilterIterator iterator;
public:
ChineseFilter(){};
~ChineseFilter(){};
public:
bool feed(const string& str)
{
return TransCode::decode(str, _unico);
}
iterator begin()
{
return iterator(&_unico);
}
iterator end()
{
return iterator(&_unico, _unico.end(), _unico.end(), OTHERS);
}
};
}
#endif

View File

@ -1,341 +0,0 @@
#include "HMMSegment.h"
namespace CppJieba
{
HMMSegment::HMMSegment()
{
memset(_startProb, 0, sizeof(_startProb));
memset(_transProb, 0, sizeof(_transProb));
_statMap[0] = 'B';
_statMap[1] = 'E';
_statMap[2] = 'M';
_statMap[3] = 'S';
_emitProbVec.push_back(&_emitProbB);
_emitProbVec.push_back(&_emitProbE);
_emitProbVec.push_back(&_emitProbM);
_emitProbVec.push_back(&_emitProbS);
}
HMMSegment::~HMMSegment()
{
dispose();
}
bool HMMSegment::init(const char* const modelPath)
{
return _setInitFlag(_loadModel(modelPath));
}
bool HMMSegment::dispose()
{
_setInitFlag(false);
return true;
}
bool HMMSegment::_loadModel(const char* const filePath)
{
LogInfo("loadModel [%s] start ...", filePath);
ifstream ifile(filePath);
string line;
vector<string> tmp;
vector<string> tmp2;
//load _startProb
if(!_getLine(ifile, line))
{
return false;
}
splitStr(line, tmp, " ");
if(tmp.size() != STATUS_SUM)
{
LogError("start_p illegal");
return false;
}
for(uint j = 0; j< tmp.size(); j++)
{
_startProb[j] = atof(tmp[j].c_str());
//cout<<_startProb[j]<<endl;
}
//load _transProb
for(uint i = 0; i < STATUS_SUM; i++)
{
if(!_getLine(ifile, line))
{
return false;
}
splitStr(line, tmp, " ");
if(tmp.size() != STATUS_SUM)
{
LogError("trans_p illegal");
return false;
}
for(uint j =0; j < STATUS_SUM; j++)
{
_transProb[i][j] = atof(tmp[j].c_str());
//cout<<_transProb[i][j]<<endl;
}
}
//load _emitProbB
if(!_getLine(ifile, line) || !_loadEmitProb(line, _emitProbB))
{
return false;
}
//load _emitProbE
if(!_getLine(ifile, line) || !_loadEmitProb(line, _emitProbE))
{
return false;
}
//load _emitProbM
if(!_getLine(ifile, line) || !_loadEmitProb(line, _emitProbM))
{
return false;
}
//load _emitProbS
if(!_getLine(ifile, line) || !_loadEmitProb(line, _emitProbS))
{
return false;
}
LogInfo("loadModel [%s] end.", filePath);
return true;
}
bool HMMSegment::cut(Unicode::const_iterator begin, Unicode::const_iterator end, vector<Unicode>& res)const
{
if(!_getInitFlag())
{
LogError("not inited.");
return false;
}
vector<uint> status;
if(!_viterbi(begin, end, status))
{
LogError("_viterbi failed.");
return false;
}
Unicode::const_iterator left = begin;
Unicode::const_iterator right;
for(uint i =0; i< status.size(); i++)
{
if(status[i] % 2) //if(E == status[i] || S == status[i])
{
right = begin + i + 1;
res.push_back(Unicode(left, right));
left = right;
}
}
return true;
}
bool HMMSegment::cut(const string& str, vector<string>& res)const
{
return SegmentBase::cut(str, res);
}
bool HMMSegment::cut(Unicode::const_iterator begin, Unicode::const_iterator end, vector<string>& res) const
{
if(!_getInitFlag())
{
LogError("not inited.");
return false;
}
if(begin == end)
{
return false;
}
vector<Unicode> words;
if(!cut(begin, end, words))
{
return false;
}
string tmp;
for(uint i = 0; i < words.size(); i++)
{
if(TransCode::encode(words[i], tmp))
{
res.push_back(tmp);
}
}
return true;
}
bool HMMSegment::_viterbi(Unicode::const_iterator begin, Unicode::const_iterator end, vector<uint>& status)const
{
if(begin == end)
{
return false;
}
size_t Y = STATUS_SUM;
size_t X = end - begin;
size_t XYSize = X * Y;
int * path;
double * weight;
uint now, old, stat;
double tmp, endE, endS;
try
{
path = new int [XYSize];
weight = new double [XYSize];
}
catch(const std::bad_alloc&)
{
LogError("bad_alloc");
return false;
}
if(NULL == path || NULL == weight)
{
LogError("bad_alloc");
return false;
}
//start
for(uint y = 0; y < Y; y++)
{
weight[0 + y * X] = _startProb[y] + _getEmitProb(_emitProbVec[y], *begin, MIN_DOUBLE);
path[0 + y * X] = -1;
}
//process
//for(; begin != end; begin++)
for(uint x = 1; x < X; x++)
{
for(uint y = 0; y < Y; y++)
{
now = x + y*X;
weight[now] = MIN_DOUBLE;
path[now] = E; // warning
for(uint preY = 0; preY < Y; preY++)
{
old = x - 1 + preY * X;
tmp = weight[old] + _transProb[preY][y] + _getEmitProb(_emitProbVec[y], *(begin+x), MIN_DOUBLE);
if(tmp > weight[now])
{
weight[now] = tmp;
path[now] = preY;
}
}
}
}
endE = weight[X-1+E*X];
endS = weight[X-1+S*X];
stat = 0;
if(endE > endS)
{
stat = E;
}
else
{
stat = S;
}
status.assign(X, 0);
for(int x = X -1 ; x >= 0; x--)
{
status[x] = stat;
stat = path[x + stat*X];
}
delete [] path;
delete [] weight;
return true;
}
bool HMMSegment::_getLine(ifstream& ifile, string& line)
{
while(getline(ifile, line))
{
trim(line);
if(line.empty())
{
continue;
}
if(strStartsWith(line, "#"))
{
continue;
}
return true;
}
return false;
}
bool HMMSegment::_loadEmitProb(const string& line, EmitProbMap& mp)
{
if(line.empty())
{
return false;
}
vector<string> tmp, tmp2;
uint16_t unico = 0;
splitStr(line, tmp, ",");
for(uint i = 0; i < tmp.size(); i++)
{
splitStr(tmp[i], tmp2, ":");
if(2 != tmp2.size())
{
LogError("_emitProb illegal.");
return false;
}
if(!_decodeOne(tmp2[0], unico))
{
LogError("TransCode failed.");
return false;
}
mp[unico] = atof(tmp2[1].c_str());
}
return true;
}
bool HMMSegment::_decodeOne(const string& str, uint16_t& res)
{
Unicode ui16;
if(!TransCode::decode(str, ui16) || ui16.size() != 1)
{
return false;
}
res = ui16[0];
return true;
}
double HMMSegment::_getEmitProb(const EmitProbMap* ptMp, uint16_t key, double defVal)const
{
EmitProbMap::const_iterator cit = ptMp->find(key);
if(cit == ptMp->end())
{
return defVal;
}
return cit->second;
}
}
#ifdef HMMSEGMENT_UT
using namespace CppJieba;
size_t add(size_t a, size_t b)
{
return a*b;
}
int main()
{
TransCode::setUtf8Enc();
HMMSegment hmm;
hmm.loadModel("../dicts/hmm_model.utf8");
vector<string> res;
hmm.cut("小明硕士毕业于北邮网络研究院。。.", res);
cout<<joinStr(res, "/")<<endl;
return 0;
}
#endif

View File

@ -1,59 +0,0 @@
#ifndef CPPJIBEA_HMMSEGMENT_H
#define CPPJIBEA_HMMSEGMENT_H
#include <iostream>
#include <fstream>
#include <memory.h>
#include "Limonp/str_functs.hpp"
#include "Limonp/logger.hpp"
#include "globals.h"
#include "TransCode.hpp"
#include "ISegment.hpp"
#include "SegmentBase.hpp"
namespace CppJieba
{
using namespace Limonp;
class HMMSegment: public SegmentBase
{
public:
/*
* STATUS:
* 0:B, 1:E, 2:M, 3:S
* */
enum {B = 0, E = 1, M = 2, S = 3, STATUS_SUM = 4};
private:
char _statMap[STATUS_SUM];
double _startProb[STATUS_SUM];
double _transProb[STATUS_SUM][STATUS_SUM];
EmitProbMap _emitProbB;
EmitProbMap _emitProbE;
EmitProbMap _emitProbM;
EmitProbMap _emitProbS;
vector<EmitProbMap* > _emitProbVec;
public:
HMMSegment();
virtual ~HMMSegment();
public:
bool init(const char* const modelPath);
bool dispose();
public:
bool cut(Unicode::const_iterator begin, Unicode::const_iterator end, vector<Unicode>& res)const ;
bool cut(const string& str, vector<string>& res)const;
bool cut(Unicode::const_iterator begin, Unicode::const_iterator end, vector<string>& res)const;
//virtual bool cut(const string& str, vector<string>& res)const;
private:
bool _viterbi(Unicode::const_iterator begin, Unicode::const_iterator end, vector<uint>& status)const;
bool _loadModel(const char* const filePath);
bool _getLine(ifstream& ifile, string& line);
bool _loadEmitProb(const string& line, EmitProbMap& mp);
bool _decodeOne(const string& str, uint16_t& res);
double _getEmitProb(const EmitProbMap* ptMp, uint16_t key, double defVal)const ;
};
}
#endif

View File

@ -1,8 +0,0 @@
SET(EXECUTABLE_OUTPUT_PATH ${PROJECT_BINARY_DIR}/bin)
SET(LIBRARY_OUTPUT_PATH ${PROJECT_BINARY_DIR}/lib)
SET(LIBHUSKY_SRC Daemon.cpp ServerFrame.cpp)
ADD_LIBRARY(husky STATIC ${LIBHUSKY_SRC})
INSTALL(TARGETS husky ARCHIVE DESTINATION lib/CppJieba/Husky)
INSTALL(FILES Daemon.h globals.h HttpReqInfo.hpp ServerFrame.h ThreadManager.hpp DESTINATION include/CppJieba/Husky)

View File

@ -1,191 +0,0 @@
#include "Daemon.h"
namespace Husky
{
IWorkHandler * Daemon::m_pHandler;
int Daemon::m_nChildPid = 0;
const char* Daemon::m_pidFile = NULL;
bool Daemon::isAbnormalExit(int pid, int status)
{
bool bRestart = true;
if (WIFEXITED(status)) //exit()or return
{
LogDebug("child normal termination, exit pid = %d, status = %d", pid, WEXITSTATUS(status));
bRestart = false;
}
else if (WIFSIGNALED(status)) // terminated by a signal
{
LogError("abnormal termination, pid = %d, signal number = %d%s", pid, WTERMSIG(status),
#ifdef WCOREDUMP
WCOREDUMP(status) ? " (core file generated)" :
#endif
"");
if (WTERMSIG(status) == SIGKILL)
{
bRestart = false;
LogError("has been killed by user , exit pid = %d, status = %d", pid, WEXITSTATUS(status));
}
}
else if (WIFSTOPPED(status)) // a stopped child process exited
{
LogError("child stopped, pid = %d, signal number = %d", pid, WSTOPSIG(status));
}
else
{
LogError("child other reason quit, pid = %d, signal number = %d", pid, WSTOPSIG(status));
}
return bRestart;
}
bool Daemon::start()
{
string masterPidStr = loadFile2Str(m_pidFile);
int masterPid = atoi(masterPidStr.c_str());
if(masterPid)
{
if (kill(masterPid, 0) == 0)
{
LogError("Another instance exist, ready to quit!");
return false;
}
}
initAsDaemon();
char buf[64];
sprintf(buf, "%d", getpid());
if (!WriteStr2File(m_pidFile,buf ,"w"))
{
LogFatal("Write master pid fail!");
}
while(true)
{
pid_t pid = fork();
if (0 == pid)// child process do
{
signal(SIGUSR1, sigChildHandler);
signal(SIGPIPE, SIG_IGN);
signal(SIGTTOU, SIG_IGN);
signal(SIGTTIN, SIG_IGN);
signal(SIGTERM, SIG_IGN);
signal(SIGINT, SIG_IGN);
signal(SIGQUIT, SIG_IGN);
if(!m_pHandler->init())
{
LogFatal("m_pHandler init failed!");
return false;
}
#ifdef DEBUG
LogDebug("Worker init ok pid = %d",(int)getpid());
#endif
if (!m_pHandler->run())
{
LogError("m_pHandler run finish with failure!");
return false;
}
#ifdef DEBUG
LogDebug("run finish -ok!");
#endif
//if(!m_pHandler->dispose())
//{
// LogError("m_pHandler dispose with failure!");
// return false;
//}
#ifdef DEBUG
//LogDebug("Worker dispose -ok!");
#endif
exit(0);
}
m_nChildPid=pid;
int status;
pid = wait(&status);
if (!isAbnormalExit(pid, status))
{
LogDebug("child exit normally! and Daemon exit");
break;
}
}
return true;
}
bool Daemon::stop()
{
string masterPidStr = loadFile2Str(m_pidFile);
int masterPid = atoi(masterPidStr.c_str());
if(masterPid)
{
#ifdef DEBUG
LogDebug("read last masterPid[%d]",masterPid);
#endif
if (kill(masterPid, 0) == 0)
{
#ifdef DEBUG
LogDebug("find previous daemon pid= %d, current pid= %d", masterPid, getpid());
#endif
kill(masterPid, SIGTERM);
int tryTime = 200;
while (kill(masterPid, 0) == 0 && --tryTime)
{
sleep(1);
}
if (!tryTime && kill(masterPid, 0) == 0)
{
LogError("Time out shutdown fail!");
return false;
}
LogInfo("previous daemon pid[%d] shutdown ok.", masterPid);
return true;
}
}
LogError("Another instance doesn't exist, ready to quit!");
return false;
}
void Daemon::initAsDaemon()
{
if (fork() > 0)
exit(0);
setsid();
signal(SIGPIPE, SIG_IGN);
signal(SIGTTOU, SIG_IGN);
signal(SIGTTIN, SIG_IGN);
signal(SIGTERM, sigMasterHandler);
signal(SIGINT, sigMasterHandler);
signal(SIGQUIT, sigMasterHandler);
signal(SIGKILL, sigMasterHandler);
}
void Daemon::sigMasterHandler(int sig)
{
kill(m_nChildPid,SIGUSR1);
LogDebug("master = %d sig child =%d!",getpid(),m_nChildPid);
}
void Daemon::sigChildHandler(int sig)
{
if (sig == SIGUSR1)
{
m_pHandler->dispose();
LogDebug("master = %d signal accept current pid =%d!",getppid(),getpid());
}
}
}

View File

@ -1,51 +0,0 @@
#ifndef HUSKY_DAEMON_H_
#define HUSKY_DAEMON_H_
#include "globals.h"
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <signal.h>
#include "../Limonp/logger.hpp"
namespace Husky
{
using namespace Limonp;
class IWorkHandler
{
public:
virtual ~IWorkHandler(){}
virtual bool init() = 0;
virtual bool dispose() = 0;
virtual bool run() = 0;
};
class Daemon
{
public:
Daemon(IWorkHandler * workHandler, const char* pidFile)
{
m_pHandler = workHandler;
m_pidFile = pidFile;
}
~Daemon(){};
public:
bool start();
bool stop();
public:
static void initAsDaemon();
static void sigMasterHandler(int sig);
static void sigChildHandler(int sig);
static bool isAbnormalExit(int pid, int status);
private:
//static IRequestHandler* m_pHandler;
//static ServerFrame m_ServerFrame;
static int m_nChildPid;
static IWorkHandler * m_pHandler;
static const char* m_pidFile;
};
}
#endif

View File

@ -1,226 +0,0 @@
#ifndef HUSKY_HTTP_REQINFO_H
#define HUSKY_HTTP_REQINFO_H
#include <iostream>
#include <string>
#include "../Limonp/logger.hpp"
#include "../Limonp/str_functs.hpp"
#include "globals.h"
namespace Husky
{
using namespace Limonp;
using namespace std;
static const char* const KEY_METHOD = "METHOD";
static const char* const KEY_PATH = "PATH";
static const char* const KEY_PROTOCOL = "PROTOCOL";
typedef unsigned char BYTE;
inline BYTE toHex(BYTE x)
{
return x > 9 ? x -10 + 'A': x + '0';
}
inline BYTE fromHex(BYTE x)
{
return isdigit(x) ? x-'0' : x-'A'+10;
}
inline void URLEncode(const string &sIn, string& sOut)
{
for( size_t ix = 0; ix < sIn.size(); ix++ )
{
BYTE buf[4];
memset( buf, 0, 4 );
if( isalnum( (BYTE)sIn[ix] ) )
{
buf[0] = sIn[ix];
}
//else if ( isspace( (BYTE)sIn[ix] ) ) // it seems a space can be encoded as either %20 or '+'
//{
// buf[0] = '+';
//}
else
{
buf[0] = '%';
buf[1] = toHex( (BYTE)sIn[ix] >> 4 );
buf[2] = toHex( (BYTE)sIn[ix] % 16);
}
sOut += (char *)buf;
}
};
inline void URLDecode(const string &sIn, string& sOut)
{
for( size_t ix = 0; ix < sIn.size(); ix++ )
{
BYTE ch = 0;
if(sIn[ix]=='%')
{
ch = (fromHex(sIn[ix+1])<<4);
ch |= fromHex(sIn[ix+2]);
ix += 2;
}
else if(sIn[ix] == '+')
{
ch = ' ';
}
else
{
ch = sIn[ix];
}
sOut += (char)ch;
}
}
class HttpReqInfo
{
public:
bool load(const string& headerStr)
{
size_t lpos = 0, rpos = 0;
vector<string> buf;
rpos = headerStr.find("\n", lpos);
if(string::npos == rpos)
{
LogFatal("headerStr illegal.");
return false;
}
string firstline(headerStr, lpos, rpos - lpos);
trim(firstline);
if(!splitStr(firstline, buf, " ") || 3 != buf.size())
{
LogFatal("parse header first line failed.");
return false;
}
_headerMap[KEY_METHOD] = trim(buf[0]);
_headerMap[KEY_PATH] = trim(buf[1]);
_headerMap[KEY_PROTOCOL] = trim(buf[2]);
//first request line end
//parse path to _methodGetMap
if("GET" == _headerMap[KEY_METHOD])
{
_parseUrl(firstline, _methodGetMap);
}
lpos = rpos + 1;
if(lpos >= headerStr.size())
{
LogFatal("headerStr illegal");
return false;
}
//message header begin
while(lpos < headerStr.size() && string::npos != (rpos = headerStr.find('\n', lpos)) && rpos > lpos)
{
string s(headerStr, lpos, rpos - lpos);
size_t p = s.find(':');
if(string::npos == p)
{
break;//encounter empty line
}
string k(s, 0, p);
string v(s, p+1);
trim(k);
trim(v);
if(k.empty()||v.empty())
{
LogFatal("headerStr illegal.");
return false;
}
upper(k);
_headerMap[k] = v;
lpos = rpos + 1;
}
//message header end
//body begin
return true;
}
public:
string& operator[] (const string& key)
{
return _headerMap[key];
}
bool find(const string& key, string& res)const
{
return _find(_headerMap, key, res);
}
bool GET(const string& argKey, string& res)const
{
return _find(_methodGetMap, argKey, res);
}
bool POST(const string& argKey, string& res)const
{
return _find(_methodPostMap, argKey, res);
}
private:
HashMap<string, string> _headerMap;
HashMap<string, string> _methodGetMap;
HashMap<string, string> _methodPostMap;
//public:
friend ostream& operator<<(ostream& os, const HttpReqInfo& obj);
private:
bool _find(const HashMap<string, string>& mp, const string& key, string& res)const
{
HashMap<string, string>::const_iterator it = mp.find(key);
if(it == mp.end())
{
return false;
}
res = it->second;
return true;
}
private:
bool _parseUrl(const string& url, HashMap<string, string>& mp)
{
if(url.empty())
{
return false;
}
uint pos = url.find('?');
if(string::npos == pos)
{
return false;
}
uint kleft = 0, kright = 0;
uint vleft = 0, vright = 0;
for(uint i = pos + 1; i < url.size();)
{
kleft = i;
while(i < url.size() && url[i] != '=')
{
i++;
}
if(i >= url.size())
{
break;
}
kright = i;
i++;
vleft = i;
while(i < url.size() && url[i] != '&' && url[i] != ' ')
{
i++;
}
vright = i;
mp[url.substr(kleft, kright - kleft)] = url.substr(vleft, vright - vleft);
i++;
}
return true;
}
};
inline std::ostream& operator << (std::ostream& os, const Husky::HttpReqInfo& obj)
{
return os << obj._headerMap << obj._methodGetMap << obj._methodPostMap;
}
}
#endif

View File

@ -1,231 +0,0 @@
#include "ServerFrame.h"
namespace Husky
{
const struct timeval ServerFrame::m_timev = {SOCKET_TIMEOUT, 0};
pthread_mutex_t ServerFrame::m_pmAccept;
bool ServerFrame::m_bShutdown = false;
bool ServerFrame::dispose()
{
m_bShutdown=true;
if (SOCKET_ERROR==closesocket(m_lsnSock))
{
LogError("error [%s]", strerror(errno));
return false;
}
int sockfd;
struct sockaddr_in dest;
if ((sockfd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
{
LogError("error [%s]", strerror(errno));
return false;
}
bzero(&dest, sizeof(dest));
dest.sin_family = AF_INET;
dest.sin_port = htons(m_nLsnPort);
if (inet_aton("127.0.0.1", (struct in_addr *) &dest.sin_addr.s_addr) == 0)
{
LogError("error [%s]", strerror(errno));
return false;
}
if (connect(sockfd, (struct sockaddr *) &dest, sizeof(dest)) < 0)
{
LogError("error [%s]", strerror(errno));
}
close(sockfd);
if(!m_pHandler->dispose())
{
LogFatal("m_pHandler dispose failed.");
}
return true;
}
bool ServerFrame::run()
{
if(SOCKET_ERROR==listen(m_lsnSock,LISEN_QUEUR_LEN))
{
LogError("error [%s]", strerror(errno));
return false;
}
ThreadManager thrMngr;
int i;
SPara para;
para.hSock=m_lsnSock;
para.pHandler=m_pHandler;
for (i=0;i<m_nThreadCount;i++)
{
if (0!=thrMngr.CreateThread(ServerThread, &para))
{
break;
}
}
LogDebug("expect thread count %d, real count %d",m_nThreadCount,i);
if(i==0)
{
LogError("error [%s]", strerror(errno));
return false;
}
LogInfo("server start to run.........");
if (thrMngr.WaitMultipleThread()!=0)
{
return false;
}
return true;
}
void* ServerFrame::ServerThread(void *lpParameter )
{
SPara *pPara=(SPara*)lpParameter;
SOCKET hSockLsn=pPara->hSock;
IRequestHandler *pHandler=pPara->pHandler;
int nRetCode;
linger lng;
char chRecvBuf[RECV_BUFFER];
SOCKET hClientSock;
string strHttpResp;
sockaddr_in clientaddr;
socklen_t nSize = sizeof(clientaddr);
while(!m_bShutdown)
{
HttpReqInfo httpReq;
pthread_mutex_lock(&m_pmAccept);
hClientSock=accept(hSockLsn,(sockaddr *)&clientaddr, &nSize);
pthread_mutex_unlock(&m_pmAccept);
if(hClientSock==SOCKET_ERROR)
{
if(!m_bShutdown)
LogError("error [%s]", strerror(errno));
continue;
}
httpReq[CLIENT_IP_K] = inet_ntoa(clientaddr.sin_addr);// inet_ntoa is not thread safety at some version
lng.l_linger=1;
lng.l_onoff=1;
if(SOCKET_ERROR==setsockopt(hClientSock,SOL_SOCKET,SO_LINGER,(char*)&lng,sizeof(lng)))
{
LogError("error [%s]", strerror(errno));
}
if(SOCKET_ERROR==setsockopt(hClientSock,SOL_SOCKET,SO_RCVTIMEO,(char*)&m_timev,sizeof(m_timev)))
{
LogError("error [%s]", strerror(errno));
}
if(SOCKET_ERROR==setsockopt(hClientSock,SOL_SOCKET,SO_SNDTIMEO,(char*)&m_timev,sizeof(m_timev)))
{
LogError("error [%s]", strerror(errno));
}
string strRec;
string strSnd;
memset(chRecvBuf,0,sizeof(chRecvBuf));
nRetCode = recv(hClientSock, chRecvBuf, RECV_BUFFER, 0);
strRec = chRecvBuf;
#ifdef HUKSY_DEBUG
LogDebug("request[%s]", strRec.c_str());
#endif
if(SOCKET_ERROR==nRetCode)
{
LogDebug("error [%s]", strerror(errno));
closesocket(hClientSock);
continue;
}
if(0==nRetCode)
{
LogDebug("connection has been gracefully closed");
closesocket(hClientSock);
continue;
}
httpReq.load(strRec);
pHandler->do_GET(httpReq, strSnd);
char chHttpHeader[2048];
sprintf(chHttpHeader, RESPONSE_FORMAT, RESPONSE_CHARSET_UTF8, int(strSnd.length()));
strHttpResp=chHttpHeader;
strHttpResp+=strSnd;
#ifdef HUKSY_DEBUG
LogDebug("response'body [%s]", strSnd.c_str());
#endif
if (SOCKET_ERROR==send(hClientSock,strHttpResp.c_str(),strHttpResp.length(),0))
{
LogError("error [%s]", strerror(errno));
}
closesocket(hClientSock);
}
return 0;
}
bool ServerFrame::init()
{
if (!BindToLocalHost(m_lsnSock,m_nLsnPort))
{
LogFatal("BindToLocalHost failed.");
return false;
}
LogInfo("init ok {port:%d, threadNum:%d}", m_nLsnPort, m_nThreadCount);
if(!m_pHandler->init())
{
LogFatal("m_pHandler init failed.");
return false;
}
return true;
}
bool ServerFrame::BindToLocalHost(SOCKET &sock,u_short nPort)
{
sock=socket(AF_INET,SOCK_STREAM,0);
if(INVALID_SOCKET==sock)
{
LogError("error [%s]", strerror(errno));
return false;
}
int nRet = 1;
if(SOCKET_ERROR==setsockopt(m_lsnSock, SOL_SOCKET, SO_REUSEADDR, (char*)&nRet, sizeof(nRet)))
{
LogError("error [%s]", strerror(errno));
}
struct sockaddr_in addrSock;
addrSock.sin_family=AF_INET;
addrSock.sin_port=htons(nPort);
addrSock.sin_addr.s_addr=htonl(INADDR_ANY);
int retval;
retval = ::bind(sock,(sockaddr*)&addrSock,sizeof(sockaddr));
if(SOCKET_ERROR==retval)
{
LogError("error [%s]", strerror(errno));
closesocket(sock);
return false;
}
return true;
}
}

View File

@ -1,85 +0,0 @@
#ifndef HUSKY_SERVERFRAME_H
#define HUSKY_SERVERFRAME_H
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <stdlib.h>
#include <pthread.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <vector>
#include "globals.h"
#include "ThreadManager.hpp"
#include "HttpReqInfo.hpp"
#include "Daemon.h"
#define INVALID_SOCKET -1
#define SOCKET_ERROR -1
#define closesocket close
#define RECV_BUFFER 10240
#define LISEN_QUEUR_LEN 1024
namespace Husky
{
using namespace Limonp;
typedef int SOCKET;
class IRequestHandler
{
public:
virtual ~IRequestHandler(){};
public:
virtual bool init() = 0;
virtual bool dispose() = 0;
virtual bool do_GET(const HttpReqInfo& httpReq, string& res) = 0;
};
struct SPara
{
SOCKET hSock;
IRequestHandler * pHandler;
};
class ServerFrame: public IWorkHandler
{
public:
ServerFrame(unsigned nPort, unsigned nThreadCount, IRequestHandler* pHandler)
{
m_nLsnPort = nPort;
m_nThreadCount = nThreadCount;
m_pHandler = pHandler;
pthread_mutex_init(&m_pmAccept,NULL);
};
virtual ~ServerFrame(){pthread_mutex_destroy(&m_pmAccept);};
virtual bool init();
virtual bool dispose();
virtual bool run();
protected:
bool BindToLocalHost(SOCKET &sock,u_short nPort);
static void * ServerThread(void * lpParameter );
private:
u_short m_nLsnPort;
u_short m_nThreadCount;
SOCKET m_lsnSock;
IRequestHandler *m_pHandler;
static bool m_bShutdown;
static pthread_mutex_t m_pmAccept;
static const struct timeval m_timev;
};
}
#endif

View File

@ -1,98 +0,0 @@
#ifndef HUSKY_THREAD_MANAGER_H
#define HUSKY_THREAD_MANAGER_H
#include <pthread.h>
#include <algorithm>
#include <vector>
#include <map>
#define INFINITE 0
namespace Husky
{
using namespace std;
class ThreadManager
{
private:
typedef int HANDLE;
typedef int DWORD;
typedef void *(* PThreadFunc)(void* param);
public:
ThreadManager(){;}
~ThreadManager(){}
unsigned int HandleCount(){return m_vecHandle.size();}
void clear()
{
m_vecHandle.clear();
}
HANDLE CreateThread( PThreadFunc pFunc,void *pPara)
{
pthread_t pt;
int nErrorCode=pthread_create(&pt,NULL,pFunc,pPara);
if(nErrorCode!=0)
return nErrorCode;
m_vecHandle.push_back(pt); // add to the thread list in preparation for WaitMultipleThread
return nErrorCode;
}
//hThread (thread handle): when 0, defaults to the handle of the last thread added to the manager
//dwMilliseconds (wait time): in milliseconds; defaults to waiting indefinitely
//return value: -1 if the handle is invalid; otherwise the return value of pthread_join
DWORD Wait(HANDLE hThread=0,DWORD dwMilliseconds=INFINITE )
{
if( hThread==0)// the last thread added
{
if(!m_vecHandle.empty())
{
return pthread_join(m_vecHandle.back(),NULL);
}
else
return -1;
}
else
{
if (find(m_vecHandle.begin(),m_vecHandle.end(),hThread)==m_vecHandle.end())// no such handle in the manager
{
return -1;
}
return pthread_join(hThread, NULL);
}
}
// wait for all threads to finish
//bWaitAll (wait for all threads): defaults to 1 to wait for all threads; 0 returns as soon as any thread finishes
//dwMilliseconds: in milliseconds; defaults to waiting indefinitely
//return value: -1 if there are no handles; otherwise the return value of pthread_join
DWORD WaitMultipleThread( bool bWaitAll=1,DWORD dwMilliseconds=INFINITE)
{
if (m_vecHandle.empty())
return -1;
int nErrorcode;
for (uint i=0;i<m_vecHandle.size();++i)
{
nErrorcode=pthread_join(m_vecHandle[i], NULL);
if (nErrorcode!=0)
return nErrorcode;
}
return 0;
}
private:
vector<pthread_t> m_vecHandle;
private:
ThreadManager(const ThreadManager&){;}// copy forbidden
void operator=(const ThreadManager &){}// copy forbidden
};
}
#endif

View File

@ -1,24 +0,0 @@
#ifndef HUSKY_GLOBALS_H
#define HUSKY_GLOBALS_H
#include <string>
#include <vector>
#include <set>
#include <string>
#include <stdlib.h>
namespace Husky
{
const char* const RESPONSE_CHARSET_UTF8 = "UTF-8";
const char* const RESPONSE_CHARSET_GB2312 = "GB2312";
const char* const CLIENT_IP_K = "CLIENT_IP";
const unsigned int SOCKET_TIMEOUT = 2;
const char* const RESPONSE_FORMAT = "HTTP/1.1 200 OK\r\nConnection: close\r\nServer: FrameServer/1.0.0\r\nContent-Type: text/json; charset=%s\r\nContent-Length: %d\r\n\r\n";
typedef unsigned short u_short;
typedef unsigned int u_int;
}
#endif

View File

@ -1,18 +0,0 @@
#ifndef CPPJIEBA_SEGMENTINTERFACE_H
#define CPPJIEBA_SEGMENTINTERFACE_H
#include "globals.h"
namespace CppJieba
{
class ISegment
{
//public:
// virtual ~ISegment(){};
public:
virtual bool cut(Unicode::const_iterator begin , Unicode::const_iterator end, vector<string>& res) const = 0;
virtual bool cut(const string& str, vector<string>& res) const = 0;
};
}
#endif

View File

@ -1,90 +0,0 @@
/************************************
* file enc : ascii
* author : wuyanyi09@gmail.com
************************************/
#ifndef LIMONP_ARGV_FUNCTS_H
#define LIMONP_ARGV_FUNCTS_H
#include <set>
#include <sstream>
#include "str_functs.hpp"
#include "map_functs.hpp"
namespace Limonp
{
using namespace std;
class ArgvContext
{
public :
ArgvContext(int argc, const char* const * argv)
{
for(int i = 0; i < argc; i++)
{
if(strStartsWith(argv[i], "-"))
{
if(i + 1 < argc && !strStartsWith(argv[i + 1], "-"))
{
_mpss[argv[i]] = argv[i+1];
i++;
}
else
{
_sset.insert(argv[i]);
}
}
else
{
_args.push_back(argv[i]);
}
}
}
~ArgvContext(){};
public:
friend ostream& operator << (ostream& os, const ArgvContext& args);
string operator [](uint i) const
{
if(i < _args.size())
{
return _args[i];
}
return "";
}
string operator [](const string& key) const
{
map<string, string>::const_iterator it = _mpss.find(key);
if(it != _mpss.end())
{
return it->second;
}
return "";
}
public:
bool hasKey(const string& key) const
{
if(_mpss.find(key) != _mpss.end() || _sset.find(key) != _sset.end())
{
return true;
}
return false;
}
private:
vector<string> _args;
map<string, string> _mpss;
set<string> _sset;
};
inline ostream& operator << (ostream& os, const ArgvContext& args)
{
return os<<args._args<<args._mpss<<args._sset;
}
//string toString()
//{
// stringstream ss;
// return ss.str();
//}
}
#endif

View File

@ -1 +0,0 @@
INSTALL(FILES ArgvContext.hpp io_functs.hpp macro_def.hpp MysqlClient.hpp str_functs.hpp cast_functs.hpp Config.hpp logger.hpp map_functs.hpp std_outbound.hpp DESTINATION include/CppJieba/Limonp)

View File

@ -1,82 +0,0 @@
/************************************
* file enc : utf8
* author : wuyanyi09@gmail.com
************************************/
#ifndef LIMONP_CONFIG_H
#define LIMONP_CONFIG_H
#include <map>
#include <fstream>
#include <iostream>
#include "logger.hpp"
#include "str_functs.hpp"
namespace Limonp
{
using namespace std;
class Config
{
public:
bool loadFile(const char * const filePath)
{
ifstream ifs(filePath);
if(!ifs)
{
LogFatal("open file[%s] failed.", filePath);
return false;
}
string line;
vector<string> vecBuf;
uint lineno = 0;
while(getline(ifs, line))
{
lineno ++;
trim(line);
if(line.empty() || strStartsWith(line, "#"))
{
continue;
}
vecBuf.clear();
if(!splitStr(line, vecBuf, "=") || 2 != vecBuf.size())
{
LogFatal("line[%d:%s] is illegal.", lineno, line.c_str());
return false;
}
string& key = vecBuf[0];
string& value = vecBuf[1];
trim(key);
trim(value);
if(_map.end() != _map.find(key))
{
LogFatal("key[%s] already exists.", key.c_str());
return false;
}
_map[key] = value;
}
ifs.close();
return true;
}
bool get(const string& key, string& value) const
{
map<string, string>::const_iterator it = _map.find(key);
if(_map.end() != it)
{
value = it->second;
return true;
}
return false;
}
private:
map<string, string> _map;
private:
friend ostream& operator << (ostream& os, const Config& config);
};
ostream& operator << (ostream& os, const Config& config)
{
return os << config._map;
}
}
#endif

View File

@ -1,126 +0,0 @@
#ifndef LIMONP_MYSQLCLIENT_H
#define LIMONP_MYSQLCLIENT_H
#include <mysql.h>
#include <iostream>
#include <vector>
#include <string>
#include "logger.hpp"
namespace Limonp
{
using namespace std;
class MysqlClient
{
public:
typedef vector< vector<string> > RowsType;
private:
const char * const HOST;
const unsigned int PORT;
const char * const USER;
const char * const PASSWD;
const char * const DB;
const char * const CHARSET;
public:
MysqlClient(const char* host, uint port, const char* user, const char* passwd, const char* db, const char* charset = "utf8"): HOST(host), PORT(port), USER(user), PASSWD(passwd), DB(db), CHARSET(charset){ _conn = NULL;};
~MysqlClient(){dispose();};
public:
bool init()
{
//cout<<mysql_get_client_info()<<endl;
if(NULL == (_conn = mysql_init(NULL)))
{
LogError("mysql_init faield. %s", mysql_error(_conn));
return false;
}
if (mysql_real_connect(_conn, HOST, USER, PASSWD, DB, PORT, NULL, 0) == NULL)
{
LogError("mysql_real_connect failed. %s", mysql_error(_conn));
mysql_close(_conn);
_conn = NULL;
return false;
}
if(mysql_set_character_set(_conn, CHARSET))
{
LogError("mysql_set_character_set [%s] failed.", CHARSET);
return false;
}
//set reconenct
char value = 1;
mysql_options(_conn, MYSQL_OPT_RECONNECT, &value);
LogInfo("MysqlClient {host: %s, port:%d, database:%s, charset:%s}", HOST, PORT, DB, CHARSET);
return true;
}
bool dispose()
{
if(NULL != _conn)
{
mysql_close(_conn);
}
_conn = NULL;
return true;
}
bool executeSql(const char* sql)
{
if(NULL == _conn)
{
LogError("_conn is NULL");
return false;
}
if(mysql_query(_conn, sql))
{
LogError("mysql_query failed. %s", mysql_error(_conn));
return false;
}
return true;
}
uint insert(const char* tb_name, const char* keys, const vector<string>& vals)
{
uint retn = 0;
string sql;
for(uint i = 0; i < vals.size(); i ++)
{
sql.clear();
string_format(sql, "insert into %s (%s) values %s", tb_name, keys, vals[i].c_str());
retn += executeSql(sql.c_str());
}
return retn;
}
bool select(const char* sql, RowsType& rows)
{
if(!executeSql(sql))
{
LogError("executeSql failed. [%s]", sql);
return false;
}
MYSQL_RES * result = mysql_store_result(_conn);
if(NULL == result)
{
LogError("mysql_store_result failed.[%d]", mysql_error(_conn));
}
uint num_fields = mysql_num_fields(result);
MYSQL_ROW row;
while((row = mysql_fetch_row(result)))
{
vector<string> vec;
for(uint i = 0; i < num_fields; i ++)
{
row[i] ? vec.push_back(row[i]) : vec.push_back("NULL");
}
rows.push_back(vec);
}
mysql_free_result(result);
return true;
}
private:
MYSQL * _conn;
};
}
#endif
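A hedged usage sketch of the MysqlClient class above, assuming the legacy "Limonp/MysqlClient.hpp" include path, the MySQL client headers, and a reachable server; host, credentials, and table name are placeholders:
#include <iostream>
#include "Limonp/MysqlClient.hpp" // assumed legacy include path; requires libmysqlclient

int main() {
    Limonp::MysqlClient client("127.0.0.1", 3306, "user", "passwd", "testdb"); // placeholder connection parameters
    if(!client.init()) {
        return 1;
    }
    Limonp::MysqlClient::RowsType rows;
    if(client.select("select id, name from demo_table", rows)) { // hypothetical table
        std::cout << rows.size() << " rows fetched" << std::endl;
    }
    client.dispose();
    return 0;
}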


@ -1,87 +0,0 @@
#ifndef LIMONP_CAST_FUNCTS_H
#define LIMONP_CAST_FUNCTS_H
namespace Limonp
{
//logical and or
static const int sign_32 = 0xC0000000;
static const int exponent_32 = 0x07800000;
static const int mantissa_32 = 0x007FE000;
static const int sign_exponent_32 = 0x40000000;
static const int loss_32 = 0x38000000;
static const short sign_16 = (short)0xC000;
static const short exponent_16 = (short)0x3C00;
static const short mantissa_16 = (short)0x03FF;
static const short sign_exponent_16 = (short)0x4000;
static const int exponent_fill_32 = 0x38000000;
//infinite
static const short infinite_16 = (short) 0x7FFF;
static const short infinitesmall_16 = (short) 0x0000;
inline float intBitsToFloat(unsigned int x)
{
union
{
float f;
int i;
}u;
u.i = x;
return u.f;
}
inline int floatToIntBits(float f)
{
union
{
float f;
int i ;
}u;
u.f = f;
return u.i;
}
inline short floatToShortBits(float f)
{
int fi = floatToIntBits(f);
// extract the key fields: sign, exponent, mantissa
short sign = (short) ((unsigned int)(fi & sign_32) >> 16);
short exponent = (short) ((unsigned int)(fi & exponent_32) >> 13);
short mantissa = (short) ((unsigned int)(fi & mantissa_32) >> 13);
// assemble the encoded 16-bit result
short code = (short) (sign | exponent | mantissa);
// handle the infinity and infinitesimal cases
if ((fi & loss_32) > 0 && (fi & sign_exponent_32) > 0) {
// when the exponent sign is 1 (positive power) and the next three exponent bits are set, return the infinity code
return (short) (code | infinite_16);
}
if (((fi & loss_32) ^ loss_32) > 0 && (fi & sign_exponent_32) == 0) {
// when the exponent sign is 0 (negative power) and the next three exponent bits are 0 (XOR with 111 > 0), return the infinitesimal code
return infinitesmall_16;
}
return code;
}
inline float shortBitsToFloat(short s)
{
/*
* expand the 16-bit sign/exponent/mantissa fields back into their 32-bit float positions
*/
int sign = ((int) (s & sign_16)) << 16;
int exponent = ((int) (s & exponent_16)) << 13;
// when the exponent sign bit is 0 (and the value is non-zero), fill the upper exponent bits with 1s
if ((s & sign_exponent_16) == 0 && s != 0) {
exponent |= exponent_fill_32;
}
int mantissa = ((int) (s & mantissa_16)) << 13;
// assemble the decoded 32-bit result
int code = sign | exponent | mantissa;
return intBitsToFloat(code);
}
}
#endif
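A small round-trip sketch for the float/short conversions above; the 16-bit encoding drops mantissa precision, so the reconstruction is only approximate. The include path is assumed:
#include <cstdio>
#include "Limonp/cast_functs.hpp" // assumed legacy include path

int main() {
    float f = 3.14f;
    short s = Limonp::floatToShortBits(f);    // lossy 16-bit encoding
    float back = Limonp::shortBitsToFloat(s); // approximate reconstruction
    std::printf("%f -> 0x%04x -> %f\n", f, (unsigned)(unsigned short)s, back);
    return 0;
}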


@ -1,82 +0,0 @@
/************************************
* file enc : utf8
* author : wuyanyi09@gmail.com
************************************/
#ifndef LIMONP_IO_FUNCTS_H
#define LIMONP_IO_FUNCTS_H
#include <fstream>
#include <iostream>
#include <stdlib.h>
namespace Limonp
{
using namespace std;
inline string loadFile2Str(const char * const filepath)
{
ifstream in(filepath);
if(!in)
{
return "";
}
istreambuf_iterator<char> beg(in), end;
string str(beg, end);
in.close();
return str;
}
inline void loadStr2File(const char * const filename, ios_base::openmode mode, const string& str)
{
ofstream out(filename, mode);
ostreambuf_iterator<char> itr (out);
copy(str.begin(), str.end(), itr);
out.close();
}
inline int ReadFromFile(const char * fileName, char* buf, int maxCount, const char* mode)
{
FILE* fp = fopen(fileName, mode);
if (!fp)
return 0;
int ret;
fgets(buf, maxCount, fp) ? ret = 1 : ret = 0;
fclose(fp);
return ret;
}
inline int WriteStr2File(const char* fileName, const char* buf, const char* mode)
{
FILE* fp = fopen(fileName, mode);
if (!fp)
return 0;
int n = fprintf(fp, "%s", buf);
fclose(fp);
return n;
}
inline bool checkFileExist(const string& filePath)
{
fstream _file;
_file.open(filePath.c_str(), ios::in);
if(_file)
return true;
return false;
}
inline bool createDir(const string& dirPath, bool p = true)
{
string dir_str(dirPath);
string cmd = "mkdir";
if(p)
{
cmd += " -p";
}
cmd += " " + dir_str;
int res = system(cmd.c_str());
return 0 == res;
}
inline bool checkDirExist(const string& dirPath)
{
return checkFileExist(dirPath);
}
}
#endif
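A short sketch of the helpers above (write a file, check it, read it back); the file name is a placeholder and the include path is assumed:
#include <iostream>
#include "Limonp/io_functs.hpp" // assumed legacy include path

int main() {
    const char* path = "io_functs_demo.txt"; // placeholder file name
    Limonp::loadStr2File(path, std::ios::out, "hello limonp\n");
    if(Limonp::checkFileExist(path)) {
        std::cout << Limonp::loadFile2Str(path); // prints the line written above
    }
    return 0;
}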


@ -1,80 +0,0 @@
/************************************
* file enc : utf8
* author : wuyanyi09@gmail.com
************************************/
#ifndef LIMONP_LOGGER_H
#define LIMONP_LOGGER_H
#include <vector>
#include <iostream>
#include <fstream>
#include <string>
#include <stdio.h>
#include <stdarg.h>
#include "io_functs.hpp"
#include "str_functs.hpp"
#define FILE_BASENAME strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__
#define LogDebug(fmt, ...) Logger::LoggingF(LL_DEBUG, FILE_BASENAME, __LINE__, fmt, ## __VA_ARGS__)
#define LogInfo(fmt, ...) Logger::LoggingF(LL_INFO, FILE_BASENAME, __LINE__, fmt, ## __VA_ARGS__)
#define LogWarn(fmt, ...) Logger::LoggingF(LL_WARN, FILE_BASENAME, __LINE__, fmt, ## __VA_ARGS__)
#define LogError(fmt, ...) Logger::LoggingF(LL_ERROR, FILE_BASENAME, __LINE__, fmt, ## __VA_ARGS__)
#define LogFatal(fmt, ...) Logger::LoggingF(LL_FATAL, FILE_BASENAME, __LINE__, fmt, ## __VA_ARGS__)
namespace Limonp
{
using namespace std;
enum {LL_DEBUG = 0, LL_INFO = 1, LL_WARN = 2, LL_ERROR = 3, LL_FATAL = 4, LEVEL_ARRAY_SIZE = 5, CSTR_BUFFER_SIZE = 1024};
static const char * LOG_LEVEL_ARRAY[LEVEL_ARRAY_SIZE]= {"DEBUG","INFO","WARN","ERROR","FATAL"};
static const char * LOG_FORMAT = "%s %s:%d %s %s\n";
static const char * LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S";
class Logger
{
public:
static bool Logging(uint level, const string& msg, const char* fileName, int lineNo)
{
if(level > LL_FATAL)
{
cerr<<"level's value is out of range"<<endl;
return false;
}
char buf[CSTR_BUFFER_SIZE];
time_t timeNow;
time(&timeNow);
size_t ret = strftime(buf, sizeof(buf), LOG_TIME_FORMAT, localtime(&timeNow));
if(0 == ret)
{
fprintf(stderr, "stftime failed.\n");
return false;
}
fprintf(stderr, LOG_FORMAT, buf, fileName, lineNo,LOG_LEVEL_ARRAY[level], msg.c_str());
return true;
}
static bool LoggingF(uint level, const char* fileName, int lineNo, const string& fmt, ...)
{
int size = 256;
string msg;
va_list ap;
while (1) {
msg.resize(size);
va_start(ap, fmt);
int n = vsnprintf((char *)msg.c_str(), size, fmt.c_str(), ap);
va_end(ap);
if (n > -1 && n < size) {
msg.resize(n);
break;
}
if (n > -1)
size = n + 1;
else
size *= 2;
}
return Logging(level, msg, fileName, lineNo);
}
};
}
#endif
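A minimal sketch of the logging macros above; they expand to Logger::LoggingF with printf-style formatting. <cstring> is included explicitly because the FILE_BASENAME macro relies on strrchr; the include path is assumed:
#include <cstring> // strrchr, used by the FILE_BASENAME macro
#include "Limonp/logger.hpp" // assumed legacy include path
using namespace Limonp;

int main() {
    LogInfo("service started on port %d", 11200);
    LogError("open file[%s] failed.", "not_exist.conf"); // placeholder file name
    return 0;
}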


@ -1,22 +0,0 @@
#ifndef LIMONP_MACRO_DEF_H
#define LIMONP_MACRO_DEF_H
#define XX_GET_SET(varType, varName, funName)\
private: varType varName;\
public: inline varType get##funName(void) const {return varName;}\
public: inline void set##funName(varType var) {varName = var;}
#define XX_GET(varType, varName, funName)\
private: varType varName;\
public: inline varType get##funName(void) const {return varName;}
#define XX_SET(varType, varName, funName)\
private: varType varName;\
public: inline void set##funName(varType var) {varName = var;}
#define XX_GET_SET_BY_REF(varType, varName, funName)\
private: varType varName;\
public: inline const varType& get##funName(void) const {return varName;}\
public: inline void set##funName(const varType& var){varName = var;}
#endif
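A sketch showing what the XX_GET_SET macro expands to in practice; Point and its members are made up for illustration:
#include "Limonp/macro_def.hpp" // assumed legacy include path

class Point {
    XX_GET_SET(int, _x, X) // expands to: private _x, public getX()/setX()
    XX_GET_SET(int, _y, Y)
};

int main() {
    Point p;
    p.setX(3);
    p.setY(4);
    return (p.getX() + p.getY() == 7) ? 0 : 1;
}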


@ -1,45 +0,0 @@
/************************************
* file enc : ascii
* author : wuyanyi09@gmail.com
************************************/
#ifndef LIMONP_MAP_FUNCTS_H
#define LIMONP_MAP_FUNCTS_H
#include <map>
#include <set>
#include <iostream>
#include <sstream>
#include <unordered_map>
#define HashMap std::unordered_map
namespace Limonp
{
using namespace std;
template<class kT, class vT>
vT getMap(const map<kT, vT>& mp, const kT & key, const vT & defaultVal)
{
typename map<kT, vT>::const_iterator it;
it = mp.find(key);
if(mp.end() == it)
{
return defaultVal;
}
return it->second;
}
template<class kT, class vT>
void map2Vec(const map<kT, vT>& mp, vector<pair<kT, vT> > & res)
{
typename map<kT, vT>::const_iterator it = mp.begin();
for(; it != mp.end(); it++)
{
res.push_back(*it);
}
}
}
#endif
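A minimal sketch of getMap, which returns the mapped value or the supplied default when the key is missing; the map contents are made up:
#include <iostream>
#include <string>
#include "Limonp/map_functs.hpp" // assumed legacy include path

int main() {
    std::map<std::string, int> freq;
    freq["jieba"] = 42;
    std::cout << Limonp::getMap(freq, std::string("jieba"), 0) << std::endl;    // 42
    std::cout << Limonp::getMap(freq, std::string("missing"), -1) << std::endl; // -1 (default)
    return 0;
}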


@ -1,101 +0,0 @@
#ifndef LIMONP_STD_OUTBOUND_H
#define LIMONP_STD_OUTBOUND_H
#include "map_functs.hpp"
#include <map>
#include <set>
namespace std
{
template<typename T>
ostream& operator << (ostream& os, const vector<T>& vec)
{
if(vec.empty())
{
return os << "[]";
}
os<<"[\""<<vec[0];
for(uint i = 1; i < vec.size(); i++)
{
os<<"\", \""<<vec[i];
}
os<<"\"]";
return os;
}
template<class T1, class T2>
ostream& operator << (ostream& os, const pair<T1, T2>& pr)
{
os << pr.first << ":" << pr.second ;
return os;
}
template<class T>
string& operator << (string& str, const T& obj)
{
stringstream ss;
ss << obj; // call ostream& operator << (ostream& os,
return str = ss.str();
}
template<class T1, class T2>
ostream& operator << (ostream& os, const map<T1, T2>& mp)
{
if(mp.empty())
{
os<<"{}";
return os;
}
os<<'{';
typename map<T1, T2>::const_iterator it = mp.begin();
os<<*it;
it++;
while(it != mp.end())
{
os<<", "<<*it;
it++;
}
os<<'}';
return os;
}
template<class T1, class T2>
ostream& operator << (ostream& os, const HashMap<T1, T2>& mp)
{
if(mp.empty())
{
return os << "{}";
}
os<<'{';
typename HashMap<T1, T2>::const_iterator it = mp.begin();
os<<*it;
it++;
while(it != mp.end())
{
os<<", "<<*it++;
}
return os<<'}';
}
template<class T>
ostream& operator << (ostream& os, const set<T>& st)
{
if(st.empty())
{
os << "{}";
return os;
}
os<<'{';
typename set<T>::const_iterator it = st.begin();
os<<*it;
it++;
while(it != st.end())
{
os<<", "<<*it;
it++;
}
os<<'}';
return os;
}
}
#endif
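A small sketch of the stream operators above, which print containers in a JSON-like form; <sys/types.h> is included because the vector overload uses the uint typedef, and the include path is assumed:
#include <sys/types.h> // uint, used by the vector overload above
#include <iostream>
#include <map>
#include <string>
#include <vector>
#include "Limonp/std_outbound.hpp" // assumed legacy include path

int main() {
    std::vector<std::string> words;
    words.push_back("南京市");
    words.push_back("长江大桥");
    std::map<std::string, int> counts;
    counts["jieba"] = 1;
    std::cout << words << std::endl;  // ["南京市", "长江大桥"]
    std::cout << counts << std::endl; // {jieba:1}
    return 0;
}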


@ -1,258 +0,0 @@
/************************************
* file enc : ascii
* author : wuyanyi09@gmail.com
************************************/
#ifndef LIMONP_STR_FUNCTS_H
#define LIMONP_STR_FUNCTS_H
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <cctype>
#include <map>
#include <stdint.h>
#include <stdio.h>
#include <stdarg.h>
#include <memory.h>
#include <functional>
#include <locale>
#include <sstream>
#include <sys/types.h>
#include <iterator>
#include <algorithm>
#include "std_outbound.hpp"
#include "map_functs.hpp"
#define print(x) cout<<(x)<<endl
namespace Limonp
{
using namespace std;
inline string string_format(const char* fmt, ...)
{
int size = 256;
std::string str;
va_list ap;
while (1) {
str.resize(size);
va_start(ap, fmt);
int n = vsnprintf((char *)str.c_str(), size, fmt, ap);
va_end(ap);
if (n > -1 && n < size) {
str.resize(n);
return str;
}
if (n > -1)
size = n + 1;
else
size *= 2;
}
return str;
}
inline void string_format(string& res, const char* fmt, ...)
{
int size = 256;
va_list ap;
res.clear();
while (1) {
res.resize(size);
va_start(ap, fmt);
int n = vsnprintf((char *)res.c_str(), size, fmt, ap);
va_end(ap);
if (n > -1 && n < size) {
res.resize(n);
return;
}
if (n > -1)
size = n + 1;
else
size *= 2;
}
}
//inline bool joinStr(const vector<string>& src, string& dest, const string& connectorStr)
//{
// if(src.empty())
// {
// return false;
// }
// for(uint i = 0; i < src.size() - 1; i++)
// {
// dest += src[i];
// dest += connectorStr;
// }
// dest += src[src.size() - 1];
// return true;
//}
//inline string joinStr(const vector<string>& source, const string& connector)
//{
// string res;
// joinStr(source, res, connector);
// return res;
//}
template<class T>
void join(T begin, T end, string& res, const string& connector)
{
if(begin == end)
{
return;
}
stringstream ss;
ss<<*begin;
begin++;
while(begin != end)
{
ss << connector << *begin;
begin ++;
}
res = ss.str();
}
template<class T>
string join(T begin, T end, const string& connector)
{
string res;
join(begin ,end, res, connector);
return res;
}
inline bool splitStr(const string& src, vector<string>& res, const string& pattern)
{
if(src.empty())
{
return false;
}
res.clear();
size_t start = 0;
size_t end = 0;
while(start < src.size())
{
end = src.find_first_of(pattern, start);
if(string::npos == end)
{
res.push_back(src.substr(start));
return true;
}
res.push_back(src.substr(start, end - start));
if(end == src.size() - 1)
{
res.push_back("");
break;
}
start = end + 1;
}
return true;
}
inline string& upper(string& str)
{
transform(str.begin(), str.end(), str.begin(), (int (*)(int))toupper);
return str;
}
inline string& lower(string& str)
{
transform(str.begin(), str.end(), str.begin(), (int (*)(int))tolower);
return str;
}
inline std::string &ltrim(std::string &s)
{
s.erase(s.begin(), std::find_if(s.begin(), s.end(), std::not1(std::ptr_fun<int, int>(std::isspace))));
return s;
}
inline std::string &rtrim(std::string &s)
{
s.erase(std::find_if(s.rbegin(), s.rend(), std::not1(std::ptr_fun<int, int>(std::isspace))).base(), s.end());
return s;
}
inline std::string &trim(std::string &s)
{
return ltrim(rtrim(s));
}
inline uint16_t twocharToUint16(char high, char low)
{
return (((uint16_t(high) & 0x00ff ) << 8) | (uint16_t(low) & 0x00ff));
}
inline pair<char, char> uint16ToChar2(uint16_t in)
{
pair<char, char> res;
res.first = (in>>8) & 0x00ff; //high
res.second = (in) & 0x00ff; //low
return res;
}
inline bool strStartsWith(const string& str, const string& prefix)
{
//return str.substr(0, prefix.size()) == prefix;
if(prefix.length() > str.length())
{
return false;
}
return 0 == str.compare(0, prefix.length(), prefix);
}
inline bool strEndsWith(const string& str, const string& suffix)
{
if(suffix.length() > str.length())
{
return false;
}
return 0 == str.compare(str.length() - suffix.length(), suffix.length(), suffix);
}
inline bool isInStr(const string& str, char ch)
{
return str.find(ch) != string::npos;
}
//inline void extractWords(const string& sentence, vector<string>& words)
//{
// bool flag = false;
// uint lhs = 0, len = 0;
// for(uint i = 0; i < sentence.size(); i++)
// {
// char x = sentence[i];
// if((0x0030 <= x && x<= 0x0039) || (0x0041 <= x && x <= 0x005a ) || (0x0061 <= x && x <= 0x007a))
// {
// if(flag)
// {
// len ++;
// }
// else
// {
// lhs = i;
// len = 1;
// }
// flag = true;
// }
// else
// {
// if(flag)
// {
// words.push_back(string(sentence, lhs, len));
// }
// flag = false;
// }
// }
// if(flag)
// {
// words.push_back(string(sentence, lhs, len));
// }
//}
}
#endif
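A brief sketch exercising trim, splitStr, join, and string_format from the header above, roughly mirroring how Config parses key=value lines; the include path is assumed:
#include <iostream>
#include <string>
#include <vector>
#include "Limonp/str_functs.hpp" // assumed legacy include path

int main() {
    std::string line = "  port = 11200  ";
    Limonp::trim(line);                 // "port = 11200"
    std::vector<std::string> parts;
    Limonp::splitStr(line, parts, "="); // {"port ", " 11200"}
    std::cout << Limonp::join(parts.begin(), parts.end(), "|") << std::endl;                     // port | 11200
    std::cout << Limonp::string_format("key=[%s]", Limonp::trim(parts[0]).c_str()) << std::endl; // key=[port]
    return 0;
}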


@ -1,265 +0,0 @@
/************************************
* file enc : ASCII
* author : wuyanyi09@gmail.com
************************************/
#include "MPSegment.h"
namespace CppJieba
{
bool MPSegment::init(const char* const filePath)
{
if(_getInitFlag())
{
LogError("already inited before now.");
return false;
}
if(!_trie.init())
{
LogError("_trie.init failed.");
return false;
}
LogInfo("_trie.loadDict(%s) start...", filePath);
if(!_trie.loadDict(filePath))
{
LogError("_trie.loadDict faield.");
return false;
}
LogInfo("_trie.loadDict end.");
return _setInitFlag(true);
}
bool MPSegment::dispose()
{
if(!_getInitFlag())
{
return true;
}
_trie.dispose();
_setInitFlag(false);
return true;
}
bool MPSegment::cut(const string& str, vector<string>& res)const
{
return SegmentBase::cut(str, res);
}
bool MPSegment::cut(Unicode::const_iterator begin, Unicode::const_iterator end, vector<string>& res)const
{
if(!_getInitFlag())
{
LogError("not inited.");
return false;
}
vector<TrieNodeInfo> segWordInfos;
if(!cut(begin, end, segWordInfos))
{
return false;
}
string tmp;
for(uint i = 0; i < segWordInfos.size(); i++)
{
if(TransCode::encode(segWordInfos[i].word, tmp))
{
res.push_back(tmp);
}
else
{
LogError("encode failed.");
}
}
return true;
}
bool MPSegment::cut(Unicode::const_iterator begin , Unicode::const_iterator end, vector<TrieNodeInfo>& segWordInfos)const
{
if(!_getInitFlag())
{
LogError("not inited.");
return false;
}
SegmentContext segContext;
for(Unicode::const_iterator it = begin; it != end; it++)
{
segContext.push_back(SegmentChar(*it));
}
//calc DAG
if(!_calcDAG(segContext))
{
LogError("_calcDAG failed.");
return false;
}
if(!_calcDP(segContext))
{
LogError("_calcDP failed.");
return false;
}
if(!_cut(segContext, segWordInfos))
{
LogError("_cut failed.");
return false;
}
return true;
}
bool MPSegment::cut(const string& str, vector<TrieNodeInfo>& segWordInfos)const
{
if(!_getInitFlag())
{
LogError("not inited.");
return false;
}
if(str.empty())
{
return false;
}
Unicode sentence;
if(!TransCode::decode(str, sentence))
{
LogError("TransCode::decode failed.");
return false;
}
return cut(sentence.begin(), sentence.end(), segWordInfos);
}
bool MPSegment::_calcDAG(SegmentContext& segContext)const
{
if(segContext.empty())
{
LogError("segContext empty.");
return false;
}
Unicode unicode;
for(uint i = 0; i < segContext.size(); i++)
{
unicode.clear();
for(uint j = i ; j < segContext.size(); j++)
{
unicode.push_back(segContext[j].uniCh);
}
vector<pair<uint, const TrieNodeInfo*> > vp;
if(_trie.find(unicode, vp))
{
for(uint j = 0; j < vp.size(); j++)
{
uint nextp = vp[j].first + i;
segContext[i].dag[nextp] = vp[j].second;
//cout<<vp[j].first<<endl;
//LogDebug(vp[j].second->toString());
}
}
if(segContext[i].dag.end() == segContext[i].dag.find(i))
{
segContext[i].dag[i] = NULL;
}
}
return true;
}
bool MPSegment::_calcDP(SegmentContext& segContext)const
{
if(segContext.empty())
{
LogError("segContext empty");
return false;
}
for(int i = segContext.size() - 1; i >= 0; i--)
{
segContext[i].pInfo = NULL;
segContext[i].weight = MIN_DOUBLE;
for(DagType::const_iterator it = segContext[i].dag.begin(); it != segContext[i].dag.end(); it++)
{
uint nextPos = it->first;
const TrieNodeInfo* p = it->second;
double val = 0.0;
if(nextPos + 1 < segContext.size())
{
val += segContext[nextPos + 1].weight;
}
if(p)
{
val += p->logFreq;
}
else
{
val += _trie.getMinLogFreq();
}
if(val > segContext[i].weight)
{
segContext[i].pInfo = p;
segContext[i].weight = val;
}
}
}
return true;
}
bool MPSegment::_cut(SegmentContext& segContext, vector<TrieNodeInfo>& res)const
{
uint i = 0;
while(i < segContext.size())
{
const TrieNodeInfo* p = segContext[i].pInfo;
if(p)
{
res.push_back(*p);
i += p->word.size();
}
else//single chinese word
{
TrieNodeInfo nodeInfo;
nodeInfo.word.push_back(segContext[i].uniCh);
nodeInfo.freq = 0;
nodeInfo.logFreq = _trie.getMinLogFreq();
res.push_back(nodeInfo);
i++;
}
}
return true;
}
}
#ifdef SEGMENT_UT
using namespace CppJieba;
int main()
{
MPSegment segment;
segment.init();
if(!segment._loadSegDict("../dicts/segdict.gbk.v3.0"))
{
cerr<<"1"<<endl;
return 1;
}
//segment.init("dicts/jieba.dict.utf8");
//ifstream ifile("testtitle.gbk");
ifstream ifile("badcase");
vector<string> res;
string line;
while(getline(ifile, line))
{
res.clear();
segment.cut(line, res);
PRINT_VECTOR(res);
getchar();
}
segment.dispose();
return 0;
}
#endif


@ -1,49 +0,0 @@
/************************************
* file enc : ASCII
* author : wuyanyi09@gmail.com
************************************/
#ifndef CPPJIEBA_MPSEGMENT_H
#define CPPJIEBA_MPSEGMENT_H
#include <algorithm>
#include <set>
#include "Limonp/logger.hpp"
#include "Trie.h"
#include "globals.h"
#include "ISegment.hpp"
#include "SegmentBase.hpp"
namespace CppJieba
{
typedef vector<SegmentChar> SegmentContext;
class MPSegment: public SegmentBase
{
private:
Trie _trie;
public:
MPSegment(){};
virtual ~MPSegment(){dispose();};
public:
bool init(const char* const filePath);
bool dispose();
public:
//bool cut(const string& str, vector<TrieNodeInfo>& segWordInfos)const;
bool cut(const string& str, vector<string>& res)const;
bool cut(Unicode::const_iterator begin, Unicode::const_iterator end, vector<string>& res)const;
bool cut(const string& str, vector<TrieNodeInfo>& segWordInfos)const;
bool cut(Unicode::const_iterator begin , Unicode::const_iterator end, vector<TrieNodeInfo>& segWordInfos)const;
//virtual bool cut(const string& str, vector<string>& res)const;
private:
bool _calcDAG(SegmentContext& segContext)const;
bool _calcDP(SegmentContext& segContext)const;
bool _cut(SegmentContext& segContext, vector<TrieNodeInfo>& res)const;
};
}
#endif


@ -1,125 +0,0 @@
#include "MixSegment.h"
namespace CppJieba
{
MixSegment::MixSegment()
{
}
MixSegment::~MixSegment()
{
dispose();
}
bool MixSegment::init(const char* const mpSegDict, const char* const hmmSegDict)
{
if(_getInitFlag())
{
LogError("inited.");
return false;
}
if(!_mpSeg.init(mpSegDict))
{
LogError("_mpSeg init");
return false;
}
if(!_hmmSeg.init(hmmSegDict))
{
LogError("_hmmSeg init");
return false;
}
return _setInitFlag(true);
}
bool MixSegment::dispose()
{
if(!_getInitFlag())
{
return true;
}
_mpSeg.dispose();
_hmmSeg.dispose();
_setInitFlag(false);
return true;
}
bool MixSegment::cut(const string& str, vector<string>& res)const
{
return SegmentBase::cut(str, res);
}
bool MixSegment::cut(Unicode::const_iterator begin, Unicode::const_iterator end, vector<string>& res)const
{
if(!_getInitFlag())
{
LogError("not inited.");
return false;
}
if(begin == end)
{
return false;
}
vector<TrieNodeInfo> infos;
if(!_mpSeg.cut(begin, end, infos))
{
LogError("mpSeg cutDAG failed.");
return false;
}
Unicode unico;
vector<Unicode> hmmRes;
string tmp;
for(uint i= 0; i < infos.size(); i++)
{
TransCode::encode(infos[i].word,tmp);
if(1 == infos[i].word.size())
{
unico.push_back(infos[i].word[0]);
}
else
{
if(!unico.empty())
{
hmmRes.clear();
if(!_hmmSeg.cut(unico.begin(), unico.end(), hmmRes))
{
LogError("_hmmSeg cut failed.");
return false;
}
for(uint j = 0; j < hmmRes.size(); j++)
{
TransCode::encode(hmmRes[j], tmp);
res.push_back(tmp);
}
}
unico.clear();
TransCode::encode(infos[i].word, tmp);
res.push_back(tmp);
}
}
if(!unico.empty())
{
hmmRes.clear();
if(!_hmmSeg.cut(unico.begin(), unico.end(), hmmRes))
{
LogError("_hmmSeg cut failed.");
return false;
}
for(uint j = 0; j < hmmRes.size(); j++)
{
TransCode::encode(hmmRes[j], tmp);
res.push_back(tmp);
}
}
return true;
}
}
#ifdef MIXSEGMENT_UT
using namespace CppJieba;
int main()
{
return 0;
}
#endif


@ -1,28 +0,0 @@
#ifndef CPPJIEBA_MIXSEGMENT_H
#define CPPJIEBA_MIXSEGMENT_H
#include "MPSegment.h"
#include "HMMSegment.h"
#include "Limonp/str_functs.hpp"
namespace CppJieba
{
class MixSegment: public SegmentBase
{
private:
MPSegment _mpSeg;
HMMSegment _hmmSeg;
public:
MixSegment();
virtual ~MixSegment();
public:
bool init(const char* const _mpSegDict, const char* const _hmmSegDict);
bool dispose();
public:
//virtual bool cut(const string& str, vector<string>& res) const;
bool cut(const string& str, vector<string>& res)const;
bool cut(Unicode::const_iterator begin, Unicode::const_iterator end, vector<string>& res)const;
};
}
#endif


@ -1,54 +0,0 @@
#ifndef CPPJIEBA_SEGMENTBASE_H
#define CPPJIEBA_SEGMENTBASE_H
#include "globals.h"
#include "ISegment.hpp"
#include "ChineseFilter.hpp"
#include "Limonp/str_functs.hpp"
#include "Limonp/logger.hpp"
namespace CppJieba
{
using namespace Limonp;
class SegmentBase: public ISegment
{
public:
SegmentBase(){_setInitFlag(false);};
virtual ~SegmentBase(){};
private:
bool _isInited;
protected:
bool _getInitFlag()const{return _isInited;};
bool _setInitFlag(bool flag){return _isInited = flag;};
bool cut(const string& str, vector<string>& res)const
{
if(!_getInitFlag())
{
LogError("not inited.");
return false;
}
ChineseFilter filter;
filter.feed(str);
for(ChineseFilter::iterator it = filter.begin(); it != filter.end(); it++)
{
if(it.charType == CHWORD)
{
cut(it.begin, it.end, res);
}
else
{
string tmp;
if(TransCode::encode(it.begin, it.end, tmp))
{
res.push_back(tmp);
}
}
}
return true;
}
virtual bool cut(Unicode::const_iterator begin, Unicode::const_iterator end, vector<string>& res)const = 0;
};
}
#endif


@ -1,94 +0,0 @@
/************************************
* file enc : utf-8
* author : wuyanyi09@gmail.com
************************************/
#ifndef CPPJIEBA_TRANSCODE_H
#define CPPJIEBA_TRANSCODE_H
#include "globals.h"
#include "Limonp/str_functs.hpp"
namespace CppJieba
{
using namespace Limonp;
namespace TransCode
{
inline bool decode(const string& str, vector<uint16_t>& vec)
{
char ch1, ch2;
if(str.empty())
{
return false;
}
vec.clear();
size_t siz = str.size();
for(uint i = 0;i < siz;)
{
if(!(str[i] & 0x80)) // 0xxxxxxx
{
vec.push_back(str[i]);
i++;
}
else if ((unsigned char)str[i] <= 0xdf && i + 1 < siz) // 110xxxxxx
{
ch1 = (str[i] >> 2) & 0x07;
ch2 = (str[i+1] & 0x3f) | ((str[i] & 0x03) << 6 );
vec.push_back(twocharToUint16(ch1, ch2));
i += 2;
}
else if((unsigned char)str[i] <= 0xef && i + 2 < siz)
{
ch1 = (str[i] << 4) | ((str[i+1] >> 2) & 0x0f );
ch2 = ((str[i+1]<<6) & 0xc0) | (str[i+2] & 0x3f);
vec.push_back(twocharToUint16(ch1, ch2));
i += 3;
}
else
{
return false;
}
}
return true;
}
inline bool encode(vector<uint16_t>::const_iterator begin, vector<uint16_t>::const_iterator end, string& res)
{
if(begin >= end)
{
return false;
}
res.clear();
uint16_t ui;
while(begin != end)
{
ui = *begin;
if(ui <= 0x7f)
{
res += char(ui);
}
else if(ui <= 0x7ff)
{
res += char(((ui>>6) & 0x1f) | 0xc0);
res += char((ui & 0x3f) | 0x80);
}
else
{
res += char(((ui >> 12) & 0x0f )| 0xe0);
res += char(((ui>>6) & 0x3f )| 0x80 );
res += char((ui & 0x3f) | 0x80);
}
begin ++;
}
return true;
}
inline bool encode(const vector<uint16_t>& sentence, string& res)
{
return encode(sentence.begin(), sentence.end(), res);
}
}
}
#endif
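A round-trip sketch for TransCode::decode / encode above; note the decoder only handles 1- to 3-byte UTF-8 sequences (the BMP), which covers the example string. The include path is assumed:
#include <cstdio>
#include <stdint.h>
#include <string>
#include <vector>
#include "TransCode.hpp" // assumed include path inside the old CppJieba sources

int main() {
    std::string utf8 = "长江大桥";
    std::vector<uint16_t> unicode;
    if(!CppJieba::TransCode::decode(utf8, unicode)) {
        return 1;
    }
    std::string back;
    CppJieba::TransCode::encode(unicode, back);
    std::printf("%lu code points, round-trip ok: %d\n", (unsigned long)unicode.size(), int(back == utf8));
    return 0;
}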


@ -1,390 +0,0 @@
/************************************
* file enc : ASCII
* author : wuyanyi09@gmail.com
************************************/
#include "Trie.h"
namespace CppJieba
{
Trie::Trie()
{
_root = NULL;
_freqSum = 0;
_minLogFreq = MAX_DOUBLE;
_initFlag = false;
}
Trie::~Trie()
{
dispose();
}
bool Trie::init()
{
if(_getInitFlag())
{
LogError("already initted!");
return false;
}
try
{
_root = new TrieNode;
}
catch(const bad_alloc& e)
{
return false;
}
if(NULL == _root)
{
return false;
}
_setInitFlag(true);
return true;
}
bool Trie::loadDict(const char * const filePath)
{
if(!_getInitFlag())
{
LogError("not initted.");
return false;
}
if(!checkFileExist(filePath))
{
LogError("cann't find fiel[%s].",filePath);
return false;
}
bool res = false;
res = _trieInsert(filePath);
if(!res)
{
LogError("_trieInsert failed.");
return false;
}
res = _countWeight();
if(!res)
{
LogError("_countWeight failed.");
return false;
}
return true;
}
bool Trie::_trieInsert(const char * const filePath)
{
ifstream ifile(filePath);
string line;
vector<string> vecBuf;
TrieNodeInfo nodeInfo;
while(getline(ifile, line))
{
vecBuf.clear();
splitStr(line, vecBuf, " ");
if(vecBuf.size() < 2 || 3 < vecBuf.size())
{
LogError("line[%s] illegal.", line.c_str());
return false;
}
if(!TransCode::decode(vecBuf[0], nodeInfo.word))
{
return false;
}
nodeInfo.freq = atoi(vecBuf[1].c_str());
if(3 == vecBuf.size())
{
nodeInfo.tag = vecBuf[2];
}
//insert node
if(!insert(nodeInfo))
{
LogError("insert node failed!");
}
}
return true;
}
bool Trie::dispose()
{
if(!_getInitFlag())
{
return false;
}
bool ret = _deleteNode(_root);
if(!ret)
{
LogFatal("_deleteNode failed!");
return false;
}
_root = NULL;
_nodeInfoVec.clear();
_setInitFlag(false);
return ret;
}
const TrieNodeInfo* Trie::findPrefix(const string& str)const
{
if(!_getInitFlag())
{
LogFatal("trie not initted!");
return NULL;
}
Unicode uintVec;
if(!TransCode::decode(str, uintVec))
{
LogError("TransCode::decode failed.");
return NULL;
}
//find
TrieNode* p = _root;
uint pos = 0;
uint16_t chUni = 0;
const TrieNodeInfo * res = NULL;
for(uint i = 0; i < uintVec.size(); i++)
{
chUni = uintVec[i];
if(p->isLeaf)
{
pos = p->nodeInfoVecPos;
if(pos >= _nodeInfoVec.size())
{
LogFatal("node's nodeInfoVecPos is out of _nodeInfoVec's range");
return NULL;
}
res = &(_nodeInfoVec[pos]);
}
if(p->hmap.find(chUni) == p->hmap.end())
{
break;
}
else
{
p = p->hmap[chUni];
}
}
return res;
}
const TrieNodeInfo* Trie::find(const string& str)const
{
Unicode uintVec;
if(!TransCode::decode(str, uintVec))
{
return NULL;
}
return find(uintVec);
}
const TrieNodeInfo* Trie::find(const Unicode& uintVec)const
{
if(uintVec.empty())
{
return NULL;
}
return find(uintVec.begin(), uintVec.end());
}
const TrieNodeInfo* Trie::find(Unicode::const_iterator begin, Unicode::const_iterator end)const
{
if(!_getInitFlag())
{
LogFatal("trie not initted!");
return NULL;
}
if(begin >= end)
{
return NULL;
}
TrieNode* p = _root;
for(Unicode::const_iterator it = begin; it != end; it++)
{
uint16_t chUni = *it;
if(p->hmap.find(chUni) == p-> hmap.end())
{
return NULL;
}
else
{
p = p->hmap[chUni];
}
}
if(p->isLeaf)
{
uint pos = p->nodeInfoVecPos;
if(pos < _nodeInfoVec.size())
{
return &(_nodeInfoVec[pos]);
}
else
{
LogFatal("node's nodeInfoVecPos is out of _nodeInfoVec's range");
return NULL;
}
}
return NULL;
}
bool Trie::find(const Unicode& unico, vector<pair<uint, const TrieNodeInfo*> >& res)const
{
if(!_getInitFlag())
{
LogFatal("trie not initted!");
return false;
}
TrieNode* p = _root;
//for(Unicode::const_iterator it = begin; it != end; it++)
for(uint i = 0; i < unico.size(); i++)
{
if(p->hmap.find(unico[i]) == p-> hmap.end())
{
break;
}
p = p->hmap[unico[i]];
if(p->isLeaf)
{
uint pos = p->nodeInfoVecPos;
if(pos < _nodeInfoVec.size())
{
res.push_back(make_pair(i, &_nodeInfoVec[pos]));
}
else
{
LogFatal("node's nodeInfoVecPos is out of _nodeInfoVec's range");
return false;
}
}
}
return !res.empty();
}
bool Trie::_deleteNode(TrieNode* node)
{
for(TrieNodeMap::iterator it = node->hmap.begin(); it != node->hmap.end(); it++)
{
TrieNode* next = it->second;
_deleteNode(next);
}
delete node;
return true;
}
bool Trie::insert(const TrieNodeInfo& nodeInfo)
{
if(!_getInitFlag())
{
LogFatal("not initted!");
return false;
}
const Unicode& uintVec = nodeInfo.word;
TrieNode* p = _root;
for(uint i = 0; i < uintVec.size(); i++)
{
uint16_t cu = uintVec[i];
if(NULL == p)
{
return false;
}
if(p->hmap.end() == p->hmap.find(cu))
{
TrieNode * next = NULL;
try
{
next = new TrieNode;
}
catch(const bad_alloc& e)
{
return false;
}
p->hmap[cu] = next;
p = next;
}
else
{
p = p->hmap[cu];
}
}
if(NULL == p)
{
return false;
}
if(p->isLeaf)
{
LogError("this node already inserted");
return false;
}
p->isLeaf = true;
_nodeInfoVec.push_back(nodeInfo);
p->nodeInfoVecPos = _nodeInfoVec.size() - 1;
return true;
}
bool Trie::_countWeight()
{
if(_nodeInfoVec.empty() || 0 != _freqSum)
{
LogError("_nodeInfoVec is empty or _freqSum has been counted already.");
return false;
}
//freq total freq
for(size_t i = 0; i < _nodeInfoVec.size(); i++)
{
_freqSum += _nodeInfoVec[i].freq;
}
if(0 == _freqSum)
{
LogError("_freqSum == 0 .");
return false;
}
//normalize
for(uint i = 0; i < _nodeInfoVec.size(); i++)
{
TrieNodeInfo& nodeInfo = _nodeInfoVec[i];
if(0 == nodeInfo.freq)
{
LogFatal("nodeInfo.freq == 0!");
return false;
}
nodeInfo.logFreq = log(double(nodeInfo.freq)/double(_freqSum));
if(_minLogFreq > nodeInfo.logFreq)
{
_minLogFreq = nodeInfo.logFreq;
}
}
return true;
}
}
#ifdef TRIE_UT
using namespace CppJieba;
int main()
{
Trie trie;
trie.init();
trie.loadDict("../dicts/segdict.gbk.v2.1");
//trie.loadDict("tmp");
cout<<trie.getMinLogFreq()<<endl;
trie.dispose();
return 0;
}
#endif


@ -1,85 +0,0 @@
/************************************
* file enc : ASCII
* author : wuyanyi09@gmail.com
************************************/
#ifndef CPPJIEBA_TRIE_H
#define CPPJIEBA_TRIE_H
#include <iostream>
#include <fstream>
#include <map>
#include <cstring>
#include <stdint.h>
#include <cmath>
#include <limits>
#include "Limonp/str_functs.hpp"
#include "Limonp/logger.hpp"
#include "TransCode.hpp"
#include "globals.h"
#include "structs.h"
namespace CppJieba
{
using namespace Limonp;
struct TrieNode
{
TrieNodeMap hmap;
bool isLeaf;
uint nodeInfoVecPos;
TrieNode()
{
isLeaf = false;
nodeInfoVecPos = 0;
}
};
class Trie
{
private:
TrieNode* _root;
vector<TrieNodeInfo> _nodeInfoVec;
bool _initFlag;
int64_t _freqSum;
double _minLogFreq;
public:
Trie();
~Trie();
bool init();
bool loadDict(const char * const filePath);
bool dispose();
private:
void _setInitFlag(bool on){_initFlag = on;};
bool _getInitFlag()const{return _initFlag;};
public:
const TrieNodeInfo* find(const string& str)const;
const TrieNodeInfo* find(const Unicode& uintVec)const;
const TrieNodeInfo* find(Unicode::const_iterator begin, Unicode::const_iterator end)const;
bool find(const Unicode& unico, vector<pair<uint, const TrieNodeInfo*> >& res)const;
const TrieNodeInfo* findPrefix(const string& str)const;
public:
//double getWeight(const string& str);
//double getWeight(const Unicode& uintVec);
//double getWeight(Unicode::const_iterator begin, Unicode::const_iterator end);
double getMinLogFreq()const{return _minLogFreq;};
//int64_t getTotalCount(){return _freqSum;};
bool insert(const TrieNodeInfo& nodeInfo);
private:
bool _trieInsert(const char * const filePath);
bool _countWeight();
bool _deleteNode(TrieNode* node);
};
}
#endif


@ -1,36 +0,0 @@
/************************************
* file enc : ASCII
* author : wuyanyi09@gmail.com
************************************/
#ifndef CPPJIEBA_GLOBALS_H
#define CPPJIEBA_GLOBALS_H
#include <map>
#include <vector>
#include <string>
#include <sys/types.h>
#include <stdint.h>
//#include <hash_map>
#include <tr1/unordered_map>
//#include <ext/hash_map>
namespace CppJieba
{
using namespace std;
using std::tr1::unordered_map;
//using __gnu_cxx::hash_map;
//using namespace stdext;
//typedefs
typedef std::vector<std::string>::iterator VSI;
typedef std::vector<uint16_t> Unicode;
typedef Unicode::const_iterator UniConIter;
typedef unordered_map<uint16_t, struct TrieNode*> TrieNodeMap;
typedef unordered_map<uint16_t, double> EmitProbMap;
const double MIN_DOUBLE = -3.14e+100;
const double MAX_DOUBLE = 3.14e+100;
enum CHAR_TYPE { CHWORD = 0, DIGIT_OR_LETTER = 1, OTHERS = 2};
}
#endif


@ -1,82 +0,0 @@
#include <iostream>
#include <fstream>
#include "Limonp/ArgvContext.hpp"
#include "MPSegment.h"
#include "HMMSegment.h"
#include "MixSegment.h"
using namespace CppJieba;
void cut(const ISegment * seg, const char * const filePath)
{
ifstream ifile(filePath);
vector<string> res;
string line;
while(getline(ifile, line))
{
if(!line.empty())
{
res.clear();
seg->cut(line, res);
cout<<join(res.begin(), res.end(),"/")<<endl;
}
}
}
int main(int argc, char ** argv)
{
if(argc < 2)
{
cout<<"usage: \n\t"<<argv[0]<<" [options] <filename>\n"
<<"options:\n"
<<"\t--algorithm\tSupported methods are [cutDAG, cutHMM, cutMix] for now. \n\t\t\tIf not specified, the default is cutMix\n"
<<"\t--dictpath\tsee example\n"
<<"\t--modelpath\tsee example\n"
<<"example:\n"
<<"\t"<<argv[0]<<" testlines.utf8 --dictpath dicts/jieba.dict.utf8\n"
<<"\t"<<argv[0]<<" testlines.utf8 --modelpath dicts/hmm_model.utf8 --algorithm cutHMM\n"
<<"\t"<<argv[0]<<" testlines.utf8 --dictpath dicts/jieba.dict.utf8 --modelpath dicts/hmm_model.utf8 --algorithm cutMix\n"
<<endl;
return EXIT_FAILURE;
}
ArgvContext arg(argc, argv);
string dictPath = arg["--dictpath"];
string modelPath = arg["--modelpath"];
string algorithm = arg["--algorithm"];
if("cutHMM" == algorithm)
{
HMMSegment seg;
if(!seg.init(modelPath.c_str()))
{
cout<<"seg init failed."<<endl;
return EXIT_FAILURE;
}
cut(&seg, arg[1].c_str());
seg.dispose();
}
else if("cutDAG" == algorithm)
{
MPSegment seg;
if(!seg.init(dictPath.c_str()))
{
cout<<"seg init failed."<<endl;
return EXIT_FAILURE;
}
cut(&seg, arg[1].c_str());
seg.dispose();
}
else
{
MixSegment seg;
if(!seg.init(dictPath.c_str(), modelPath.c_str()))
{
cout<<"seg init failed."<<endl;
return EXIT_FAILURE;
}
cut(&seg, arg[1].c_str());
seg.dispose();
}
return EXIT_SUCCESS;
}


@ -1,116 +0,0 @@
#include <unistd.h>
#include <algorithm>
#include <string>
#include <ctype.h>
#include <string.h>
#include "Limonp/ArgvContext.hpp"
#include "Limonp/Config.hpp"
#include "Husky/Daemon.h"
#include "Husky/ServerFrame.h"
#include "MPSegment.h"
#include "HMMSegment.h"
#include "MixSegment.h"
using namespace Husky;
using namespace CppJieba;
class ReqHandler: public IRequestHandler
{
private:
string _dictPath;
string _modelPath;
public:
ReqHandler(const string& dictPath, const string& modelPath): _dictPath(dictPath), _modelPath(modelPath){};
virtual ~ReqHandler(){};
virtual bool init(){return _segment.init(_dictPath.c_str(), _modelPath.c_str());};
virtual bool dispose(){return _segment.dispose();};
public:
virtual bool do_GET(const HttpReqInfo& httpReq, string& strSnd)
{
string sentence, tmp;
vector<string> words;
httpReq.GET("key", tmp);
URLDecode(tmp, sentence);
_segment.cut(sentence, words);
strSnd << words;
return true;
}
private:
MixSegment _segment;
};
bool run(int argc, char** argv)
{
if(argc < 2)
{
return false;
}
ArgvContext arg(argc, argv);
if(arg["-c"].empty())
{
return false;
}
Config conf;
if(!conf.loadFile(arg["-c"].c_str()))
{
return false;
}
unsigned int port = 0;
unsigned int threadNum = 0;
string pidFile;
string dictPath;
string modelPath;
string val;
if(!conf.get("port", val))
{
LogFatal("conf get port failed.");
return false;
}
port = atoi(val.c_str());
if(!conf.get("thread_num", val))
{
LogFatal("conf get thread_num failed.");
return false;
}
threadNum = atoi(val.c_str());
if(!conf.get("pid_file", pidFile))
{
LogFatal("conf get pid_file failed.");
return false;
}
if(!conf.get("dict_path", dictPath))
{
LogFatal("conf get dict_path failed.");
return false;
}
if(!conf.get("model_path", modelPath))
{
LogFatal("conf get model_path failed.");
return false;
}
ReqHandler reqHandler(dictPath, modelPath);
ServerFrame sf(port, threadNum, &reqHandler);
Daemon daemon(&sf, pidFile.c_str());
if(arg["-k"] == "start")
{
return daemon.start();
}
else if(arg["-k"] == "stop")
{
return daemon.stop();
}
return false;
}
int main(int argc, char* argv[])
{
if(!run(argc, argv))
{
printf("usage: %s -c <config_file> -k <start|stop>\n", argv[0]);
return EXIT_FAILURE;
}
return EXIT_SUCCESS;
}


@ -1,111 +0,0 @@
#ifndef CPPJIEBA_STRUCTS_H
#define CPPJIEBA_STRUCTS_H
#include <limits>
#include "globals.h"
#include "Trie.h"
#include "TransCode.hpp"
namespace CppJieba
{
struct TrieNodeInfo
{
//string word;
//size_t wLen;// the word's len , not string.length(),
Unicode word;
size_t freq;
string tag;
double logFreq; //logFreq = log(freq/sum(freq));
TrieNodeInfo():freq(0),logFreq(0.0)
{
}
TrieNodeInfo(const TrieNodeInfo& nodeInfo):word(nodeInfo.word), freq(nodeInfo.freq), tag(nodeInfo.tag), logFreq(nodeInfo.logFreq)
{
}
TrieNodeInfo(const Unicode& _word):word(_word),freq(0),logFreq(MIN_DOUBLE)
{
}
string toString()const
{
string tmp;
TransCode::encode(word, tmp);
return string_format("{word:%s,freq:%d, logFreq:%lf}", tmp.c_str(), freq, logFreq);
}
};
typedef unordered_map<uint, const TrieNodeInfo*> DagType;
struct SegmentChar
{
uint16_t uniCh;
DagType dag;
const TrieNodeInfo * pInfo;
double weight;
SegmentChar(uint16_t uni):uniCh(uni), pInfo(NULL), weight(0.0)
{
}
/*const TrieNodeInfo* pInfo;
double weight;
SegmentChar(uint16_t unich, const TrieNodeInfo* p, double w):uniCh(unich), pInfo(p), weight(w)
{
}*/
};
/*
struct SegmentContext
{
vector<SegmentChar> context;
bool getDA
};*/
typedef vector<SegmentChar> SegmentContext;
struct KeyWordInfo: public TrieNodeInfo
{
double idf;
double weight;// log(wLen+1)*logFreq;
KeyWordInfo():idf(0.0),weight(0.0)
{
}
KeyWordInfo(const Unicode& _word):TrieNodeInfo(_word),idf(0.0),weight(0.0)
{
}
KeyWordInfo(const TrieNodeInfo& trieNodeInfo):TrieNodeInfo(trieNodeInfo)
{
}
string toString() const
{
string tmp;
TransCode::encode(word, tmp);
return string_format("{word:%s,weight:%lf, idf:%lf}", tmp.c_str(), weight, idf);
}
KeyWordInfo& operator = (const TrieNodeInfo& trieNodeInfo)
{
word = trieNodeInfo.word;
freq = trieNodeInfo.freq;
tag = trieNodeInfo.tag;
logFreq = trieNodeInfo.logFreq;
return *this;
}
};
inline ostream& operator << (ostream& os, const KeyWordInfo& info)
{
string tmp;
TransCode::encode(info.word, tmp);
return os << "{words:" << tmp << ", weight:" << info.weight << ", idf:" << info.idf << "}";
}
//inline string joinWordInfos(const vector<KeyWordInfo>& vec)
//{
// vector<string> tmp;
// for(uint i = 0; i < vec.size(); i++)
// {
// tmp.push_back(vec[i].toString());
// }
// return joinStr(tmp, ",");
//}
}
#endif

test/CMakeLists.txt Normal file (12 lines)

@ -0,0 +1,12 @@
SET(EXECUTABLE_OUTPUT_PATH ${PROJECT_BINARY_DIR})
# Configure test paths
configure_file("${CMAKE_CURRENT_SOURCE_DIR}/test_paths.h.in" "${CMAKE_BINARY_DIR}/test/test_paths.h")
INCLUDE_DIRECTORIES(
${CMAKE_CURRENT_BINARY_DIR}
${CMAKE_BINARY_DIR}/test
)
ADD_EXECUTABLE(load_test load_test.cpp)
ADD_SUBDIRECTORY(unittest)


@ -1,26 +0,0 @@
#include <ChineseFilter.h>
#ifdef UT
using namespace CppJieba;
int main(int argc, char** argv)
{
ChineseFilter chFilter;
ifstream ifs("../demo/testlines.utf8");
string line;
while(getline(ifs, line))
{
chFilter.feed(line);
for(ChineseFilter::iterator it = chFilter.begin(); it != chFilter.end(); it++)
{
//cout<<__FILE__<<__LINE__<<endl;
string tmp;
TransCode::encode(it.begin, it.end, tmp);
cout<<tmp<<endl;
}
}
return 0;
}
#endif


@ -1 +0,0 @@
curl "http://127.0.0.1:11200/?key=南京市长江大桥"

test/load_test.cpp Normal file (58 lines)

@ -0,0 +1,58 @@
#include <iostream>
#include <ctime>
#include <fstream>
#include "cppjieba/MPSegment.hpp"
#include "cppjieba/HMMSegment.hpp"
#include "cppjieba/MixSegment.hpp"
#include "cppjieba/KeywordExtractor.hpp"
#include "limonp/Colors.hpp"
#include "test_paths.h"
using namespace cppjieba;
void Cut(size_t times = 50) {
MixSegment seg(DICT_DIR "/jieba.dict.utf8", DICT_DIR "/hmm_model.utf8");
vector<string> res;
string doc;
ifstream ifs(TEST_DATA_DIR "/weicheng.utf8");
assert(ifs);
doc << ifs;
long beginTime = clock();
for (size_t i = 0; i < times; i ++) {
printf("process [%3.0lf %%]\r", 100.0*(i+1)/times);
fflush(stdout);
res.clear();
seg.Cut(doc, res);
}
printf("\n");
long endTime = clock();
ColorPrintln(GREEN, "Cut: [%.3lf seconds]time consumed.", double(endTime - beginTime)/CLOCKS_PER_SEC);
}
void Extract(size_t times = 400) {
KeywordExtractor Extractor(DICT_DIR "/jieba.dict.utf8",
DICT_DIR "/hmm_model.utf8",
DICT_DIR "/idf.utf8",
DICT_DIR "/stop_words.utf8");
vector<string> words;
string doc;
ifstream ifs(TEST_DATA_DIR "/review.100");
assert(ifs);
doc << ifs;
long beginTime = clock();
for (size_t i = 0; i < times; i ++) {
printf("process [%3.0lf %%]\r", 100.0*(i+1)/times);
fflush(stdout);
words.clear();
Extractor.Extract(doc, words, 5);
}
printf("\n");
long endTime = clock();
ColorPrintln(GREEN, "Extract: [%.3lf seconds]time consumed.", double(endTime - beginTime)/CLOCKS_PER_SEC);
}
int main(int argc, char ** argv) {
Cut();
Extract();
return EXIT_SUCCESS;
}


@ -1 +0,0 @@
g++ -o segment.demo segment.cpp -std=c++0x -L/usr/lib/CppJieba -lcppjieba


@ -1,60 +0,0 @@
#include <iostream>
#include <fstream>
#include <CppJieba/Limonp/ArgvContext.hpp>
#include <CppJieba/MPSegment.h>
#include <CppJieba/HMMSegment.h>
#include <CppJieba/MixSegment.h>
using namespace CppJieba;
void cut(const ISegment * seg, const char * const filePath)
{
ifstream ifile(filePath);
vector<string> res;
string line;
while(getline(ifile, line))
{
if(!line.empty())
{
res.clear();
seg->cut(line, res);
cout<<join(res.begin(), res.end(),"/")<<endl;
}
}
}
int main(int argc, char ** argv)
{
//demo
{
HMMSegment seg;
if(!seg.init("../dicts/hmm_model.utf8"))
{
cout<<"seg init failed."<<endl;
return EXIT_FAILURE;
}
cut(&seg, "testlines.utf8");
seg.dispose();
}
{
MixSegment seg;
if(!seg.init("../dicts/jieba.dict.utf8", "../dicts/hmm_model.utf8"))
{
cout<<"seg init failed."<<endl;
return EXIT_FAILURE;
}
cut(&seg, "testlines.utf8");
seg.dispose();
}
{
MPSegment seg;
if(!seg.init("../dicts/jieba.dict.utf8"))
{
cout<<"seg init failed."<<endl;
return EXIT_FAILURE;
}
cut(&seg, "testlines.utf8");
seg.dispose();
}
return EXIT_SUCCESS;
}


@ -1,58 +0,0 @@
#include <CppJieba/Husky/ServerFrame.h>
#include <CppJieba/Husky/Daemon.h>
#include <CppJieba/Limonp/ArgvContext.hpp>
#include <CppJieba/MPSegment.h>
#include <CppJieba/HMMSegment.h>
#include <CppJieba/MixSegment.h>
using namespace Husky;
using namespace CppJieba;
const char * const DEFAULT_DICTPATH = "../dicts/jieba.dict.utf8";
const char * const DEFAULT_MODELPATH = "../dicts/hmm_model.utf8";
class ServerDemo: public IRequestHandler
{
public:
ServerDemo(){};
virtual ~ServerDemo(){};
virtual bool init(){return _segment.init(DEFAULT_DICTPATH, DEFAULT_MODELPATH);};
virtual bool dispose(){return _segment.dispose();};
public:
virtual bool do_GET(const HttpReqInfo& httpReq, string& strSnd)
{
string sentence, tmp;
vector<string> words;
httpReq.GET("key", tmp);
URLDecode(tmp, sentence);
_segment.cut(sentence, words);
strSnd << words;
return true;
}
private:
MixSegment _segment;
};
int main(int argc,char* argv[])
{
if(argc != 7)
{
printf("usage: %s -n THREAD_NUMBER -p LISTEN_PORT -k start|stop\n",argv[0]);
return -1;
}
ArgvContext arg(argc, argv);
unsigned int port = atoi(arg["-p"].c_str());
unsigned int threadNum = atoi(arg["-n"].c_str());
ServerDemo s;
Daemon daemon(&s);
if(arg["-k"] == "start")
{
return !daemon.Start(port, threadNum);
}
else
{
return !daemon.Stop();
}
}

test/test_paths.h.in Normal file (7 lines)

@ -0,0 +1,7 @@
#ifndef TEST_PATHS_H
#define TEST_PATHS_H
#define TEST_DATA_DIR "@CMAKE_CURRENT_SOURCE_DIR@/testdata"
#define DICT_DIR "@CMAKE_SOURCE_DIR@/dict"
#endif // TEST_PATHS_H
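For reference, a sketch of the test/test_paths.h that CMake's configure_file would generate from this template, assuming a hypothetical checkout at /home/user/cppjieba:
#ifndef TEST_PATHS_H
#define TEST_PATHS_H
#define TEST_DATA_DIR "/home/user/cppjieba/test/testdata" // expanded from @CMAKE_CURRENT_SOURCE_DIR@/testdata
#define DICT_DIR "/home/user/cppjieba/dict"               // expanded from @CMAKE_SOURCE_DIR@/dict
#endif // TEST_PATHS_H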

test/testdata/curl.res vendored Normal file (1 line)

@ -0,0 +1 @@
["南京市", "长江大桥"]

File diff suppressed because it is too large.

test/testdata/jieba.dict.0.1.utf8 vendored Normal file (93 lines)

@ -0,0 +1,93 @@
龙鸣狮吼 3 nr
龙齐诺 2 nr
龙齿 3 n
龚 176 nr
龚世萍 2 nr
龚书铎 2 nr
龚二人 2 nr
龚云甫 3 nr
龚伟强 5 nr
龚先生 4 nr
龚光杰 44 nr
龚古尔 24 nr
龚子敬 2 nr
龚孝升 12 nr
龚学平 2 nr
龚完敬 5 nr
龚定庵 3 nr
龚定敬 2 nr
龚宝铨 5 nr
龚家村 3 nr
龚建国 29 nr
龚德俊 6 nr
龚心瀚 3 nr
龚志国 2 nr
龚意田 3 nr
龚慈恩 3 nr
龚施茜 3 nr
龚晓犁 2 nr
龚普洛 3 nr
龚智超 7 nr
龚松林 10 nr
龚永明 3 nr
龚永泉 5 nr
龚泽艺 256 nr
龚睿 8 nrfg
龚祖同 2 nr
龚秋婷 3 nr
龚老爷 2 nr
龚育之 19 nr
龚自珍 28 nr
龚蓓苾 3 nr
龚虹嘉 3 nr
龚诗嘉 3 nr
龛 223 ng
龜 2 zg
龟 903 ns
龟儿子 123 n
龟兆 3 nz
龟兹 215 ns
龟兹王 3 nrt
龟冷搘床 3 v
龟冷支床 3 n
龟卜 3 n
龟厌不告 3 l
龟壳 33 n
龟壳花 3 n
龟头 34 n
龟头炎 3 n
龟山 23 ns
龟山乡 3 ns
龟山岛 3 ns
龟年鹤寿 3 ns
龟年鹤算 3 l
龟文 3 nz
龟文写迹 3 n
龟文鸟迹 3 n
龟板 10 n
龟毛免角 3 n
龟毛兔角 3 n
龟溪 3 ns
龟玉 3 nz
龟王 3 nz
龟甲 92 ns
龟甲胶 3 nz
龟筮 3 n
龟纹 3 n
龟缩 29 v
龟肉 3 n
龟背 21 n
龟背竹 3 n
龟苓膏 3 n
龟苗 3 n
龟裂 34 v
龟足 5 v
龟鉴 2 n
龟镜 3 nz
龟鳖 3 n
龟鹤遐寿 3 l
龟龄鹤算 3 n
龟龙片甲 3 nz
龟龙麟凤 3 ns
龠 5 g
龢 732 zg

test/testdata/jieba.dict.0.utf8 vendored Normal file (93 lines)

@ -0,0 +1,93 @@
龙鸣狮吼 3 nr
龙齐诺 2 nr
龙齿 3 n
龚 176 nr
龚世萍 2 nr
龚书铎 2 nr
龚二人 2 nr
龚云甫 3 nr
龚伟强 5 nr
龚先生 4 nr
龚光杰 44 nr
龚古尔 24 nr
龚子敬 2 nr
龚孝升 12 nr
龚学平 2 nr
龚完敬 5 nr
龚定庵 3 nr
龚定敬 2 nr
龚宝铨 5 nr
龚家村 3 nr
龚建国 29 nr
龚德俊 6 nr
龚心瀚 3 nr
龚志国 2 nr
龚意田 3 nr
龚慈恩 3 nr
龚施茜 3 nr
龚晓犁 2 nr
龚普洛 3 nr
龚智超 7 nr
龚松林 10 nr
龚永明 3 nr
龚永泉 5 nr
龚泽艺 256 nr
龚睿 8 nrfg
龚祖同 2 nr
龚秋婷 3 nr
龚老爷 2 nr
龚育之 19 nr
龚自珍 28 nr
龚蓓苾 3 nr
龚虹嘉 3 nr
龚诗嘉 3 nr
龛 223 ng
龜 2 zg
龟 903 ns
龟儿子 123 n
龟兆 3 nz
龟兹 215 ns
龟兹王 3 nrt
龟冷搘床 3 v
龟冷支床 3 n
龟卜 3 n
龟厌不告 3 l
龟壳 33 n
龟壳花 3 n
龟头 34 n
龟头炎 3 n
龟山 23 ns
龟山乡 3 ns
龟山岛 3 ns
龟年鹤寿 3 ns
龟年鹤算 3 l
龟文 3 nz
龟文写迹 3 n
龟文鸟迹 3 n
龟板 10 n
龟毛免角 3 n
龟毛兔角 3 n
龟溪 3 ns
龟玉 3 nz
龟王 3 nz
龟甲 92 ns
龟甲胶 3 nz
龟筮 3 n
龟纹 3 n
龟缩 29 v
龟肉 3 n
龟背 21 n
龟背竹 3 n
龟苓膏 3 n
龟苗 3 n
龟裂 34 v
龟足 5 v
龟鉴 2 n
龟镜 3 nz
龟鳖 3 n
龟鹤遐寿 3 l
龟龄鹤算 3 n
龟龙片甲 3 nz
龟龙麟凤 3 ns
龠 5 g
龢 732 zg

test/testdata/jieba.dict.1.utf8 vendored Normal file (67 lines)

@ -0,0 +1,67 @@
AT&T 3 nz
B超 3 n
c# 3 nz
C# 3 nz
c++ 3 nz
C++ 3 nz
T恤 4 n
一 217830 m
一一 1670 m
一一二 11 m
一一例 3 m
一一分 8 m
一一列举 34 i
一一对 9 m
一一对应 43 l
一一记 2 m
一一道来 4 l
一丁 18 d
一丁不识 3 i
一丁点 3 m
一丁点儿 24 m
一七 22 m
一七八不 3 l
一万 442 m
一万一千 4 m
一万一千五百二十颗 2 m
一万一千八百八十斤 2 m
一万一千多间 2 m
一万一千零九十五册 4 m
一万七千 5 m
一万七千余 2 m
一万七千多 2 m
一万七千多户 2 m
一万万 4 m
一万万两 4 m
一万三千 8 m
一万三千五百一十七 2 m
一万三千五百斤 4 m
一万三千余种 2 m
一万三千块 2 m
一万两 124 m
一万两万 4 m
一万两千 3 m
一万个 62 m
一万九千 2 m
一万九千余 2 m
一万二 10 m
一万二千 7 m
一万二千两 2 m
一万二千五百 4 m
一万二千五百一十二 2 m
一万二千五百余 2 m
一万二千五百余吨 2 m
一万二千亩 2 m
一万二千余 2 m
一万二千六百八十二箱 2 m
一万二千名 3 m
一万二千里 3 m
一万五 6 m
一万五千 45 m
一万五千一百四十四卷 2 m
一万五千两 4 m
一万五千个 2 m
一万五千二百余 2 m
一万五千余 9 m
一万五千元 3 m
一万五千名 4 m

test/testdata/jieba.dict.2.utf8 vendored Normal file (64 lines)

@ -0,0 +1,64 @@
一万七千 5 m
一万七千余 2 m
一万七千多 2 m
一万七千多户 2 m
一万万 4 m
一万万两 4 m
一万三千 8 m
一万三千五百一十七 2 m
一万三千五百斤 4 m
一万三千余种 2 m
一万三千块 2 m
一万两 124 m
一万两万 4 m
一万两千 3 m
一万个 62 m
一万九千 2 m
一万九千余 2 m
一万二 10 m
一万二千 7 m
一万二千两 2 m
一万二千五百 4 m
一万二千五百一十二 2 m
一万二千五百余 2 m
一万二千五百余吨 2 m
一万二千亩 2 m
一万二千余 2 m
一万二千六百八十二箱 2 m
一万二千名 3 m
一万二千里 3 m
一万五 6 m
一万五千 45 m
一万五千一百四十四卷 2 m
一万五千两 4 m
一万五千个 2 m
一万五千二百余 2 m
一万五千余 9 m
一万五千元 3 m
一万五千名 4 m
一万五千多 2 m
一万五千家 2 m
一万亿 3 m
一万亿美元 5 m
一万余 41 m
一万余吨 2 m
一万余顷 2 m
一万倍 14 m
一万元 61 m
一万八 5 m
一万八千 7 m
一万八千余 8 m
一万八千多元 2 m
一万公里 2 m
一万六千 5 m
一万六千三百户 2 m
一万六千余户 2 m
一万六千多 3 m
一万册 2 m
一万刀 7 m
一万匹 4 m
一万卷 2 m
一万双 6 m
一万发 2 m
一万句 11 m
一万只 9 m

test/testdata/load_test.urls vendored Normal file (2 lines)

@ -0,0 +1,2 @@
http://127.0.0.1:11200/?key=南京市长江大桥
http://127.0.0.1:11200/?key=长春市长春药店

test/testdata/review.100 vendored Normal file (100 lines)

@ -0,0 +1,100 @@
标&#12288;&#12288;签:保湿还不错比商场便宜补水效果好乳液很好用是正品心&#12288;&#12288;得:感觉还蛮好吸收的,不错啦
标&#12288;&#12288;签:还可以心&#12288;&#12288;得:不错~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
标&#12288;&#12288;签:是正品心&#12288;&#12288;得:下次我还要咋京东这里买不错
标&#12288;&#12288;签:挺保湿的心&#12288;&#12288;得:价格实惠,适合夏天用,很轻薄
标&#12288;&#12288;签:皮肤滑滑的味道不错挺保湿的很好用物流速度快心&#12288;&#12288;得:使用的挺好的一直用着这个的
标&#12288;&#12288;签:价格实惠比商场便宜心&#12288;&#12288;得:不错不错,活动买的很划算
标&#12288;&#12288;签:吸收快品牌好是正品挺保湿的心&#12288;&#12288;得一直使用3年值得信赖好用
标&#12288;&#12288;签:是正品皮肤滑滑的补水效果好乳液很好用心&#12288;&#12288;得:不错不错老婆很喜欢我值
标&#12288;&#12288;签:保湿还不错心&#12288;&#12288;得:挺好的。。。。。。。。。。
标&#12288;&#12288;签:是正品很好用心&#12288;&#12288;得:一直在京东买,可以信赖
标&#12288;&#12288;签:是正品挺保湿的效果不错心&#12288;&#12288;得:送货快!是正品,大品牌的用的放心!
标&#12288;&#12288;签:乳液很好用心&#12288;&#12288;得:很好的东东,下次还会买
心&#12288;&#12288;得:送同学的,希望她喜欢
标&#12288;&#12288;签:价格实惠心&#12288;&#12288;得:一直用,还可以吧,性价比高
心&#12288;&#12288;得:不错够速度,效果也不错,希望大家用着也一样,顶顶顶
标&#12288;&#12288;签:挺保湿的心&#12288;&#12288;得:用着还不错。挺好的。
优&#12288;&#12288;点:东西很好哦!不&#12288;&#12288;足:暂时还没有发现缺点哦!心&#12288;&#12288;得:很好,也很划算
标&#12288;&#12288;签:脸上很舒服是正品心&#12288;&#12288;得:哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈和
优&#12288;&#12288;点:用了一下,感觉还不错不&#12288;&#12288;足:暂时还没有发现缺点哦!心&#12288;&#12288;得:用了一下,还可以
标&#12288;&#12288;签:品牌好心&#12288;&#12288;得:东西还行,就是线太少了
标&#12288;&#12288;签:还可以老婆买的心&#12288;&#12288;得:代买的,据说还不错,搞优惠屯着。
标&#12288;&#12288;签:保湿还不错很好用心&#12288;&#12288;得:一直在用这个,现在继续。
标&#12288;&#12288;签:很好用心&#12288;&#12288;得:正品,方便好用,比店里便宜
标&#12288;&#12288;签:保湿还不错妈妈买的比商场便宜挺保湿的吸收快心&#12288;&#12288;得:可以先去专柜试试~然后再京东上购买,由京东的发票,还是比较放心的~
心&#12288;&#12288;得:很好很滋润又不油
标&#12288;&#12288;签:吸收快脸上很舒服保湿还不错很好用比商场便宜心&#12288;&#12288;得用过几瓶了http://club.jd.com/JdVote/TradeComment.aspx?ruleid=586763684&ot=0#none感觉很不错不油腻吸收快还保湿。
标&#12288;&#12288;签:还可以心&#12288;&#12288;得:一般吧,还没怎么用。现在不知道效果。
标&#12288;&#12288;签:还可以心&#12288;&#12288;得:东西很不错。很好。很喜欢!
标&#12288;&#12288;签:比商场便宜价格实惠心&#12288;&#12288;得:一直都在用,没有刺激,很舒服,价格合适
标&#12288;&#12288;签:包装好服务好比商场便宜皮肤滑滑的很好用心&#12288;&#12288;得:送货速度也很快!非常好,质量不错,推荐购买!包装很好!
标&#12288;&#12288;签:吸收快服务好心&#12288;&#12288;得:质量不错,值得信赖,网购上京东,放心又轻松!
标&#12288;&#12288;签:味道不错吸收快心&#12288;&#12288;得:不油腻,味道也不错,美白效果嘛暂时没有,毕竟只用了几次而已。
标&#12288;&#12288;签:还可以价格实惠心&#12288;&#12288;得:还不错,促销活动买的.........
标&#12288;&#12288;签:挺保湿的效果不错脸上很舒服很好用心&#12288;&#12288;得帮朋友买的她觉得非常不错继续关注ZA
标&#12288;&#12288;签:乳液很好用心&#12288;&#12288;得:比较清爽,补水效果并不是很好,夏天用用吧
标&#12288;&#12288;签:是正品补水效果好还可以心&#12288;&#12288;得:补水效果不错,很好用
优&#12288;&#12288;点:东西很好哦!不&#12288;&#12288;足:暂时还没有发现缺点哦!心&#12288;&#12288;得:一直在用,信任京东,感觉不错,下次再来。。
标&#12288;&#12288;签:皮肤滑滑的味道不错价格实惠保湿还不错乳液很好用心&#12288;&#12288;得:用的很好的下次还会购买
标&#12288;&#12288;签:很好用皮肤滑滑的心&#12288;&#12288;得:好用啊,一如既往的好用
心&#12288;&#12288;得:买了以后就知道不后悔的呢
心&#12288;&#12288;得:非常满意,五星
心&#12288;&#12288;得:非常满意,五星
心&#12288;&#12288;得:非常满意,五星
心&#12288;&#12288;得:宝贝很喜欢,连作业都不肯做,在那儿看呢,呵呵
心&#12288;&#12288;得:非常满意,五星
心&#12288;&#12288;得:非常满意,五星
心&#12288;&#12288;得:非常满意,五星
心&#12288;&#12288;得:非常满意,五星
标&#12288;&#12288;签:服务好很好用心&#12288;&#12288;得:不错,正品,还会继续关注
标&#12288;&#12288;签:乳液很好用心&#12288;&#12288;得:比较滋润还不错。。。。。。。。。。
标&#12288;&#12288;签:品牌好心&#12288;&#12288;得:送货快,还没有用,具体效果还不清楚
标&#12288;&#12288;签:很好用心&#12288;&#12288;得:一直用这个,在京东买方便。
标&#12288;&#12288;签:保湿还不错包装好脸上很舒服吸收快物流速度快心&#12288;&#12288;得:必须要说的是,这是我老婆自己买的。
标&#12288;&#12288;签:效果不错心&#12288;&#12288;得:一直用这个存货中**************
标&#12288;&#12288;签:很好用心&#12288;&#12288;得:还可以,常规的东东。.
标&#12288;&#12288;签:包装好乳液很好用补水效果好物流速度快价格实惠心&#12288;&#12288;得:挺好的,脸上不紧绷,舒服
标&#12288;&#12288;签:物流速度快价格实惠心&#12288;&#12288;得:应该是正品吧,价格比超市便宜些。正在使用中
标&#12288;&#12288;签:还可以心&#12288;&#12288;得:挺滋润的,价钱也合适!
标&#12288;&#12288;签:是正品效果不错心&#12288;&#12288;得:用过以后效果挺好的,不错是正品
标&#12288;&#12288;签:很好用比商场便宜心&#12288;&#12288;得:用这个产品一年了,比较认可。
标&#12288;&#12288;签:保湿还不错心&#12288;&#12288;得:第一次用乳液,感觉还不错
标&#12288;&#12288;签:价格实惠心&#12288;&#12288;得:便宜,东西还行吧,用着不习惯,感觉有酒精
标&#12288;&#12288;签:价格实惠包装好心&#12288;&#12288;得:看牌子买的,先试着用用看效果
心&#12288;&#12288;得:配套用的不错个人觉得
标&#12288;&#12288;签:味道刺激心&#12288;&#12288;得:不怎么样,用后脸上会起红点
标&#12288;&#12288;签:挺保湿的物流速度快比商场便宜品牌好心&#12288;&#12288;得:正品,平价,比商场便宜,物流很快。
标&#12288;&#12288;签:服务好心&#12288;&#12288;得还没有使用过就发现YMX只要79元我哭为什么京东价格拼不过YMX呀~~~
标&#12288;&#12288;签:挺保湿的心&#12288;&#12288;得:第一次购买,用了感觉还不错
标&#12288;&#12288;签:服务好物流速度快脸上很舒服心&#12288;&#12288;得:刚送到家。。用用在发表好坏。
心&#12288;&#12288;得:还没用看看包装蛮好的晒&#12288;&#12288;单共3张图片查看晒单>
标&#12288;&#12288;签:品牌好价格实惠脸上很舒服味道不错心&#12288;&#12288;得:防晒,不油腻,还可以使皮肤稍稍增白些,
标&#12288;&#12288;签:价格实惠保湿还不错心&#12288;&#12288;得:东西好用,分不清楚是不是正品。
标&#12288;&#12288;签:服务好乳液很好用心&#12288;&#12288;得:乳液还是不错的用用不错的
标&#12288;&#12288;签:物流速度快效果不错心&#12288;&#12288;得:常用这个,夏天用,美白效果还好
标&#12288;&#12288;签:还可以心&#12288;&#12288;得:不错
标&#12288;&#12288;签:价格实惠比商场便宜服务好心&#12288;&#12288;得:真的还不错而且价格也实惠快递速度
标&#12288;&#12288;签:比商场便宜脸上很舒服很好用物流速度快是正品心&#12288;&#12288;得:京东就是好一日既往的好
活动时购买的很划算,用下来觉得还可以吧,等用完了才能知道有没有效果吧。反正很划算,随便用用看
新能真皙美白乳液很好用,有美白的效果,吸收也很快,搞活动买的,比外面便宜好多~~~~~
三八妇女节买的Z的产品随便用用可以的。女人要对自己好一点。
标&#12288;&#12288;签:是正品挺保湿的心&#12288;&#12288;得好东东ZA我的最爱。
优&#12288;&#12288;点:没有让这次的尝试失望不&#12288;&#12288;足:货运慢,慢,慢心&#12288;&#12288;得:很舒适,用的不错
标&#12288;&#12288;签:挺保湿的心&#12288;&#12288;得:一直用还可以~~~~~~~~~~~~~~~~
很滋润效果好味道接受
朋友推荐,醇润型,有点稠,我是混合型皮肤,很好吸收,不粘腻
乳液很适合,价格比商场便宜
效果挺好的滋润保湿了味道清淡
瓶子盖子都有刮痕了是不是都用过了啊。以前也在卓越买过za的其他化妆品都还算满意。这一次真觉得很恶心以后不会在这买了
好用不知道是不是正品啊
很好用
za乳液不够滋润全新但是怎么没有密封
还不错,一直在用
妈妈收到了
商品的包装居然坏了,像是被拆开过的
蛮滋润的
很润,很好用。味道也不错!
还可以
挺好的,这个用上也不是很油腻..
纯度不够。
这个给婆婆买的,我就用过几次,但感觉挺滋润

test/testdata/review.100.res vendored Normal file (200 lines)

@ -0,0 +1,200 @@
标&#12288;&#12288;签:保湿还不错比商场便宜补水效果好乳液很好用是正品心&#12288;&#12288;得:感觉还蛮好吸收的,不错啦
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "保湿", "还", "不错", "比", "商场", "便宜", "补水", "效果", "好", "乳液", "很", "好", "用", "是", "正品", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "感觉", "还", "蛮", "好", "吸收", "的", "", "不错", "啦"]
标&#12288;&#12288;签:还可以心&#12288;&#12288;得:不错~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "还", "可以", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "不错", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~"]
标&#12288;&#12288;签:是正品心&#12288;&#12288;得:下次我还要咋京东这里买不错
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "是", "正品", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "下次", "我", "还要", "咋", "京东", "这里", "买", "不错"]
标&#12288;&#12288;签:挺保湿的心&#12288;&#12288;得:价格实惠,适合夏天用,很轻薄
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "挺", "保湿", "的", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "价格", "实惠", "", "适合", "夏天", "用", "", "很", "轻薄"]
标&#12288;&#12288;签:皮肤滑滑的味道不错挺保湿的很好用物流速度快心&#12288;&#12288;得:使用的挺好的一直用着这个的
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "皮肤", "滑", "滑", "的", "味道", "不错", "挺", "保湿", "的", "很", "好", "用", "物流", "速度", "快", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "使用", "的", "挺", "好", "的", "一直", "用", "着", "这个", "的"]
标&#12288;&#12288;签:价格实惠比商场便宜心&#12288;&#12288;得:不错不错,活动买的很划算
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "价格", "实惠", "比", "商场", "便宜", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "不错", "不错", "", "活动", "买", "的", "很", "划算"]
标&#12288;&#12288;签:吸收快品牌好是正品挺保湿的心&#12288;&#12288;得一直使用3年值得信赖好用
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "吸收", "快", "品牌", "好", "是", "正品", "挺", "保湿", "的", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "一直", "使用", "3", "年", "", "值得", "信赖", "", "好", "用"]
标&#12288;&#12288;签:是正品皮肤滑滑的补水效果好乳液很好用心&#12288;&#12288;得:不错不错老婆很喜欢我值
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "是", "正品", "皮肤", "滑", "滑", "的", "补水", "效果", "好", "乳液", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "不错", "不错", "老婆", "很", "喜欢", "我", "值"]
标&#12288;&#12288;签:保湿还不错心&#12288;&#12288;得:挺好的。。。。。。。。。。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "保湿", "还", "不错", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "挺", "好", "的", "。", "。", "。", "。", "。", "。", "。", "。", "。", "。"]
标&#12288;&#12288;签:是正品很好用心&#12288;&#12288;得:一直在京东买,可以信赖
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "是", "正品", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "一直", "在", "京东", "买", "", "可以", "信赖"]
标&#12288;&#12288;签:是正品挺保湿的效果不错心&#12288;&#12288;得:送货快!是正品,大品牌的用的放心!
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "是", "正品", "挺", "保湿", "的", "效果", "不错", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "送货", "快", "", "是", "正品", "", "大", "品牌", "的", "用", "的", "放心", ""]
标&#12288;&#12288;签:乳液很好用心&#12288;&#12288;得:很好的东东,下次还会买
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "乳液", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "很", "好", "的", "东东", "", "下次", "还", "会", "买"]
心&#12288;&#12288;得:送同学的,希望她喜欢
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "送", "同学", "的", "", "希望", "她", "喜欢"]
标&#12288;&#12288;签:价格实惠心&#12288;&#12288;得:一直用,还可以吧,性价比高
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "价格", "实惠", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "一直", "用", "", "还", "可以", "吧", "", "性价比", "高"]
心&#12288;&#12288;得:不错够速度,效果也不错,希望大家用着也一样,顶顶顶
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "不错", "够", "速度", "", "效果", "也", "不错", "", "希望", "大家", "用", "着", "也", "一样", "", "顶", "顶", "顶"]
标&#12288;&#12288;签:挺保湿的心&#12288;&#12288;得:用着还不错。挺好的。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "挺", "保湿", "的", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "用", "着", "还", "不错", "。", "挺", "好", "的", "。"]
优&#12288;&#12288;点:东西很好哦!不&#12288;&#12288;足:暂时还没有发现缺点哦!心&#12288;&#12288;得:很好,也很划算
["优", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "点", "", "东西", "很", "好", "哦", "!", "不", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "足", "", "暂时", "还", "没有", "发现", "缺点", "哦", "", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "很", "好", "", "也", "很", "划算"]
标&#12288;&#12288;签:脸上很舒服是正品心&#12288;&#12288;得:哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈和
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "脸上", "很", "舒服", "是", "正品", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "哈哈哈", "哈哈哈", "哈哈哈", "哈哈哈", "哈哈哈", "和"]
优&#12288;&#12288;点:用了一下,感觉还不错不&#12288;&#12288;足:暂时还没有发现缺点哦!心&#12288;&#12288;得:用了一下,还可以
["优", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "点", "", "用", "了", "一下", "", "感觉", "还", "不错", "不", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "足", "", "暂时", "还", "没有", "发现", "缺点", "哦", "", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "用", "了", "一下", "", "还", "可以"]
标&#12288;&#12288;签:品牌好心&#12288;&#12288;得:东西还行,就是线太少了
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "品牌", "好心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "东西", "还", "行", "", "就是", "线", "太", "少", "了"]
标&#12288;&#12288;签:还可以老婆买的心&#12288;&#12288;得:代买的,据说还不错,搞优惠屯着。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "还", "可以", "老婆", "买", "的", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "代", "买", "的", "", "据说", "还", "不错", "", "搞", "优惠", "屯", "着", "。"]
标&#12288;&#12288;签:保湿还不错很好用心&#12288;&#12288;得:一直在用这个,现在继续。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "保湿", "还", "不错", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "一直", "在", "用", "这个", "", "现在", "继续", "。"]
标&#12288;&#12288;签:很好用心&#12288;&#12288;得:正品,方便好用,比店里便宜
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "正品", "", "方便", "好", "用", "", "比", "店里", "便宜"]
标&#12288;&#12288;签:保湿还不错妈妈买的比商场便宜挺保湿的吸收快心&#12288;&#12288;得:可以先去专柜试试~然后再京东上购买,由京东的发票,还是比较放心的~
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "保湿", "还", "不错", "妈妈", "买", "的", "比", "商场", "便宜", "挺", "保湿", "的", "吸收", "快", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "可以", "先", "去", "专柜", "试试", "~", "然后", "再", "京东", "上", "购买", "", "由", "京东", "的", "发票", "", "还是", "比较", "放心", "的", "~"]
心&#12288;&#12288;得:很好很滋润又不油
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "很", "好", "很", "滋润", "又", "不", "油"]
标&#12288;&#12288;签:吸收快脸上很舒服保湿还不错很好用比商场便宜心&#12288;&#12288;得用过几瓶了http://club.jd.com/JdVote/TradeComment.aspx?ruleid=586763684&ot=0#none感觉很不错不油腻吸收快还保湿。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "吸收", "快", "脸上", "很", "舒服", "保湿", "还", "不错", "很", "好", "用", "比", "商场", "便宜", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "用", "过", "几瓶", "了", "h", "t", "t", "p", ":", "/", "/", "c", "l", "u", "b", ".", "j", "d", ".", "c", "o", "m", "/", "J", "d", "V", "o", "t", "e", "/", "T", "r", "a", "d", "e", "C", "o", "m", "m", "e", "n", "t", ".", "a", "s", "p", "x", "?", "r", "u", "l", "e", "i", "d", "=", "5", "8", "6", "7", "6", "3", "6", "8", "4", "&", "o", "t", "=", "0", "#", "n", "o", "n", "e", "", "感觉", "很", "不错", "", "不", "油腻", "", "吸收", "快", "", "还", "保湿", "。"]
标&#12288;&#12288;签:还可以心&#12288;&#12288;得:一般吧,还没怎么用。现在不知道效果。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "还", "可以", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "一般", "吧", "", "还", "没", "怎么", "用", "。", "现在", "不", "知道", "效果", "。"]
标&#12288;&#12288;签:还可以心&#12288;&#12288;得:东西很不错。很好。很喜欢!
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "还", "可以", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "东西", "很", "不错", "。", "很", "好", "。", "很", "喜欢", ""]
标&#12288;&#12288;签:比商场便宜价格实惠心&#12288;&#12288;得:一直都在用,没有刺激,很舒服,价格合适
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "比", "商场", "便宜", "价格", "实惠", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "一直", "都", "在", "用", "", "没有", "刺激", "", "很", "舒服", "", "价格", "合适"]
标&#12288;&#12288;签:包装好服务好比商场便宜皮肤滑滑的很好用心&#12288;&#12288;得:送货速度也很快!非常好,质量不错,推荐购买!包装很好!
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "包装", "好", "服务", "好比", "商场", "便宜", "皮肤", "滑", "滑", "的", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "送货", "速度", "也", "很快", "", "非常", "好", "", "质量", "不错", "", "推荐", "购买", "", "包装", "很", "好", ""]
标&#12288;&#12288;签:吸收快服务好心&#12288;&#12288;得:质量不错,值得信赖,网购上京东,放心又轻松!
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "吸收", "快", "服务", "好心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "质量", "不错", "", "值得", "信赖", "", "网", "购", "上", "京东", "", "放心", "又", "轻松", ""]
标&#12288;&#12288;签:味道不错吸收快心&#12288;&#12288;得:不油腻,味道也不错,美白效果嘛暂时没有,毕竟只用了几次而已。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "味道", "不错", "吸收", "快", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "不", "油腻", "", "味道", "也", "不错", "", "美", "白", "效果", "嘛", "暂时", "没有", "", "毕竟", "只用", "了", "几次", "而已", "。"]
标&#12288;&#12288;签:还可以价格实惠心&#12288;&#12288;得:还不错,促销活动买的.........
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "还", "可以", "价格", "实惠", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "还", "不错", "", "促销", "活动", "买", "的", ".", ".", ".", ".", ".", ".", ".", ".", "."]
标&#12288;&#12288;签:挺保湿的效果不错脸上很舒服很好用心&#12288;&#12288;得帮朋友买的她觉得非常不错继续关注ZA
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "挺", "保湿", "的", "效果", "不错", "脸上", "很", "舒服", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "帮", "朋友", "买", "的", "", "她", "觉得", "非常", "不错", "", "继续", "关注", "Z", "A"]
标&#12288;&#12288;签:乳液很好用心&#12288;&#12288;得:比较清爽,补水效果并不是很好,夏天用用吧
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "乳液", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "比较", "清爽", "", "补水", "效果", "并", "不是", "很", "好", "", "夏天", "用", "用", "吧"]
标&#12288;&#12288;签:是正品补水效果好还可以心&#12288;&#12288;得:补水效果不错,很好用
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "是", "正品", "补水", "效果", "好", "还", "可以", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "补水", "效果", "不错", "", "很", "好", "用"]
优&#12288;&#12288;点:东西很好哦!不&#12288;&#12288;足:暂时还没有发现缺点哦!心&#12288;&#12288;得:一直在用,信任京东,感觉不错,下次再来。。
["优", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "点", "", "东西", "很", "好", "哦", "", "不", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "足", "", "暂时", "还", "没有", "发现", "缺点", "哦", "", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "一直", "在", "用", "", "信任", "京东", "", "感觉", "不错", "", "下次", "再", "来", "。", "。"]
标&#12288;&#12288;签:皮肤滑滑的味道不错价格实惠保湿还不错乳液很好用心&#12288;&#12288;得:用的很好的下次还会购买
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "皮肤", "滑", "滑", "的", "味道", "不错", "价格", "实惠", "保湿", "还", "不错", "乳液", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "用", "的", "很", "好", "的", "下次", "还", "会", "购买"]
标&#12288;&#12288;签:很好用皮肤滑滑的心&#12288;&#12288;得:好用啊,一如既往的好用
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "很", "好", "用", "皮肤", "滑", "滑", "的", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "好", "用", "啊", "", "一如既往", "的", "好", "用"]
心&#12288;&#12288;得:买了以后就知道不后悔的呢
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "买", "了", "以后", "就", "知道", "不", "后悔", "的", "呢"]
心&#12288;&#12288;得:非常满意,五星
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "非常", "满意", "", "五星"]
心&#12288;&#12288;得:非常满意,五星
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "非常", "满意", "", "五星"]
心&#12288;&#12288;得:非常满意,五星
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "非常", "满意", "", "五星"]
心&#12288;&#12288;得:宝贝很喜欢,连作业都不肯做,在那儿看呢,呵呵
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "宝贝", "很", "喜欢", "", "连", "作业", "都", "不肯", "做", "", "在", "那儿", "看", "呢", "", "呵呵"]
心&#12288;&#12288;得:非常满意,五星
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "非常", "满意", "", "五星"]
心&#12288;&#12288;得:非常满意,五星
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "非常", "满意", "", "五星"]
心&#12288;&#12288;得:非常满意,五星
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "非常", "满意", "", "五星"]
心&#12288;&#12288;得:非常满意,五星
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "非常", "满意", "", "五星"]
标&#12288;&#12288;签:服务好很好用心&#12288;&#12288;得:不错,正品,还会继续关注
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "服务", "好", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "不错", "", "正品", "", "还", "会", "继续", "关注"]
标&#12288;&#12288;签:乳液很好用心&#12288;&#12288;得:比较滋润还不错。。。。。。。。。。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "乳液", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "比较", "滋润", "还", "不错", "。", "。", "。", "。", "。", "。", "。", "。", "。", "。"]
标&#12288;&#12288;签:品牌好心&#12288;&#12288;得:送货快,还没有用,具体效果还不清楚
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "品牌", "好心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "送货", "快", "", "还", "没有", "用", "", "具体", "效果", "还", "不", "清楚"]
标&#12288;&#12288;签:很好用心&#12288;&#12288;得:一直用这个,在京东买方便。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "一直", "用", "这个", "", "在", "京东", "买", "方便", "。"]
标&#12288;&#12288;签:保湿还不错包装好脸上很舒服吸收快物流速度快心&#12288;&#12288;得:必须要说的是,这是我老婆自己买的。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "保湿", "还", "不错", "包装", "好", "脸上", "很", "舒服", "吸收", "快", "物流", "速度", "快", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "必须", "要说", "的", "是", "", "这", "是", "我", "老婆", "自己", "买", "的", "。"]
标&#12288;&#12288;签:效果不错心&#12288;&#12288;得:一直用这个存货中**************
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "效果", "不错", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "一直", "用", "这个", "存货", "中", "*", "*", "*", "*", "*", "*", "*", "*", "*", "*", "*", "*", "*", "*"]
标&#12288;&#12288;签:很好用心&#12288;&#12288;得:还可以,常规的东东。.
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "还", "可以", "", "常规", "的", "东东", "。", "."]
标&#12288;&#12288;签:包装好乳液很好用补水效果好物流速度快价格实惠心&#12288;&#12288;得:挺好的,脸上不紧绷,舒服
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "包装", "好", "乳液", "很", "好", "用", "补水", "效果", "好", "物流", "速度", "快", "价格", "实惠", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "挺", "好", "的", "", "脸上", "不", "紧", "绷", "", "舒服"]
标&#12288;&#12288;签:物流速度快价格实惠心&#12288;&#12288;得:应该是正品吧,价格比超市便宜些。正在使用中
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "物流", "速度", "快", "价格", "实惠", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "应该", "是", "正品", "吧", "", "价格比", "超市", "便宜", "些", "。", "正在", "使用", "中"]
标&#12288;&#12288;签:还可以心&#12288;&#12288;得:挺滋润的,价钱也合适!
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "还", "可以", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "挺", "滋润", "的", "", "价钱", "也", "合适", ""]
标&#12288;&#12288;签:是正品效果不错心&#12288;&#12288;得:用过以后效果挺好的,不错是正品
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "是", "正品", "效果", "不错", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "用", "过", "以后", "效果", "挺", "好", "的", "", "不错", "是", "正品"]
标&#12288;&#12288;签:很好用比商场便宜心&#12288;&#12288;得:用这个产品一年了,比较认可。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "很", "好", "用", "比", "商场", "便宜", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "用", "这个", "产品", "一年", "了", "", "比较", "认可", "。"]
标&#12288;&#12288;签:保湿还不错心&#12288;&#12288;得:第一次用乳液,感觉还不错
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "保湿", "还", "不错", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "第一次", "用", "乳液", "", "感觉", "还", "不错"]
标&#12288;&#12288;签:价格实惠心&#12288;&#12288;得:便宜,东西还行吧,用着不习惯,感觉有酒精
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "价格", "实惠", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "便宜", "", "东西", "还", "行", "吧", "", "用", "着", "不", "习惯", "", "感觉", "有", "酒精"]
标&#12288;&#12288;签:价格实惠包装好心&#12288;&#12288;得:看牌子买的,先试着用用看效果
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "价格", "实惠", "包装", "好心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "看", "牌子", "买", "的", "", "先", "试", "着", "用", "用", "看", "效果"]
心&#12288;&#12288;得:配套用的不错个人觉得
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "配套", "用", "的", "不错", "个人", "觉得"]
标&#12288;&#12288;签:味道刺激心&#12288;&#12288;得:不怎么样,用后脸上会起红点
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "味道", "刺激", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "不怎么样", "", "用", "后", "脸上", "会", "起", "红", "点"]
标&#12288;&#12288;签:挺保湿的物流速度快比商场便宜品牌好心&#12288;&#12288;得:正品,平价,比商场便宜,物流很快。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "挺", "保湿", "的", "物流", "速度", "快", "比", "商场", "便宜", "品牌", "好心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "正品", "", "平价", "", "比", "商场", "便宜", "", "物流", "很快", "。"]
标&#12288;&#12288;签:服务好心&#12288;&#12288;得还没有使用过就发现YMX只要79元我哭为什么京东价格拼不过YMX呀~~~
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "服务", "好心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "还", "没有", "使用", "过", "", "就", "发现", "Y", "M", "X", "只要", "7", "9", "元", "", "我", "哭", "", "为什么", "京东", "价格", "拼", "不过", "Y", "M", "X", "呀", "~", "~", "~"]
标&#12288;&#12288;签:挺保湿的心&#12288;&#12288;得:第一次购买,用了感觉还不错
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "挺", "保湿", "的", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "第一次", "购买", ",", "用", "了", "感觉", "还", "不错"]
标&#12288;&#12288;签:服务好物流速度快脸上很舒服心&#12288;&#12288;得:刚送到家。。用用在发表好坏。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "服务", "好", "物流", "速度", "快", "脸上", "很", "舒服", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "刚", "送到", "家", "。", "。", "用", "用", "在", "发表", "好坏", "。"]
心&#12288;&#12288;得:还没用看看包装蛮好的晒&#12288;&#12288;单共3张图片查看晒单>
["心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "还", "没用", "看看", "包装", "蛮", "好", "的", "晒", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "单", "", "共", "3", "张", "图片", "查看", "晒", "单", ">"]
标&#12288;&#12288;签:品牌好价格实惠脸上很舒服味道不错心&#12288;&#12288;得:防晒,不油腻,还可以使皮肤稍稍增白些,
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "品牌", "好", "价格", "实惠", "脸上", "很", "舒服", "味道", "不错", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "防晒", "", "不", "油腻", "", "还", "可以", "使", "皮肤", "稍稍", "增白", "些", ""]
标&#12288;&#12288;签:价格实惠保湿还不错心&#12288;&#12288;得:东西好用,分不清楚是不是正品。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "价格", "实惠", "保湿", "还", "不错", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "东西", "好", "用", "", "分", "不", "清楚", "是不是", "正品", "。"]
标&#12288;&#12288;签:服务好乳液很好用心&#12288;&#12288;得:乳液还是不错的用用不错的
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "服务", "好", "乳液", "很", "好", "用心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "乳液", "还是", "不错", "的", "用", "用", "不错", "的"]
标&#12288;&#12288;签:物流速度快效果不错心&#12288;&#12288;得:常用这个,夏天用,美白效果还好
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "物流", "速度", "快", "效果", "不错", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "常用", "这个", "", "夏天", "用", "", "美", "白", "效果", "还好"]
标&#12288;&#12288;签:还可以心&#12288;&#12288;得:不错
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "还", "可以", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "不错"]
标&#12288;&#12288;签:价格实惠比商场便宜服务好心&#12288;&#12288;得:真的还不错而且价格也实惠快递速度
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "价格", "实惠", "比", "商场", "便宜", "服务", "好心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "真的", "还", "不错", "而且", "价格", "也", "实惠", "快递", "速度"]
标&#12288;&#12288;签:比商场便宜脸上很舒服很好用物流速度快是正品心&#12288;&#12288;得:京东就是好一日既往的好
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "比", "商场", "便宜", "脸上", "很", "舒服", "很", "好", "用", "物流", "速度", "快", "是", "正品", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "京东", "就是", "好", "一日", "既往", "的", "好"]
活动时购买的很划算,用下来觉得还可以吧,等用完了才能知道有没有效果吧。反正很划算,随便用用看
["活动", "时", "购买", "的", "很", "划算", "", "用", "下来", "觉得", "还", "可以", "吧", "", "等", "用", "完", "了", "才能", "知道", "有没有", "效果", "吧", "。", "反正", "很", "划算", "", "随便", "用", "用", "看"]
新能真皙美白乳液很好用,有美白的效果,吸收也很快,搞活动买的,比外面便宜好多~~~~~
["新", "能", "真", "皙", "美", "白", "乳液", "很", "好", "用", "", "有", "美", "白", "的", "效果", "", "吸收", "也", "很快", "", "搞", "活动", "买", "的", "", "比", "外面", "便宜", "好多", "~", "~", "~", "~", "~"]
三八妇女节买的Z的产品随便用用可以的。女人要对自己好一点。
["三八妇女节", "买", "的", "", "Z", "的", "产品", "随便", "用", "用", "可以", "的", "。", "女人", "要", "对", "自己", "好", "一点", "。"]
标&#12288;&#12288;签:是正品挺保湿的心&#12288;&#12288;得好东东ZA我的最爱。
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "是", "正品", "挺", "保湿", "的", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "好", "东东", "", "Z", "A", "我", "的", "最", "爱", "。"]
优&#12288;&#12288;点:没有让这次的尝试失望不&#12288;&#12288;足:货运慢,慢,慢心&#12288;&#12288;得:很舒适,用的不错
["优", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "点", "", "没有", "让", "这次", "的", "尝试", "失望", "不", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "足", "", "货运", "慢", "", "慢", "", "慢", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "很", "舒适", "", "用", "的", "不错"]
标&#12288;&#12288;签:挺保湿的心&#12288;&#12288;得:一直用还可以~~~~~~~~~~~~~~~~
["标", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "签", "", "挺", "保湿", "的", "心", "&", "#", "1", "2", "2", "8", "8", ";", "&", "#", "1", "2", "2", "8", "8", ";", "得", "", "一直", "用", "还", "可以", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~", "~"]
很滋润效果好味道接受
["很", "滋润", "效果", "好", "味道", "接受"]
朋友推荐,醇润型,有点稠,我是混合型皮肤,很好吸收,不粘腻
["朋友", "推荐", "", "醇", "润", "型", "", "有点", "稠", "", "我", "是", "混合型", "皮肤", "", "很", "好", "吸收", "", "不", "粘", "腻"]
乳液很适合,价格比商场便宜
["乳液", "很", "适合", "", "价格比", "商场", "便宜"]
效果挺好的滋润保湿了味道清淡
["效果", "挺", "好", "的", "滋润", "保湿", "了", "味道", "清淡"]
瓶子盖子都有刮痕了是不是都用过了啊。以前也在卓越买过za的其他化妆品都还算满意。这一次真觉得很恶心以后不会在这买了
["瓶子", "盖子", "都", "有", "刮", "痕", "了", "", "是不是", "都", "用", "过", "了", "啊", "。", "以前", "也", "在", "卓越", "买", "过", "z", "a", "的", "其他", "化妆品", "", "都", "还", "算", "满意", "。", "这", "一次", "真", "觉得", "很", "恶心", "", "以后", "不会", "在", "这", "买", "了"]
好用不知道是不是正品啊
["好", "用", "不", "知道", "是不是", "正品", "啊"]
很好用
["很", "好", "用"]
za乳液不够滋润全新但是怎么没有密封
["z", "a", "乳液", "不够", "滋润", "", "全新", "但是", "怎么", "没有", "密封", ""]
还不错,一直在用
["还", "不错", "", "一直", "在", "用"]
妈妈收到了
["妈妈", "收到", "了"]
商品的包装居然坏了,像是被拆开过的
["商品", "的", "包装", "居然", "坏", "了", "", "像是", "被", "拆开", "过", "的"]
蛮滋润的
["蛮", "滋润", "的"]
很润,很好用。味道也不错!
["很", "润", "", "很", "好", "用", "。", "味道", "也", "不错", ""]
还可以
["还", "可以"]
挺好的,这个用上也不是很油腻..
["挺", "好", "的", "", "这个", "用", "上", "也", "不是", "很", "油腻", ".", "."]
纯度不够。
["纯度", "不够", "。"]
这个给婆婆买的,我就用过几次,但感觉挺滋润
["这个", "给", "婆婆", "买", "的", "", "我", "就", "用", "过", "几次", "", "但", "感觉", "挺", "滋润"]

19
test/testdata/server.conf vendored Normal file

@ -0,0 +1,19 @@
# config
#socket listen port
port=11200
thread_number=4
queue_max_size=4096
#dict path
dict_path=../dict/jieba.dict.utf8
#model path
model_path=../dict/hmm_model.utf8
user_dict_path=../dict/user.dict.utf8
idf_path=../dict/idf.utf8
stop_words_path=../dict/stop_words.utf8
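The file is a plain key=value list with '#' comment lines, and its paths point at the UTF-8 dictionary set shipped in dict/. A minimal, self-contained sketch of parsing such a file (LoadConf is a hypothetical helper for illustration, not the server's actual loader):

  // Hypothetical parser for a server.conf-style key=value file.
  #include <fstream>
  #include <iostream>
  #include <map>
  #include <string>

  std::map<std::string, std::string> LoadConf(const std::string& path) {
    std::map<std::string, std::string> kv;
    std::ifstream ifs(path);
    std::string line;
    while (std::getline(ifs, line)) {
      if (line.empty() || line[0] == '#') continue;   // skip blanks and comments
      std::size_t eq = line.find('=');
      if (eq == std::string::npos) continue;          // ignore malformed lines
      kv[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return kv;
  }

  int main() {
    std::map<std::string, std::string> conf = LoadConf("test/testdata/server.conf");
    std::cout << "port=" << conf["port"]
              << " dict_path=" << conf["dict_path"] << std::endl;
    return 0;
  }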

9
test/testdata/testlines.gbk vendored Normal file

@ -0,0 +1,9 @@
我来到北京清华大学
他来到了网易杭研大厦
杭研
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
我来自北京邮电大学。。。学号091111xx。。。
来这里看看别人正在搜索什么吧
我来到南京市长江大桥
请在一米线外等候
人事处女干事
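These nine test sentences are stored in GBK rather than UTF-8 (the dictionaries above are all *.utf8), so input like this would normally be transcoded before segmentation. A minimal sketch using POSIX iconv, purely illustrative and not necessarily how the repo's own tests handle the file:

  // Sketch only: read GBK lines and print them re-encoded as UTF-8.
  #include <iconv.h>
  #include <fstream>
  #include <iostream>
  #include <string>
  #include <vector>

  std::string GbkToUtf8(const std::string& gbk) {
    iconv_t cd = iconv_open("UTF-8", "GBK");
    if (cd == (iconv_t)-1) return "";
    std::vector<char> out(gbk.size() * 4 + 4);
    char* inbuf = const_cast<char*>(gbk.data());
    size_t inleft = gbk.size();
    char* outbuf = &out[0];
    size_t outleft = out.size();
    if (iconv(cd, &inbuf, &inleft, &outbuf, &outleft) == (size_t)-1) {
      iconv_close(cd);
      return "";
    }
    iconv_close(cd);
    return std::string(&out[0], out.size() - outleft);
  }

  int main() {
    std::ifstream ifs("test/testdata/testlines.gbk");
    std::string gbk_line;
    while (std::getline(ifs, gbk_line)) {
      std::cout << GbkToUtf8(gbk_line) << std::endl;  // each line, now UTF-8
    }
    return 0;
  }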

Some files were not shown because too many files have changed in this diff.