Compare commits

...

34 Commits

Author SHA1 Message Date
Yanyi Wu
294755fab1 build: refine CMakeLists.txt by removing unnecessary conditions and options
- Eliminated the default installation prefix condition to streamline the configuration.
- Simplified the test build logic by ensuring tests are enabled only for top-level projects.
- Cleaned up redundant code for better readability and maintainability.
2025-05-03 07:43:25 +08:00
Yanyi Wu
714a297823 build: update CMakeLists.txt to include additional directories for test configuration
- Added include directories for the current binary and test directories to improve test file accessibility.
- Ensured proper configuration for test paths in the build process.
2025-05-02 23:47:37 +08:00
Yanyi Wu
c14131e3e2 refactor: clean up load_test.cpp by removing unused dependencies and tests
- Removed unused Jieba test and associated includes from load_test.cpp.
- Simplified main function to focus on essential operations.
- Ensured consistent exit handling by returning EXIT_SUCCESS.
2025-05-02 23:41:53 +08:00
Yanyi Wu
9cd64a1694 build: enhance test configuration and path management
- Added configuration for test paths in CMake to simplify file references.
- Updated load_test.cpp and various unit tests to use defined path macros for dictionary and test data files.
- Introduced test_paths.h.in to manage directory paths consistently across tests.
2025-05-02 23:33:18 +08:00
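A minimal sketch of the `test_paths.h.in` approach this commit describes, with assumed placeholder names (the repository's actual variable names may differ): CMake fills in absolute directory paths at configure time, and tests include the generated header instead of hard-coding relative paths.

```cmake
# test/test_paths.h.in (hypothetical content):
#   #define TEST_DATA_DIR "@PROJECT_SOURCE_DIR@/test/testdata"
#   #define DICT_DIR      "@PROJECT_SOURCE_DIR@/dict"
#
# test/CMakeLists.txt: generate the header into the build tree,
# then make it visible to the test sources.
configure_file(
  ${CMAKE_CURRENT_SOURCE_DIR}/test_paths.h.in
  ${CMAKE_CURRENT_BINARY_DIR}/test_paths.h
  @ONLY)
include_directories(${CMAKE_CURRENT_BINARY_DIR})
```

A unit test can then open `DICT_DIR "/jieba.dict.utf8"` regardless of the working directory ctest runs it from.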
Yanyi Wu
aa410a69bb build: simplify test configuration in CMakeLists.txt
- Removed conditional check for MSVC when adding test commands.
- Ensured that test commands are always added regardless of the compiler.
2025-05-02 21:39:18 +08:00
Yanyi Wu
b5dc8e7a35 build: update .gitignore and CMakeLists for test configuration
- Added entries to .gitignore for temporary test files.
- Included a message to display MSVC value during build.
- Added UTF-8 compile option for MSVC in unittest CMakeLists.
2025-05-02 21:28:28 +08:00
Yanyi Wu
8141d8f434
Merge pull request #200 from yanyiwu/dev
fix: remove outdated entry from jieba dictionary
2025-05-02 17:31:29 +08:00
yanyiwu
9d8af2116e build: update CI workflow to include latest OS versions 2025-05-02 11:53:33 +08:00
yanyiwu
2185315643 fix: remove outdated entry from jieba dictionary 2025-05-02 11:38:31 +08:00
yanyiwu
340de007f9 docs: update README.md 2025-04-13 18:59:44 +08:00
yanyiwu
940ea02eb4 deps: upgrade limonp from v1.0.0 to v1.0.1 2025-04-12 17:54:01 +08:00
yanyiwu
3732abc0e5 docs: update CHANGELOG for v5.5.0 2025-04-12 10:07:40 +08:00
yanyiwu
9cda7f33e8 build: upgrade googletest from 1.11.0 to 1.12.1 2025-04-12 10:02:10 +08:00
Yanyi Wu
338603b676
Merge pull request #196 from ahmadov/ahmadov/fix-ns-2
avoid implicit namespaces
2025-04-11 08:59:41 +08:00
Elmi Ahmadov
d93dda397c avoid implicit namespaces
This PR fixes the ambiguous `partial_sort` in KeywordExtractor.hpp.
We also have a definition for it, and the compiler is confused about which
implementation should be used. To fix it, we can use the `std` namespace
explicitly.

Also, use the `std` namespace for the other data structures and include
their headers.
2025-04-10 19:10:05 +02:00
Yanyi Wu
7730deee52
Merge pull request #195 from ahmadov/ahmadov/fix-ns
fix missing includes and make namespaces explicit
2025-04-10 23:01:18 +08:00
Elmi Ahmadov
588860b5b6 fix missing includes and make namespaces explicit 2025-04-10 16:11:20 +02:00
Yanyi Wu
0523949aa8
Update stale-issues.yml 2025-04-05 17:26:58 +08:00
Yanyi Wu
b11fd29697
Update README.md 2025-03-08 17:33:48 +08:00
yanyiwu
15b8086a2a Add CMake workflow for Windows ARM64 builds
This commit introduces a new GitHub Actions workflow for building and testing CMake projects on Windows ARM64. The workflow includes steps for checking out the repository, configuring CMake with multiple C++ standards, building the project, and running tests. This enhancement supports continuous integration for ARM64 architecture, improving the project's build versatility.
2025-01-18 20:58:17 +08:00
yanyiwu
1d74caf705 Update CMake minimum version requirement to 3.10 2025-01-18 20:47:06 +08:00
Yanyi Wu
0c7c5228d0
Update README.md 2025-01-17 23:47:09 +08:00
yanyiwu
016fc17575 Improve error logging for UTF-8 decoding failures across cppjieba components. Updated error messages in DictTrie, PosTagger, PreFilter, and SegmentBase to provide clearer context on the specific input causing the failure. This change enhances the debugging experience when handling UTF-8 encoded strings. 2024-12-08 17:26:28 +08:00
yanyiwu
39fc58f081 Remove macOS 12 from CI workflow in cmake.yml 2024-12-08 17:03:39 +08:00
yanyiwu
42a93a4b98 Refactor decoding functions to use UTF-8 compliant methods
Updated multiple files to replace instances of DecodeRunesInString with DecodeUTF8RunesInString, ensuring proper handling of UTF-8 encoded strings. This change enhances the robustness of string decoding across the cppjieba library, including updates in DictTrie, HMMModel, PosTagger, PreFilter, SegmentBase, and Unicode files. Additionally, corresponding unit tests have been modified to reflect these changes.
2024-12-08 16:46:24 +08:00
yanyiwu
5ee74d788e [stale-issues] Monthly on the 3rd day of the month at midnight 2024-11-03 17:22:28 +08:00
yanyiwu
9b45e084a3 v5.4.0 2024-09-22 10:02:53 +08:00
yanyiwu
aa1def5ddb class Jieba unittest: add default argument input 2024-09-22 09:43:04 +08:00
yanyiwu
732812cdfb class Jieba: support default dictpath 2024-09-22 09:38:31 +08:00
yanyiwu
6e167a30dd cmake: avoid testing when FetchContent by other project 2024-09-22 00:25:23 +08:00
yanyiwu
5ef74f335a Revert "cmake: enable windows/msvc test"
This reverts commit 63392627552b018ea018848c82965c263b0030fa.
2024-09-21 23:58:59 +08:00
yanyiwu
6339262755 cmake: enable windows/msvc test 2024-09-21 21:49:56 +08:00
yanyiwu
cc58d4f858 DictTrie: removed unused var 2024-09-21 21:29:55 +08:00
yanyiwu
dbebc7cacb cmake: enable windows/msvc test 2024-09-21 21:10:53 +08:00
28 changed files with 430 additions and 310 deletions

.github/workflows/cmake-arm64.yml (new file)

@@ -0,0 +1,40 @@
name: CMake Windows ARM64

on:
  push:
  pull_request:
  workflow_dispatch:

env:
  BUILD_TYPE: Release

jobs:
  build-windows-arm64:
    runs-on: windows-2022
    strategy:
      matrix:
        cpp_version: [11, 14, 17, 20]
    steps:
    - name: Check out repository code
      uses: actions/checkout@v2
      with:
        submodules: recursive

    - name: Configure CMake
      # Configure CMake in a 'build' subdirectory. `CMAKE_BUILD_TYPE` is only
      # required if you are using a single-configuration generator such as make.
      # See https://cmake.org/cmake/help/latest/variable/CMAKE_BUILD_TYPE.html
      run: cmake -B ${{github.workspace}}/build -DBUILD_TESTING=ON -DCMAKE_CXX_STANDARD=${{matrix.cpp_version}} -DCMAKE_BUILD_TYPE=${{env.BUILD_TYPE}}

    - name: Build
      # Build your program with the given configuration
      run: cmake --build ${{github.workspace}}/build --config ${{env.BUILD_TYPE}}

    - name: Test
      working-directory: ${{github.workspace}}/build
      # Execute tests defined by the CMake configuration.
      # See https://cmake.org/cmake/help/latest/manual/ctest.1.html for more detail
      run: ctest -C ${{env.BUILD_TYPE}} --verbose

.github/workflows/cmake.yml

@@ -16,14 +16,15 @@ jobs:
     runs-on: ${{ matrix.os }}
     strategy:
       matrix:
         os: [
-          ubuntu-20.04,
           ubuntu-22.04,
+          ubuntu-latest,
-          macos-12,
           macos-13,
           macos-14,
+          macos-latest,
           windows-2019,
           windows-2022,
+          windows-latest,
         ]
         cpp_version: [11, 14, 17, 20]
@@ -48,5 +49,5 @@ jobs:
       working-directory: ${{github.workspace}}/build
       # Execute tests defined by the CMake configuration.
       # See https://cmake.org/cmake/help/latest/manual/ctest.1.html for more detail
-      run: ctest -C ${{env.BUILD_TYPE}}
+      run: ctest -C ${{env.BUILD_TYPE}} --verbose

.github/workflows/stale-issues.yml

@@ -2,7 +2,8 @@ name: Close Stale Issues
 on:
   schedule:
-    - cron: '0 0 * * 0' # Run weekly on Sunday at midnight
+    - cron: '0 0 3 */3 *' # Every three months on the 3rd day at midnight
 jobs:
   stale:
.gitignore

@@ -15,3 +15,5 @@ tmp
 t.*
 *.pid
 build
+Testing/Temporary/CTestCostData.txt
+Testing/Temporary/LastTest.log

CHANGELOG.md

@@ -1,5 +1,20 @@
 # CHANGELOG

+## v5.5.0
+
++ feat: add Windows ARM64 build support
++ build: upgrade googletest from 1.11.0 to 1.12.1
++ build: update CMake minimum version requirement to 3.10
++ fix: make namespaces explicit and fix missing includes
++ ci: update stale-issues workflow configuration
+
+## v5.4.0
+
++ unittest: class Jieba add default argument input
++ class Jieba: support default dictpath
++ cmake: avoid testing when FetchContent by other project
++ class DictTrie: removed unused var
+
 ## v5.3.2

 + removed test/demo.cpp and linked https://github.com/yanyiwu/cppjieba-demo

CMakeLists.txt

@@ -1,25 +1,31 @@
-CMAKE_MINIMUM_REQUIRED (VERSION 3.5)
+CMAKE_MINIMUM_REQUIRED (VERSION 3.10)
 PROJECT(CPPJIEBA)
 INCLUDE_DIRECTORIES(${PROJECT_SOURCE_DIR}/deps/limonp/include
     ${PROJECT_SOURCE_DIR}/include)
-if (CMAKE_INSTALL_PREFIX_INITIALIZED_TO_DEFAULT)
-  set (CMAKE_INSTALL_PREFIX "/usr/local/cppjieba" CACHE PATH "default install path" FORCE )
-endif()
 if(NOT DEFINED CMAKE_CXX_STANDARD)
   set(CMAKE_CXX_STANDARD 11)
 endif()
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
 set(CMAKE_CXX_EXTENSIONS OFF)
-ADD_DEFINITIONS(-O3 -Wall -g)
-ADD_SUBDIRECTORY(test)
-ENABLE_TESTING()
-if(NOT MSVC)
-ADD_TEST(NAME ./test/test.run COMMAND ./test/test.run)
-ADD_TEST(NAME ./load_test COMMAND ./load_test)
-endif()
+ADD_DEFINITIONS(-O3 -g)
+# Define a variable to check if this is the top-level project
+if(NOT DEFINED CPPJIEBA_TOP_LEVEL_PROJECT)
+  if(CMAKE_CURRENT_SOURCE_DIR STREQUAL CMAKE_SOURCE_DIR)
+    set(CPPJIEBA_TOP_LEVEL_PROJECT ON)
+  else()
+    set(CPPJIEBA_TOP_LEVEL_PROJECT OFF)
+  endif()
+endif()
+if(CPPJIEBA_TOP_LEVEL_PROJECT)
+  ENABLE_TESTING()
+  message(STATUS "MSVC value: ${MSVC}")
+  ADD_SUBDIRECTORY(test)
+  ADD_TEST(NAME ./test/test.run COMMAND ./test/test.run)
+  ADD_TEST(NAME ./load_test COMMAND ./load_test)
+endif()

README.md

@@ -10,22 +10,25 @@
 CppJieba是"结巴(Jieba)"中文分词的C++版本

-## 特性
+### 主要特点

-+ 源代码都写进头文件`include/cppjieba/*.hpp`里,`include`即可使用。
-+ 支持`utf8`编码。
-+ 项目自带较为完善的单元测试,核心功能中文分词(utf8)的稳定性接受过线上环境检验。
-+ 支持载自定义用户词典,多路径时支持分隔符'|'或者';'分隔。
-+ 支持 `Linux` , `Mac OSX`, `Windows` 操作系统。
+- 🚀 高性能:经过线上环境验证的稳定性和性能表现
+- 📦 易集成:源代码以头文件形式提供 (`include/cppjieba/*.hpp`),包含即可使用
+- 🔍 多种分词模式:支持精确模式、全模式、搜索引擎模式等
+- 📚 自定义词典:支持用户自定义词典,支持多词典路径(使用'|'或';'分隔)
+- 💻 跨平台:支持 Linux、macOS、Windows 操作系统
+- 🌈 UTF-8:原生支持 UTF-8 编码的中文处理

-## 用法
+## 快速开始

-### 依赖软件
+### 环境要求

-* `g++ (version >= 4.1 is recommended) or clang++`;
-* `cmake (version >= 2.6 is recommended)`;
+- C++ 编译器:
+  - g++ (推荐 4.1 以上版本)
+  - 或 clang++
+- cmake (推荐 2.6 以上版本)

-### 下载和编译
+### 安装步骤

 ```sh
 git clone https://github.com/yanyiwu/cppjieba.git
@@ -36,15 +39,11 @@ mkdir build
 cd build
 cmake ..
 make
-```
-
-有兴趣的可以跑跑测试(可选):
-
-```
 make test
 ```

-## Demo
+## 使用示例

 ```
 ./demo
@@ -210,71 +209,37 @@ For more details, please see [demo](https://github.com/yanyiwu/cppjieba-demo).
 + [dict.367W.utf8] iLife(562193561 at qq.com)

-## 应用
-
-+ [GoJieba] go语言版本的结巴中文分词。
-+ [NodeJieba] Node.js 版本的结巴中文分词。
-+ [simhash] 中文文档的相似度计算
-+ [exjieba] Erlang 版本的结巴中文分词。
-+ [jiebaR] R语言版本的结巴中文分词。
-+ [cjieba] C语言版本的结巴分词。
-+ [jieba_rb] Ruby 版本的结巴分词。
-+ [iosjieba] iOS 版本的结巴分词。
-+ [SqlJieba] MySQL 全文索引的结巴中文分词插件。
-+ [pg_jieba] PostgreSQL 数据库的分词插件。
-+ [simple] SQLite3 FTS5 数据库的分词插件。
-+ [gitbook-plugin-search-pro] 支持中文搜索的 gitbook 插件。
-+ [ngx_http_cppjieba_module] Nginx 分词插件。
-+ [cppjiebapy] 由 [jannson] 开发的供 python 模块调用的项目 [cppjiebapy], 相关讨论 [cppjiebapy_discussion] .
-+ [cppjieba-py] 由 [bung87] 基于 pybind11 封装的 python 模块,使用体验上接近于原jieba。
-+ [KeywordServer] 50行搭建一个中文关键词抽取服务。
-+ [cppjieba-server] CppJieba HTTP 服务器。
-+ [phpjieba] php版本的结巴分词扩展。
-+ [perl5-jieba] Perl版本的结巴分词扩展。
-+ [jieba-dlang] D 语言的结巴分词 Deimos Bindings。
-
-## 性能评测
-
-[Jieba中文分词系列性能评测]
-
-## Sponsorship
-
-[![sponsorship](http://images.gitads.io/cppjieba)](https://tracking.gitads.io/?campaign=gitads&repo=cppjieba&redirect=gitads.io)
-
-## Contributors
-
-### Code Contributors
-
-This project exists thanks to all the people who contribute.
-<a href="https://github.com/yanyiwu/cppjieba/graphs/contributors"><img src="https://opencollective.com/cppjieba/contributors.svg?width=890&button=false" /></a>
-
-[GoJieba]:https://github.com/yanyiwu/gojieba
-[CppJieba]:https://github.com/yanyiwu/cppjieba
-[jannson]:https://github.com/jannson
-[cppjiebapy]:https://github.com/jannson/cppjiebapy
-[bung87]:https://github.com/bung87
-[cppjieba-py]:https://github.com/bung87/cppjieba-py
-[cppjiebapy_discussion]:https://github.com/yanyiwu/cppjieba/issues/1
-[NodeJieba]:https://github.com/yanyiwu/nodejieba
-[jiebaR]:https://github.com/qinwf/jiebaR
-[simhash]:https://github.com/yanyiwu/simhash
-[代码详解]:https://github.com/yanyiwu/cppjieba/wiki/CppJieba%E4%BB%A3%E7%A0%81%E8%AF%A6%E8%A7%A3
-[issue25]:https://github.com/yanyiwu/cppjieba/issues/25
-[exjieba]:https://github.com/falood/exjieba
-[KeywordServer]:https://github.com/yanyiwu/keyword_server
-[ngx_http_cppjieba_module]:https://github.com/yanyiwu/ngx_http_cppjieba_module
-[dict.367W.utf8]:https://github.com/qinwf/BigDict
-[cjieba]:http://github.com/yanyiwu/cjieba
-[jieba_rb]:https://github.com/altkatz/jieba_rb
-[iosjieba]:https://github.com/yanyiwu/iosjieba
-[SqlJieba]:https://github.com/yanyiwu/sqljieba
-[Jieba中文分词系列性能评测]:http://yanyiwu.com/work/2015/06/14/jieba-series-performance-test.html
-[pg_jieba]:https://github.com/jaiminpan/pg_jieba
-[gitbook-plugin-search-pro]:https://plugins.gitbook.com/plugin/search-pro
-[cppjieba-server]:https://github.com/yanyiwu/cppjieba-server
-[phpjieba]:https://github.com/jonnywang/phpjieba
-[perl5-jieba]:https://metacpan.org/pod/distribution/Lingua-ZH-Jieba/lib/Lingua/ZH/Jieba.pod
-[jieba-dlang]:https://github.com/shove70/jieba
-[simple]:https://github.com/wangfenjin/simple
+## 生态系统
+
+CppJieba 已经被广泛应用于各种编程语言的分词实现中:
+
+- [GoJieba](https://github.com/yanyiwu/gojieba) - Go 语言版本
+- [NodeJieba](https://github.com/yanyiwu/nodejieba) - Node.js 版本
+- [CJieba](https://github.com/yanyiwu/cjieba) - C 语言版本
+- [jiebaR](https://github.com/qinwf/jiebaR) - R 语言版本
+- [exjieba](https://github.com/falood/exjieba) - Erlang 版本
+- [jieba_rb](https://github.com/altkatz/jieba_rb) - Ruby 版本
+- [iosjieba](https://github.com/yanyiwu/iosjieba) - iOS 版本
+- [phpjieba](https://github.com/jonnywang/phpjieba) - PHP 版本
+- [perl5-jieba](https://metacpan.org/pod/distribution/Lingua-ZH-Jieba/lib/Lingua/ZH/Jieba.pod) - Perl 版本
+
+### 应用项目
+
+- [simhash](https://github.com/yanyiwu/simhash) - 中文文档相似度计算
+- [pg_jieba](https://github.com/jaiminpan/pg_jieba) - PostgreSQL 分词插件
+- [gitbook-plugin-search-pro](https://plugins.gitbook.com/plugin/search-pro) - Gitbook 中文搜索插件
+- [ngx_http_cppjieba_module](https://github.com/yanyiwu/ngx_http_cppjieba_module) - Nginx 分词插件
+
+## 贡献指南
+
+我们欢迎各种形式的贡献,包括但不限于:
+
+- 提交问题和建议
+- 改进文档
+- 提交代码修复
+- 添加新功能
+
+如果您觉得 CppJieba 对您有帮助,欢迎 star ⭐️ 支持项目!

deps/limonp

@@ -1 +1 @@
-Subproject commit ac32f1f287f65d5ce0ce295010c88026fae060ee
+Subproject commit 5c82a3f17e4e0adc6a5decfe245054b0ed533d1a

dict/jieba.dict.utf8

@@ -312698,7 +312698,6 @@ T恤 4 n
 部属 1126 n
 部属工作 3 n
 部属院校 3 n
-部手机 33 n
 部族 643 n
 部标 4 n
 部省级 2 n

include/cppjieba/DictTrie.hpp

@@ -1,15 +1,15 @@
 #ifndef CPPJIEBA_DICT_TRIE_HPP
 #define CPPJIEBA_DICT_TRIE_HPP
-#include <iostream>
+#include <algorithm>
 #include <fstream>
-#include <map>
-#include <string>
 #include <cstring>
 #include <cstdlib>
-#include <stdint.h>
 #include <cmath>
-#include <limits>
+#include <deque>
+#include <set>
+#include <string>
+#include <unordered_set>
 #include "limonp/StringUtil.hpp"
 #include "limonp/Logging.hpp"
 #include "Unicode.hpp"
@@ -17,8 +17,6 @@
 namespace cppjieba {
-using namespace limonp;
-
 const double MIN_DOUBLE = -3.14e+100;
 const double MAX_DOUBLE = 3.14e+100;
 const size_t DICT_COLUMN_NUM = 3;
@@ -32,7 +30,7 @@ class DictTrie {
     WordWeightMax,
   }; // enum UserWordWeightOption
-  DictTrie(const string& dict_path, const string& user_dict_paths = "", UserWordWeightOption user_word_weight_opt = WordWeightMedian) {
+  DictTrie(const std::string& dict_path, const std::string& user_dict_paths = "", UserWordWeightOption user_word_weight_opt = WordWeightMedian) {
     Init(dict_path, user_dict_paths, user_word_weight_opt);
   }
@@ -40,7 +38,7 @@ class DictTrie {
     delete trie_;
   }
-  bool InsertUserWord(const string& word, const string& tag = UNKNOWN_TAG) {
+  bool InsertUserWord(const std::string& word, const std::string& tag = UNKNOWN_TAG) {
     DictUnit node_info;
     if (!MakeNodeInfo(node_info, word, user_word_default_weight_, tag)) {
       return false;
@@ -50,7 +48,7 @@ class DictTrie {
     return true;
   }
-  bool InsertUserWord(const string& word,int freq, const string& tag = UNKNOWN_TAG) {
+  bool InsertUserWord(const std::string& word,int freq, const std::string& tag = UNKNOWN_TAG) {
     DictUnit node_info;
     double weight = freq ? log(1.0 * freq / freq_sum_) : user_word_default_weight_ ;
     if (!MakeNodeInfo(node_info, word, weight , tag)) {
@@ -61,7 +59,7 @@ class DictTrie {
     return true;
   }
-  bool DeleteUserWord(const string& word, const string& tag = UNKNOWN_TAG) {
+  bool DeleteUserWord(const std::string& word, const std::string& tag = UNKNOWN_TAG) {
     DictUnit node_info;
     if (!MakeNodeInfo(node_info, word, user_word_default_weight_, tag)) {
       return false;
@@ -69,23 +67,23 @@ class DictTrie {
     trie_->DeleteNode(node_info.word, &node_info);
     return true;
   }
   const DictUnit* Find(RuneStrArray::const_iterator begin, RuneStrArray::const_iterator end) const {
     return trie_->Find(begin, end);
   }
   void Find(RuneStrArray::const_iterator begin,
         RuneStrArray::const_iterator end,
-        vector<struct Dag>&res,
+        std::vector<struct Dag>&res,
         size_t max_word_len = MAX_WORD_LENGTH) const {
     trie_->Find(begin, end, res, max_word_len);
   }
-  bool Find(const string& word)
+  bool Find(const std::string& word)
   {
     const DictUnit *tmp = NULL;
     RuneStrArray runes;
-    if (!DecodeRunesInString(word, runes))
+    if (!DecodeUTF8RunesInString(word, runes))
     {
       XLOG(ERROR) << "Decode failed.";
     }
@@ -108,18 +106,18 @@ class DictTrie {
     return min_weight_;
   }
-  void InserUserDictNode(const string& line) {
-    vector<string> buf;
+  void InserUserDictNode(const std::string& line) {
+    std::vector<std::string> buf;
     DictUnit node_info;
-    Split(line, buf, " ");
+    limonp::Split(line, buf, " ");
     if(buf.size() == 1){
       MakeNodeInfo(node_info,
             buf[0],
             user_word_default_weight_,
            UNKNOWN_TAG);
     } else if (buf.size() == 2) {
       MakeNodeInfo(node_info,
             buf[0],
             user_word_default_weight_,
             buf[1]);
     } else if (buf.size() == 3) {
@@ -133,29 +131,28 @@ class DictTrie {
       user_dict_single_chinese_word_.insert(node_info.word[0]);
     }
   }
-  void LoadUserDict(const vector<string>& buf) {
+  void LoadUserDict(const std::vector<std::string>& buf) {
     for (size_t i = 0; i < buf.size(); i++) {
       InserUserDictNode(buf[i]);
     }
   }
-  void LoadUserDict(const set<string>& buf) {
-    std::set<string>::const_iterator iter;
+  void LoadUserDict(const std::set<std::string>& buf) {
+    std::set<std::string>::const_iterator iter;
     for (iter = buf.begin(); iter != buf.end(); iter++){
       InserUserDictNode(*iter);
     }
   }
-  void LoadUserDict(const string& filePaths) {
-    vector<string> files = limonp::Split(filePaths, "|;");
-    size_t lineno = 0;
+  void LoadUserDict(const std::string& filePaths) {
+    std::vector<std::string> files = limonp::Split(filePaths, "|;");
     for (size_t i = 0; i < files.size(); i++) {
-      ifstream ifs(files[i].c_str());
+      std::ifstream ifs(files[i].c_str());
       XCHECK(ifs.is_open()) << "open " << files[i] << " failed";
-      string line;
-      for (; getline(ifs, line); lineno++) {
+      std::string line;
+      while(getline(ifs, line)) {
        if (line.size() == 0) {
          continue;
        }
@@ -166,7 +163,7 @@ class DictTrie {
 private:
-  void Init(const string& dict_path, const string& user_dict_paths, UserWordWeightOption user_word_weight_opt) {
+  void Init(const std::string& dict_path, const std::string& user_dict_paths, UserWordWeightOption user_word_weight_opt) {
     LoadDict(dict_path);
     freq_sum_ = CalcFreqSum(static_node_infos_);
     CalculateWeight(static_node_infos_, freq_sum_);
@@ -178,11 +175,11 @@ class DictTrie {
     Shrink(static_node_infos_);
     CreateTrie(static_node_infos_);
   }
-  void CreateTrie(const vector<DictUnit>& dictUnits) {
+  void CreateTrie(const std::vector<DictUnit>& dictUnits) {
     assert(dictUnits.size());
-    vector<Unicode> words;
-    vector<const DictUnit*> valuePointers;
+    std::vector<Unicode> words;
+    std::vector<const DictUnit*> valuePointers;
     for (size_t i = 0 ; i < dictUnits.size(); i ++) {
       words.push_back(dictUnits[i].word);
       valuePointers.push_back(&dictUnits[i]);
@@ -191,15 +188,12 @@ class DictTrie {
     trie_ = new Trie(words, valuePointers);
   }
   bool MakeNodeInfo(DictUnit& node_info,
-        const string& word,
+        const std::string& word,
         double weight,
-        const string& tag) {
-    if (!DecodeRunesInString(word, node_info.word)) {
-      XLOG(ERROR) << "Decode " << word << " failed.";
+        const std::string& tag) {
+    if (!DecodeUTF8RunesInString(word, node_info.word)) {
+      XLOG(ERROR) << "UTF-8 decode failed for dict word: " << word;
       return false;
     }
     node_info.weight = weight;
@@ -207,19 +201,19 @@ class DictTrie {
     return true;
   }
-  void LoadDict(const string& filePath) {
-    ifstream ifs(filePath.c_str());
+  void LoadDict(const std::string& filePath) {
+    std::ifstream ifs(filePath.c_str());
     XCHECK(ifs.is_open()) << "open " << filePath << " failed.";
-    string line;
-    vector<string> buf;
+    std::string line;
+    std::vector<std::string> buf;
     DictUnit node_info;
-    for (size_t lineno = 0; getline(ifs, line); lineno++) {
-      Split(line, buf, " ");
+    while (getline(ifs, line)) {
+      limonp::Split(line, buf, " ");
       XCHECK(buf.size() == DICT_COLUMN_NUM) << "split result illegal, line:" << line;
       MakeNodeInfo(node_info,
             buf[0],
             atof(buf[1].c_str()),
             buf[2]);
       static_node_infos_.push_back(node_info);
     }
@@ -231,8 +225,8 @@ class DictTrie {
   void SetStaticWordWeights(UserWordWeightOption option) {
     XCHECK(!static_node_infos_.empty());
-    vector<DictUnit> x = static_node_infos_;
-    sort(x.begin(), x.end(), WeightCompare);
+    std::vector<DictUnit> x = static_node_infos_;
+    std::sort(x.begin(), x.end(), WeightCompare);
     min_weight_ = x[0].weight;
     max_weight_ = x[x.size() - 1].weight;
     median_weight_ = x[x.size() / 2].weight;
@@ -249,7 +243,7 @@ class DictTrie {
     }
   }
-  double CalcFreqSum(const vector<DictUnit>& node_infos) const {
+  double CalcFreqSum(const std::vector<DictUnit>& node_infos) const {
     double sum = 0.0;
     for (size_t i = 0; i < node_infos.size(); i++) {
       sum += node_infos[i].weight;
@@ -257,7 +251,7 @@ class DictTrie {
     return sum;
   }
-  void CalculateWeight(vector<DictUnit>& node_infos, double sum) const {
+  void CalculateWeight(std::vector<DictUnit>& node_infos, double sum) const {
     assert(sum > 0.0);
     for (size_t i = 0; i < node_infos.size(); i++) {
       DictUnit& node_info = node_infos[i];
@@ -266,12 +260,12 @@ class DictTrie {
     }
   }
-  void Shrink(vector<DictUnit>& units) const {
-    vector<DictUnit>(units.begin(), units.end()).swap(units);
+  void Shrink(std::vector<DictUnit>& units) const {
+    std::vector<DictUnit>(units.begin(), units.end()).swap(units);
   }
-  vector<DictUnit> static_node_infos_;
-  deque<DictUnit> active_node_infos_; // must not be vector
+  std::vector<DictUnit> static_node_infos_;
+  std::deque<DictUnit> active_node_infos_; // must not be std::vector
   Trie * trie_;
   double freq_sum_;
@@ -279,7 +273,7 @@ class DictTrie {
   double max_weight_;
   double median_weight_;
   double user_word_default_weight_;
-  unordered_set<Rune> user_dict_single_chinese_word_;
+  std::unordered_set<Rune> user_dict_single_chinese_word_;
 };
 }

include/cppjieba/HMMModel.hpp

@@ -105,7 +105,7 @@ struct HMMModel {
       XLOG(ERROR) << "emitProb illegal.";
       return false;
     }
-    if (!DecodeRunesInString(tmp2[0], unicode) || unicode.size() != 1) {
+    if (!DecodeUTF8RunesInString(tmp2[0], unicode) || unicode.size() != 1) {
       XLOG(ERROR) << "TransCode failed.";
       return false;
     }

include/cppjieba/Jieba.hpp

@@ -8,19 +8,21 @@ namespace cppjieba {
 class Jieba {
  public:
-  Jieba(const string& dict_path,
-        const string& model_path,
-        const string& user_dict_path,
-        const string& idfPath,
-        const string& stopWordPath)
-    : dict_trie_(dict_path, user_dict_path),
-      model_(model_path),
+  Jieba(const string& dict_path = "",
+        const string& model_path = "",
+        const string& user_dict_path = "",
+        const string& idf_path = "",
+        const string& stop_word_path = "")
+    : dict_trie_(getPath(dict_path, "jieba.dict.utf8"), getPath(user_dict_path, "user.dict.utf8")),
+      model_(getPath(model_path, "hmm_model.utf8")),
       mp_seg_(&dict_trie_),
       hmm_seg_(&model_),
       mix_seg_(&dict_trie_, &model_),
       full_seg_(&dict_trie_),
       query_seg_(&dict_trie_, &model_),
-      extractor(&dict_trie_, &model_, idfPath, stopWordPath) {
+      extractor(&dict_trie_, &model_,
+                getPath(idf_path, "idf.utf8"),
+                getPath(stop_word_path, "stop_words.utf8")) {
   }
   ~Jieba() {
   }
@@ -115,6 +117,39 @@ class Jieba {
   }
  private:
+  static string pathJoin(const string& dir, const string& filename) {
+    if (dir.empty()) {
+      return filename;
+    }
+    char last_char = dir[dir.length() - 1];
+    if (last_char == '/' || last_char == '\\') {
+      return dir + filename;
+    } else {
+#ifdef _WIN32
+      return dir + '\\' + filename;
+#else
+      return dir + '/' + filename;
+#endif
+    }
+  }
+
+  static string getCurrentDirectory() {
+    string path(__FILE__);
+    size_t pos = path.find_last_of("/\\");
+    return (pos == string::npos) ? "" : path.substr(0, pos);
+  }
+
+  static string getPath(const string& path, const string& default_file) {
+    if (path.empty()) {
+      string current_dir = getCurrentDirectory();
+      string parent_dir = current_dir.substr(0, current_dir.find_last_of("/\\"));
+      string grandparent_dir = parent_dir.substr(0, parent_dir.find_last_of("/\\"));
+      return pathJoin(pathJoin(grandparent_dir, "dict"), default_file);
+    }
+    return path;
+  }
+
   DictTrie dict_trie_;
   HMMModel model_;

include/cppjieba/KeywordExtractor.hpp

@@ -1,37 +1,35 @@
 #ifndef CPPJIEBA_KEYWORD_EXTRACTOR_H
 #define CPPJIEBA_KEYWORD_EXTRACTOR_H
-#include <cmath>
-#include <set>
+#include <algorithm>
+#include <unordered_map>
+#include <unordered_set>
 #include "MixSegment.hpp"
 namespace cppjieba {
-using namespace limonp;
-using namespace std;
 /*utf8*/
 class KeywordExtractor {
  public:
   struct Word {
-    string word;
-    vector<size_t> offsets;
+    std::string word;
+    std::vector<size_t> offsets;
     double weight;
   }; // struct Word
-  KeywordExtractor(const string& dictPath,
-        const string& hmmFilePath,
-        const string& idfPath,
-        const string& stopWordPath,
-        const string& userDict = "")
+  KeywordExtractor(const std::string& dictPath,
+        const std::string& hmmFilePath,
+        const std::string& idfPath,
+        const std::string& stopWordPath,
+        const std::string& userDict = "")
     : segment_(dictPath, hmmFilePath, userDict) {
     LoadIdfDict(idfPath);
     LoadStopWordDict(stopWordPath);
   }
   KeywordExtractor(const DictTrie* dictTrie,
         const HMMModel* model,
-        const string& idfPath,
-        const string& stopWordPath)
+        const std::string& idfPath,
+        const std::string& stopWordPath)
     : segment_(dictTrie, model) {
     LoadIdfDict(idfPath);
     LoadStopWordDict(stopWordPath);
@@ -39,27 +37,27 @@ class KeywordExtractor {
   ~KeywordExtractor() {
   }
-  void Extract(const string& sentence, vector<string>& keywords, size_t topN) const {
-    vector<Word> topWords;
+  void Extract(const std::string& sentence, std::vector<std::string>& keywords, size_t topN) const {
+    std::vector<Word> topWords;
     Extract(sentence, topWords, topN);
     for (size_t i = 0; i < topWords.size(); i++) {
       keywords.push_back(topWords[i].word);
     }
   }
-  void Extract(const string& sentence, vector<pair<string, double> >& keywords, size_t topN) const {
-    vector<Word> topWords;
+  void Extract(const std::string& sentence, std::vector<pair<std::string, double> >& keywords, size_t topN) const {
+    std::vector<Word> topWords;
     Extract(sentence, topWords, topN);
     for (size_t i = 0; i < topWords.size(); i++) {
-      keywords.push_back(pair<string, double>(topWords[i].word, topWords[i].weight));
+      keywords.push_back(pair<std::string, double>(topWords[i].word, topWords[i].weight));
     }
   }
-  void Extract(const string& sentence, vector<Word>& keywords, size_t topN) const {
-    vector<string> words;
+  void Extract(const std::string& sentence, std::vector<Word>& keywords, size_t topN) const {
+    std::vector<std::string> words;
     segment_.Cut(sentence, words);
-    map<string, Word> wordmap;
+    std::map<std::string, Word> wordmap;
     size_t offset = 0;
     for (size_t i = 0; i < words.size(); ++i) {
       size_t t = offset;
@@ -77,8 +75,8 @@ class KeywordExtractor {
     keywords.clear();
     keywords.reserve(wordmap.size());
-    for (map<string, Word>::iterator itr = wordmap.begin(); itr != wordmap.end(); ++itr) {
-      unordered_map<string, double>::const_iterator cit = idfMap_.find(itr->first);
+    for (std::map<std::string, Word>::iterator itr = wordmap.begin(); itr != wordmap.end(); ++itr) {
+      std::unordered_map<std::string, double>::const_iterator cit = idfMap_.find(itr->first);
       if (cit != idfMap_.end()) {
         itr->second.weight *= cit->second;
       } else {
@@ -88,15 +86,15 @@ class KeywordExtractor {
       keywords.push_back(itr->second);
     }
     topN = min(topN, keywords.size());
-    partial_sort(keywords.begin(), keywords.begin() + topN, keywords.end(), Compare);
+    std::partial_sort(keywords.begin(), keywords.begin() + topN, keywords.end(), Compare);
     keywords.resize(topN);
   }
  private:
-  void LoadIdfDict(const string& idfPath) {
+  void LoadIdfDict(const std::string& idfPath) {
ifstream ifs(idfPath.c_str()); std::ifstream ifs(idfPath.c_str());
XCHECK(ifs.is_open()) << "open " << idfPath << " failed"; XCHECK(ifs.is_open()) << "open " << idfPath << " failed";
string line ; std::string line ;
vector<string> buf; std::vector<std::string> buf;
double idf = 0.0; double idf = 0.0;
double idfSum = 0.0; double idfSum = 0.0;
size_t lineno = 0; size_t lineno = 0;
@ -106,7 +104,7 @@ class KeywordExtractor {
XLOG(ERROR) << "lineno: " << lineno << " empty. skipped."; XLOG(ERROR) << "lineno: " << lineno << " empty. skipped.";
continue; continue;
} }
Split(line, buf, " "); limonp::Split(line, buf, " ");
if (buf.size() != 2) { if (buf.size() != 2) {
XLOG(ERROR) << "line: " << line << ", lineno: " << lineno << " empty. skipped."; XLOG(ERROR) << "line: " << line << ", lineno: " << lineno << " empty. skipped.";
continue; continue;
@ -121,10 +119,10 @@ class KeywordExtractor {
idfAverage_ = idfSum / lineno; idfAverage_ = idfSum / lineno;
assert(idfAverage_ > 0.0); assert(idfAverage_ > 0.0);
} }
void LoadStopWordDict(const string& filePath) { void LoadStopWordDict(const std::string& filePath) {
ifstream ifs(filePath.c_str()); std::ifstream ifs(filePath.c_str());
XCHECK(ifs.is_open()) << "open " << filePath << " failed"; XCHECK(ifs.is_open()) << "open " << filePath << " failed";
string line ; std::string line ;
while (getline(ifs, line)) { while (getline(ifs, line)) {
stopWords_.insert(line); stopWords_.insert(line);
} }
@ -136,18 +134,16 @@ class KeywordExtractor {
} }
MixSegment segment_; MixSegment segment_;
unordered_map<string, double> idfMap_; std::unordered_map<std::string, double> idfMap_;
double idfAverage_; double idfAverage_;
unordered_set<string> stopWords_; std::unordered_set<std::string> stopWords_;
}; // class KeywordExtractor }; // class KeywordExtractor
inline ostream& operator << (ostream& os, const KeywordExtractor::Word& word) { inline std::ostream& operator << (std::ostream& os, const KeywordExtractor::Word& word) {
return os << "{\"word\": \"" << word.word << "\", \"offset\": " << word.offsets << ", \"weight\": " << word.weight << "}"; return os << "{\"word\": \"" << word.word << "\", \"offset\": " << word.offsets << ", \"weight\": " << word.weight << "}";
} }
} // namespace cppjieba } // namespace cppjieba
#endif #endif


@@ -34,8 +34,8 @@ class PosTagger {
   RuneStrArray runes;
   const DictTrie * dict = segment.GetDictTrie();
   assert(dict != NULL);
-  if (!DecodeRunesInString(str, runes)) {
-    XLOG(ERROR) << "Decode failed.";
+  if (!DecodeUTF8RunesInString(str, runes)) {
+    XLOG(ERROR) << "UTF-8 decode failed for word: " << str;
     return POS_X;
   }
   tmp = dict->Find(runes.begin(), runes.end());


@@ -17,8 +17,8 @@ class PreFilter {
   PreFilter(const unordered_set<Rune>& symbols,
             const string& sentence)
     : symbols_(symbols) {
-    if (!DecodeRunesInString(sentence, sentence_)) {
-      XLOG(ERROR) << "decode failed. ";
+    if (!DecodeUTF8RunesInString(sentence, sentence_)) {
+      XLOG(ERROR) << "UTF-8 decode failed for input sentence";
     }
     cursor_ = sentence_.begin();
   }


@@ -25,8 +25,8 @@ class SegmentBase {
   bool ResetSeparators(const string& s) {
     symbols_.clear();
     RuneStrArray runes;
-    if (!DecodeRunesInString(s, runes)) {
-      XLOG(ERROR) << "decode " << s << " failed";
+    if (!DecodeUTF8RunesInString(s, runes)) {
+      XLOG(ERROR) << "UTF-8 decode failed for separators: " << s;
       return false;
     }
     for (size_t i = 0; i < runes.size(); i++) {


@@ -84,7 +84,7 @@ struct RuneStrLite {
   }
 }; // struct RuneStrLite
-inline RuneStrLite DecodeRuneInString(const char* str, size_t len) {
+inline RuneStrLite DecodeUTF8ToRune(const char* str, size_t len) {
   RuneStrLite rp(0, 0);
   if (str == NULL || len == 0) {
     return rp;
@@ -139,11 +139,11 @@ inline RuneStrLite DecodeRuneInString(const char* str, size_t len) {
   return rp;
 }
-inline bool DecodeRunesInString(const char* s, size_t len, RuneStrArray& runes) {
+inline bool DecodeUTF8RunesInString(const char* s, size_t len, RuneStrArray& runes) {
   runes.clear();
   runes.reserve(len / 2);
   for (uint32_t i = 0, j = 0; i < len;) {
-    RuneStrLite rp = DecodeRuneInString(s + i, len - i);
+    RuneStrLite rp = DecodeUTF8ToRune(s + i, len - i);
     if (rp.len == 0) {
       runes.clear();
       return false;
@@ -156,14 +156,14 @@ inline bool DecodeRunesInString(const char* s, size_t len, RuneStrArray& runes)
   return true;
 }
-inline bool DecodeRunesInString(const string& s, RuneStrArray& runes) {
-  return DecodeRunesInString(s.c_str(), s.size(), runes);
+inline bool DecodeUTF8RunesInString(const string& s, RuneStrArray& runes) {
+  return DecodeUTF8RunesInString(s.c_str(), s.size(), runes);
 }
-inline bool DecodeRunesInString(const char* s, size_t len, Unicode& unicode) {
+inline bool DecodeUTF8RunesInString(const char* s, size_t len, Unicode& unicode) {
   unicode.clear();
   RuneStrArray runes;
-  if (!DecodeRunesInString(s, len, runes)) {
+  if (!DecodeUTF8RunesInString(s, len, runes)) {
     return false;
   }
   unicode.reserve(runes.size());
@@ -174,17 +174,17 @@ inline bool DecodeRunesInString(const char* s, size_t len, Unicode& unicode) {
 }
 inline bool IsSingleWord(const string& str) {
-  RuneStrLite rp = DecodeRuneInString(str.c_str(), str.size());
+  RuneStrLite rp = DecodeUTF8ToRune(str.c_str(), str.size());
   return rp.len == str.size();
 }
-inline bool DecodeRunesInString(const string& s, Unicode& unicode) {
-  return DecodeRunesInString(s.c_str(), s.size(), unicode);
+inline bool DecodeUTF8RunesInString(const string& s, Unicode& unicode) {
+  return DecodeUTF8RunesInString(s.c_str(), s.size(), unicode);
 }
-inline Unicode DecodeRunesInString(const string& s) {
+inline Unicode DecodeUTF8RunesInString(const string& s) {
   Unicode result;
-  DecodeRunesInString(s, result);
+  DecodeUTF8RunesInString(s, result);
   return result;
 }


@@ -1,6 +1,12 @@
 SET(EXECUTABLE_OUTPUT_PATH ${PROJECT_BINARY_DIR})
-if(NOT MSVC)
-ADD_EXECUTABLE(load_test load_test.cpp)
-ADD_SUBDIRECTORY(unittest)
-endif()
+# Configure test paths
+configure_file("${CMAKE_CURRENT_SOURCE_DIR}/test_paths.h.in" "${CMAKE_BINARY_DIR}/test/test_paths.h")
+INCLUDE_DIRECTORIES(
+  ${CMAKE_CURRENT_BINARY_DIR}
+  ${CMAKE_BINARY_DIR}/test
+)
+ADD_EXECUTABLE(load_test load_test.cpp)
+ADD_SUBDIRECTORY(unittest)


@@ -6,14 +6,15 @@
 #include "cppjieba/MixSegment.hpp"
 #include "cppjieba/KeywordExtractor.hpp"
 #include "limonp/Colors.hpp"
+#include "test_paths.h"
 using namespace cppjieba;
 void Cut(size_t times = 50) {
-  MixSegment seg("../dict/jieba.dict.utf8", "../dict/hmm_model.utf8");
+  MixSegment seg(DICT_DIR "/jieba.dict.utf8", DICT_DIR "/hmm_model.utf8");
   vector<string> res;
   string doc;
-  ifstream ifs("../test/testdata/weicheng.utf8");
+  ifstream ifs(TEST_DATA_DIR "/weicheng.utf8");
   assert(ifs);
   doc << ifs;
   long beginTime = clock();
@@ -29,10 +30,13 @@ void Cut(size_t times = 50) {
 }
 void Extract(size_t times = 400) {
-  KeywordExtractor Extractor("../dict/jieba.dict.utf8", "../dict/hmm_model.utf8", "../dict/idf.utf8", "../dict/stop_words.utf8");
+  KeywordExtractor Extractor(DICT_DIR "/jieba.dict.utf8",
+                             DICT_DIR "/hmm_model.utf8",
+                             DICT_DIR "/idf.utf8",
+                             DICT_DIR "/stop_words.utf8");
   vector<string> words;
   string doc;
-  ifstream ifs("../test/testdata/review.100");
+  ifstream ifs(TEST_DATA_DIR "/review.100");
   assert(ifs);
   doc << ifs;
   long beginTime = clock();
@@ -51,4 +55,4 @@ int main(int argc, char ** argv) {
   Cut();
   Extract();
   return EXIT_SUCCESS;
 }

test/test_paths.h.in Normal file
@@ -0,0 +1,7 @@
+#ifndef TEST_PATHS_H
+#define TEST_PATHS_H
+
+#define TEST_DATA_DIR "@CMAKE_CURRENT_SOURCE_DIR@/testdata"
+#define DICT_DIR "@CMAKE_SOURCE_DIR@/dict"
+
+#endif // TEST_PATHS_H


@@ -1,6 +1,8 @@
+message(STATUS "MSVC value: ${MSVC}")
 if (MSVC)
   set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreadedDebugDLL")
   set(gtest_force_shared_crt ON CACHE BOOL "" FORCE)
+  add_compile_options(/utf-8)
 endif()
 include(FetchContent)
@@ -8,7 +10,7 @@ include(FetchContent)
 FetchContent_Declare(
   googletest
   GIT_REPOSITORY https://github.com/google/googletest.git
-  GIT_TAG release-1.11.0
+  GIT_TAG release-1.12.1
 )
 FetchContent_MakeAvailable(googletest)
@@ -18,6 +20,12 @@ SET(LIBRARY_OUTPUT_PATH ${PROJECT_BINARY_DIR}/lib)
 ADD_DEFINITIONS(-DLOGGING_LEVEL=LL_WARNING)
+# Add include directories
+INCLUDE_DIRECTORIES(
+  ${CMAKE_CURRENT_BINARY_DIR}
+  ${CMAKE_BINARY_DIR}/test
+)
 ADD_EXECUTABLE(test.run
   gtest_main.cpp
   keyword_extractor_test.cpp


@@ -1,14 +1,11 @@
 #include "cppjieba/Jieba.hpp"
 #include "gtest/gtest.h"
+#include "test_paths.h"
 using namespace cppjieba;
-TEST(JiebaTest, Test1) {
-  cppjieba::Jieba jieba("../dict/jieba.dict.utf8",
-                        "../dict/hmm_model.utf8",
-                        "../dict/user.dict.utf8",
-                        "../dict/idf.utf8",
-                        "../dict/stop_words.utf8");
+TEST(JiebaTest, Test0) {
+  cppjieba::Jieba jieba;
   vector<string> words;
   string result;
@@ -38,14 +35,51 @@ TEST(JiebaTest, Test1) {
   jieba.CutForSearch("他来到了网易杭研大厦", words);
   result << words;
   ASSERT_EQ("[\"他\", \"来到\", \"了\", \"网易\", \"杭研\", \"大厦\"]", result);
 }
+TEST(JiebaTest, Test1) {
+  cppjieba::Jieba jieba(DICT_DIR "/jieba.dict.utf8",
+                        DICT_DIR "/hmm_model.utf8",
+                        DICT_DIR "/user.dict.utf8",
+                        DICT_DIR "/idf.utf8",
+                        DICT_DIR "/stop_words.utf8");
+  vector<string> words;
+  string result;
+  jieba.Cut("他来到了网易杭研大厦", words);
+  result << words;
+  ASSERT_EQ("[\"他\", \"来到\", \"了\", \"网易\", \"杭研\", \"大厦\"]", result);
+  jieba.Cut("我来自北京邮电大学。", words, false);
+  result << words;
+  ASSERT_EQ("[\"我\", \"来自\", \"北京邮电大学\", \"。\"]", result);
+  jieba.CutSmall("南京市长江大桥", words, 3);
+  ASSERT_EQ("[\"南京市\", \"长江\", \"大桥\"]", result << words);
+  jieba.CutHMM("我来自北京邮电大学。。。学号123456", words);
+  result << words;
+  ASSERT_EQ("[\"我来\", \"自北京\", \"邮电大学\", \"。\", \"。\", \"。\", \"学号\", \"123456\"]", result);
+  jieba.Cut("我来自北京邮电大学。。。学号123456,用AK47", words);
+  result << words;
+  ASSERT_EQ("[\"我\", \"来自\", \"北京邮电大学\", \"。\", \"。\", \"。\", \"学号\", \"123456\", \",\", \"用\", \"AK47\"]", result);
+  jieba.CutAll("我来自北京邮电大学", words);
+  result << words;
+  ASSERT_EQ(result, "[\"我\", \"来自\", \"北京\", \"北京邮电\", \"北京邮电大学\", \"邮电\", \"邮电大学\", \"电大\", \"大学\"]");
+  jieba.CutForSearch("他来到了网易杭研大厦", words);
+  result << words;
+  ASSERT_EQ("[\"他\", \"来到\", \"了\", \"网易\", \"杭研\", \"大厦\"]", result);
+}
 TEST(JiebaTest, WordTest) {
-  cppjieba::Jieba jieba("../dict/jieba.dict.utf8",
-                        "../dict/hmm_model.utf8",
-                        "../dict/user.dict.utf8",
-                        "../dict/idf.utf8",
-                        "../dict/stop_words.utf8");
+  cppjieba::Jieba jieba(DICT_DIR "/jieba.dict.utf8",
+                        DICT_DIR "/hmm_model.utf8",
+                        DICT_DIR "/user.dict.utf8",
+                        DICT_DIR "/idf.utf8",
+                        DICT_DIR "/stop_words.utf8");
   vector<Word> words;
   string result;
@@ -83,11 +117,11 @@ TEST(JiebaTest, WordTest) {
 }
 TEST(JiebaTest, InsertUserWord) {
-  cppjieba::Jieba jieba("../dict/jieba.dict.utf8",
-                        "../dict/hmm_model.utf8",
-                        "../dict/user.dict.utf8",
-                        "../dict/idf.utf8",
-                        "../dict/stop_words.utf8");
+  cppjieba::Jieba jieba(DICT_DIR "/jieba.dict.utf8",
+                        DICT_DIR "/hmm_model.utf8",
+                        DICT_DIR "/user.dict.utf8",
+                        DICT_DIR "/idf.utf8",
+                        DICT_DIR "/stop_words.utf8");
   vector<string> words;
   string result;


@@ -1,10 +1,14 @@
 #include "cppjieba/KeywordExtractor.hpp"
 #include "gtest/gtest.h"
+#include "test_paths.h"
 using namespace cppjieba;
 TEST(KeywordExtractorTest, Test1) {
-  KeywordExtractor Extractor("../test/testdata/extra_dict/jieba.dict.small.utf8", "../dict/hmm_model.utf8", "../dict/idf.utf8", "../dict/stop_words.utf8");
+  KeywordExtractor Extractor(TEST_DATA_DIR "/extra_dict/jieba.dict.small.utf8",
+                             DICT_DIR "/hmm_model.utf8",
+                             DICT_DIR "/idf.utf8",
+                             DICT_DIR "/stop_words.utf8");
   {
     string s("你好世界世界而且而且");
@@ -55,7 +59,11 @@ TEST(KeywordExtractorTest, Test1) {
 }
 TEST(KeywordExtractorTest, Test2) {
-  KeywordExtractor Extractor("../test/testdata/extra_dict/jieba.dict.small.utf8", "../dict/hmm_model.utf8", "../dict/idf.utf8", "../dict/stop_words.utf8", "../test/testdata/userdict.utf8");
+  KeywordExtractor Extractor(TEST_DATA_DIR "/extra_dict/jieba.dict.small.utf8",
+                             DICT_DIR "/hmm_model.utf8",
+                             DICT_DIR "/idf.utf8",
+                             DICT_DIR "/stop_words.utf8",
+                             TEST_DATA_DIR "/userdict.utf8");
   {
     string s("蓝翔优秀毕业生");


@@ -1,5 +1,6 @@
 #include "cppjieba/MixSegment.hpp"
 #include "gtest/gtest.h"
+#include "test_paths.h"
 using namespace cppjieba;
@@ -13,7 +14,7 @@ static const char * const ANS_TEST3 = "[iPhone6:eng, 手机:n, 的:uj, 最大:a,
 //static const char * const ANS_TEST3 = "";
 TEST(PosTaggerTest, Test) {
-  MixSegment tagger("../dict/jieba.dict.utf8", "../dict/hmm_model.utf8");
+  MixSegment tagger(DICT_DIR "/jieba.dict.utf8", DICT_DIR "/hmm_model.utf8");
   {
     vector<pair<string, string> > res;
     tagger.Tag(QUERY_TEST1, res);
@@ -23,7 +24,7 @@ TEST(PosTaggerTest, Test) {
   }
 }
 TEST(PosTagger, TestUserDict) {
-  MixSegment tagger("../dict/jieba.dict.utf8", "../dict/hmm_model.utf8", "../test/testdata/userdict.utf8");
+  MixSegment tagger(DICT_DIR "/jieba.dict.utf8", DICT_DIR "/hmm_model.utf8", TEST_DATA_DIR "/userdict.utf8");
   {
     vector<pair<string, string> > res;
     tagger.Tag(QUERY_TEST2, res);


@@ -5,11 +5,12 @@
 #include "cppjieba/FullSegment.hpp"
 #include "cppjieba/QuerySegment.hpp"
 #include "gtest/gtest.h"
+#include "test_paths.h"
 using namespace cppjieba;
 TEST(MixSegmentTest, Test1) {
-  MixSegment segment("../dict/jieba.dict.utf8", "../dict/hmm_model.utf8");;
+  MixSegment segment(DICT_DIR "/jieba.dict.utf8", DICT_DIR "/hmm_model.utf8");
   string sentence;
   vector<string> words;
   string actual;
@@ -49,16 +50,18 @@ TEST(MixSegmentTest, Test1) {
 }
 TEST(MixSegmentTest, NoUserDict) {
-  MixSegment segment("../test/testdata/extra_dict/jieba.dict.small.utf8", "../dict/hmm_model.utf8");
+  MixSegment segment(TEST_DATA_DIR "/extra_dict/jieba.dict.small.utf8", DICT_DIR "/hmm_model.utf8");
   const char* str = "令狐冲是云计算方面的专家";
   vector<string> words;
   segment.Cut(str, words);
   string res;
   ASSERT_EQ("[\"令狐冲\", \"是\", \"云\", \"计算\", \"方面\", \"的\", \"专家\"]", res << words);
 }
 TEST(MixSegmentTest, UserDict) {
-  MixSegment segment("../test/testdata/extra_dict/jieba.dict.small.utf8", "../dict/hmm_model.utf8", "../dict/user.dict.utf8");
+  MixSegment segment(TEST_DATA_DIR "/extra_dict/jieba.dict.small.utf8",
+                     DICT_DIR "/hmm_model.utf8",
+                     DICT_DIR "/user.dict.utf8");
   {
     const char* str = "令狐冲是云计算方面的专家";
     vector<string> words;
@@ -83,9 +86,10 @@ TEST(MixSegmentTest, UserDict) {
     ASSERT_EQ("[\"IBM\", \",\", \"3.14\"]", res);
   }
 }
 TEST(MixSegmentTest, TestUserDict) {
-  MixSegment segment("../test/testdata/extra_dict/jieba.dict.small.utf8", "../dict/hmm_model.utf8",
-                     "../test/testdata/userdict.utf8");
+  MixSegment segment(TEST_DATA_DIR "/extra_dict/jieba.dict.small.utf8", DICT_DIR "/hmm_model.utf8",
+                     TEST_DATA_DIR "/userdict.utf8");
   vector<string> words;
   string res;
@@ -123,8 +127,8 @@ TEST(MixSegmentTest, TestUserDict) {
 }
 TEST(MixSegmentTest, TestMultiUserDict) {
-  MixSegment segment("../test/testdata/extra_dict/jieba.dict.small.utf8", "../dict/hmm_model.utf8",
-                     "../test/testdata/userdict.utf8;../test/testdata/userdict.2.utf8");
+  MixSegment segment(TEST_DATA_DIR "/extra_dict/jieba.dict.small.utf8", DICT_DIR "/hmm_model.utf8",
+                     TEST_DATA_DIR "/userdict.utf8;" TEST_DATA_DIR "/userdict.2.utf8");
   vector<string> words;
   string res;
@@ -134,7 +138,7 @@ TEST(MixSegmentTest, TestMultiUserDict) {
 }
 TEST(MPSegmentTest, Test1) {
-  MPSegment segment("../dict/jieba.dict.utf8");;
+  MPSegment segment(DICT_DIR "/jieba.dict.utf8");
   string s;
   vector<string> words;
   segment.Cut("我来自北京邮电大学。", words);
@@ -163,7 +167,7 @@ TEST(MPSegmentTest, Test1) {
 }
 TEST(HMMSegmentTest, Test1) {
-  HMMSegment segment("../dict/hmm_model.utf8");;
+  HMMSegment segment(DICT_DIR "/hmm_model.utf8");
   {
     const char* str = "我来自北京邮电大学。。。学号123456";
     const char* res[] = {"我来", "自北京", "邮电大学", "。", "。", "。", "学号", "123456"};
@@ -182,7 +186,7 @@ TEST(HMMSegmentTest, Test1) {
 }
 TEST(FullSegment, Test1) {
-  FullSegment segment("../test/testdata/extra_dict/jieba.dict.small.utf8");
+  FullSegment segment(TEST_DATA_DIR "/extra_dict/jieba.dict.small.utf8");
   vector<string> words;
   string s;
@@ -197,7 +201,7 @@ TEST(FullSegment, Test1) {
 }
 TEST(QuerySegment, Test1) {
-  QuerySegment segment("../dict/jieba.dict.utf8", "../dict/hmm_model.utf8", "");
+  QuerySegment segment(DICT_DIR "/jieba.dict.utf8", DICT_DIR "/hmm_model.utf8", "");
   vector<string> words;
   string s1, s2;
@@ -218,7 +222,9 @@ TEST(QuerySegment, Test1) {
 }
 TEST(QuerySegment, Test2) {
-  QuerySegment segment("../test/testdata/extra_dict/jieba.dict.small.utf8", "../dict/hmm_model.utf8", "../test/testdata/userdict.utf8|../test/testdata/userdict.english");
+  QuerySegment segment(TEST_DATA_DIR "/extra_dict/jieba.dict.small.utf8",
+                       DICT_DIR "/hmm_model.utf8",
+                       TEST_DATA_DIR "/userdict.utf8|" TEST_DATA_DIR "/userdict.english");
   vector<string> words;
   string s1, s2;
@@ -242,14 +248,13 @@ TEST(QuerySegment, Test2) {
     s2 = "中国/科学/学院/科学院/中国科学院";
     ASSERT_EQ(s1, s2);
   }
 }
 TEST(MPSegmentTest, Unicode32) {
   string s("天气很好,🙋 我们去郊游。");
   vector<string> words;
-  MPSegment segment("../dict/jieba.dict.utf8");;
+  MPSegment segment(DICT_DIR "/jieba.dict.utf8");
   segment.Cut(s, words);
   ASSERT_EQ(Join(words.begin(), words.end(), "/"), "天气/很/好/,/🙋/ /我们/去/郊游/。");


@@ -1,13 +1,14 @@
 #include "cppjieba/TextRankExtractor.hpp"
 #include "gtest/gtest.h"
+#include "test_paths.h"
 using namespace cppjieba;
 TEST(TextRankExtractorTest, Test1) {
   TextRankExtractor Extractor(
-      "../test/testdata/extra_dict/jieba.dict.small.utf8",
-      "../dict/hmm_model.utf8",
-      "../dict/stop_words.utf8");
+      TEST_DATA_DIR "/extra_dict/jieba.dict.small.utf8",
+      DICT_DIR "/hmm_model.utf8",
+      DICT_DIR "/stop_words.utf8");
   {
     string s("你好世界世界而且而且");
     string res;
@@ -59,10 +60,10 @@ TEST(TextRankExtractorTest, Test1) {
 TEST(TextRankExtractorTest, Test2) {
   TextRankExtractor Extractor(
-      "../test/testdata/extra_dict/jieba.dict.small.utf8",
-      "../dict/hmm_model.utf8",
-      "../dict/stop_words.utf8",
-      "../test/testdata/userdict.utf8");
+      TEST_DATA_DIR "/extra_dict/jieba.dict.small.utf8",
+      DICT_DIR "/hmm_model.utf8",
+      DICT_DIR "/stop_words.utf8",
+      TEST_DATA_DIR "/userdict.utf8");
   {
     string s("\xe8\x93\x9d\xe7\xbf\x94\xe4\xbc\x98\xe7\xa7\x80\xe6\xaf\x95\xe4\xb8\x9a\xe7\x94\x9f");


@@ -1,10 +1,11 @@
 #include "cppjieba/DictTrie.hpp"
 #include "cppjieba/MPSegment.hpp"
 #include "gtest/gtest.h"
+#include "test_paths.h"
 using namespace cppjieba;
-static const char* const DICT_FILE = "../test/testdata/extra_dict/jieba.dict.small.utf8";
+static const char* const DICT_FILE = TEST_DATA_DIR "/extra_dict/jieba.dict.small.utf8";
 TEST(TrieTest, Empty) {
   vector<Unicode> keys;
@@ -15,7 +16,7 @@ TEST(TrieTest, Empty) {
 TEST(TrieTest, Construct) {
   vector<Unicode> keys;
   vector<const DictUnit*> values;
-  keys.push_back(DecodeRunesInString("你"));
+  keys.push_back(DecodeUTF8RunesInString("你"));
   values.push_back((const DictUnit*)(NULL));
   Trie trie(keys, values);
 }
@@ -32,13 +33,7 @@ TEST(DictTrieTest, Test1) {
   ASSERT_LT(trie.GetMinWeight() + 15.6479, 0.001);
   string word("来到");
   cppjieba::RuneStrArray uni;
-  ASSERT_TRUE(DecodeRunesInString(word, uni));
-  //DictUnit nodeInfo;
-  //nodeInfo.word = uni;
-  //nodeInfo.tag = "v";
-  //nodeInfo.weight = -8.87033;
-  //s1 << nodeInfo;
-  //s2 << (*trie.Find(uni.begin(), uni.end()));
+  ASSERT_TRUE(DecodeUTF8RunesInString(word, uni));
   const DictUnit* du = trie.Find(uni.begin(), uni.end());
   ASSERT_TRUE(du != NULL);
   ASSERT_EQ(2u, du->word.size());
@@ -47,45 +42,42 @@ TEST(DictTrieTest, Test1) {
   ASSERT_EQ("v", du->tag);
   ASSERT_NEAR(-8.870, du->weight, 0.001);
-  //EXPECT_EQ("[\"26469\", \"21040\"] v -8.870", s2);
   word = "清华大学";
   LocalVector<pair<size_t, const DictUnit*> > res;
   const char * words[] = {"清", "清华", "清华大学"};
   for (size_t i = 0; i < sizeof(words)/sizeof(words[0]); i++) {
-    ASSERT_TRUE(DecodeRunesInString(words[i], uni));
+    ASSERT_TRUE(DecodeUTF8RunesInString(words[i], uni));
     res.push_back(make_pair(uni.size() - 1, trie.Find(uni.begin(), uni.end())));
-    //resMap[uni.size() - 1] = trie.Find(uni.begin(), uni.end());
   }
   vector<pair<size_t, const DictUnit*> > vec;
   vector<struct Dag> dags;
-  ASSERT_TRUE(DecodeRunesInString(word, uni));
+  ASSERT_TRUE(DecodeUTF8RunesInString(word, uni));
   trie.Find(uni.begin(), uni.end(), dags);
   ASSERT_EQ(dags.size(), uni.size());
   ASSERT_NE(dags.size(), 0u);
   s1 << res;
   s2 << dags[0].nexts;
   ASSERT_EQ(s1, s2);
 }
 TEST(DictTrieTest, UserDict) {
-  DictTrie trie(DICT_FILE, "../test/testdata/userdict.utf8");
+  DictTrie trie(DICT_FILE, TEST_DATA_DIR "/userdict.utf8");
   string word = "云计算";
   cppjieba::RuneStrArray unicode;
-  ASSERT_TRUE(DecodeRunesInString(word, unicode));
+  ASSERT_TRUE(DecodeUTF8RunesInString(word, unicode));
   const DictUnit * unit = trie.Find(unicode.begin(), unicode.end());
   ASSERT_TRUE(unit != NULL);
   ASSERT_NEAR(unit->weight, -14.100, 0.001);
   word = "蓝翔";
-  ASSERT_TRUE(DecodeRunesInString(word, unicode));
+  ASSERT_TRUE(DecodeUTF8RunesInString(word, unicode));
   unit = trie.Find(unicode.begin(), unicode.end());
   ASSERT_TRUE(unit != NULL);
   ASSERT_EQ(unit->tag, "nz");
   ASSERT_NEAR(unit->weight, -14.100, 0.001);
   word = "区块链";
-  ASSERT_TRUE(DecodeRunesInString(word, unicode));
+  ASSERT_TRUE(DecodeUTF8RunesInString(word, unicode));
   unit = trie.Find(unicode.begin(), unicode.end());
   ASSERT_TRUE(unit != NULL);
   ASSERT_EQ(unit->tag, "nz");
@@ -93,22 +85,22 @@ TEST(DictTrieTest, UserDict) {
 }
 TEST(DictTrieTest, UserDictWithMaxWeight) {
-  DictTrie trie(DICT_FILE, "../test/testdata/userdict.utf8", DictTrie::WordWeightMax);
+  DictTrie trie(DICT_FILE, TEST_DATA_DIR "/userdict.utf8", DictTrie::WordWeightMax);
   string word = "云计算";
   cppjieba::RuneStrArray unicode;
-  ASSERT_TRUE(DecodeRunesInString(word, unicode));
+  ASSERT_TRUE(DecodeUTF8RunesInString(word, unicode));
   const DictUnit * unit = trie.Find(unicode.begin(), unicode.end());
   ASSERT_TRUE(unit);
   ASSERT_NEAR(unit->weight, -2.975, 0.001);
 }
 TEST(DictTrieTest, Dag) {
-  DictTrie trie(DICT_FILE, "../test/testdata/userdict.utf8");
+  DictTrie trie(DICT_FILE, TEST_DATA_DIR "/userdict.utf8");
   {
     string word = "清华大学";
     cppjieba::RuneStrArray unicode;
-    ASSERT_TRUE(DecodeRunesInString(word, unicode));
+    ASSERT_TRUE(DecodeUTF8RunesInString(word, unicode));
     vector<struct Dag> res;
     trie.Find(unicode.begin(), unicode.end(), res);
@@ -122,7 +114,7 @@ TEST(DictTrieTest, Dag) {
   {
     string word = "北京邮电大学";
     cppjieba::RuneStrArray unicode;
-    ASSERT_TRUE(DecodeRunesInString(word, unicode));
+    ASSERT_TRUE(DecodeUTF8RunesInString(word, unicode));
     vector<struct Dag> res;
     trie.Find(unicode.begin(), unicode.end(), res);
@@ -136,7 +128,7 @@ TEST(DictTrieTest, Dag) {
   {
     string word = "长江大桥";
     cppjieba::RuneStrArray unicode;
-    ASSERT_TRUE(DecodeRunesInString(word, unicode));
+    ASSERT_TRUE(DecodeUTF8RunesInString(word, unicode));
     vector<struct Dag> res;
     trie.Find(unicode.begin(), unicode.end(), res);
@@ -150,7 +142,7 @@ TEST(DictTrieTest, Dag) {
   {
     string word = "长江大桥";
     cppjieba::RuneStrArray unicode;
-    ASSERT_TRUE(DecodeRunesInString(word, unicode));
+    ASSERT_TRUE(DecodeUTF8RunesInString(word, unicode));
     vector<struct Dag> res;
     trie.Find(unicode.begin(), unicode.end(), res, 3);
@@ -164,7 +156,7 @@ TEST(DictTrieTest, Dag) {
   {
     string word = "长江大桥";
     cppjieba::RuneStrArray unicode;
-    ASSERT_TRUE(DecodeRunesInString(word, unicode));
+    ASSERT_TRUE(DecodeUTF8RunesInString(word, unicode));
     vector<struct Dag> res;
     trie.Find(unicode.begin(), unicode.end(), res, 4);


@@ -8,7 +8,7 @@ using namespace std;
 TEST(UnicodeTest, Test1) {
   string s = "你好世界";
   RuneStrArray runes;
-  ASSERT_TRUE(DecodeRunesInString(s, runes));
+  ASSERT_TRUE(DecodeUTF8RunesInString(s, runes));
   string actual;
   string expected = "[\"{\"rune\": \"20320\", \"offset\": 0, \"len\": 3}\", \"{\"rune\": \"22909\", \"offset\": 3, \"len\": 3}\", \"{\"rune\": \"19990\", \"offset\": 6, \"len\": 3}\", \"{\"rune\": \"30028\", \"offset\": 9, \"len\": 3}\"]";
   actual << runes;
@@ -18,7 +18,7 @@ TEST(UnicodeTest, Test1) {
 TEST(UnicodeTest, Illegal) {
   string s = "123\x80";
   RuneStrArray runes;
-  ASSERT_FALSE(DecodeRunesInString(s, runes));
+  ASSERT_FALSE(DecodeUTF8RunesInString(s, runes));
   string actual;
   string expected = "[]";
   actual << runes;
@@ -38,6 +38,6 @@ TEST(UnicodeTest, Rand) {
       s[rand() % len] = rand();
     }
     RuneStrArray runes;
-    DecodeRunesInString(s, runes);
+    DecodeUTF8RunesInString(s, runes);
   }
 }