Update README.md

2025-07-18 00:00:12 +08:00 · 2013-09-18 11:23:25 +08:00 · 2013-09-18 11:23:25 +08:00 · 6aaa26f6cf
commit 6aaa26f6cf
parent 961575e339
1 changed files with 7 additions and 79 deletions
--- a/README.md
+++ b/README.md
@@ -1,39 +1,17 @@
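-#CppJieba is the C++ library of the "Jieba" Chinese word segmenter
+# CppJieba is the C++ library of the "Jieba" Chinese word segmenter
 ## Chinese encoding
 * Segmentation of utf8- and gbk-encoded text is now supported. The default encoding is utf8.  
-## Module details
+The `master` branch uses utf8 encoding; this branch supports gbk encoding.
-### Trie
+## Demo of the GBK version
 Trie.cpp/Trie.h load the dictionary into a trie, which is used mainly by the Segment module.
 ### Segment module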
 MPSegment.cpp/MPSegment.h 
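 Maximum Probability (MP) method: builds a directed acyclic graph (DAG) from the trie and runs dynamic programming over it. This is the core of the segmentation algorithm.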
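The DAG-plus-dynamic-programming idea can be sketched as follows. This is a minimal illustration, not the code in MPSegment.cpp: the `Dict` type, the `cutMaxProb` name, and the fixed `unknownLogProb` penalty are assumptions made for the example, and a `std::map` lookup on each substring stands in for the trie.

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy dictionary: word -> log-probability (not the CppJieba dictionary format).
using Dict = std::map<std::u16string, double>;

// Returns the segmentation of `text` with the highest total log-probability.
// A single character that is missing from the dictionary gets `unknownLogProb`
// (an assumption of this sketch), so a path through the DAG always exists.
std::vector<std::u16string> cutMaxProb(const std::u16string& text, const Dict& dict,
                                       double unknownLogProb = -20.0) {
    const size_t n = text.size();
    // best[i]: best score for the suffix text[i..n); next[i]: end of the first word on that path.
    std::vector<double> best(n + 1, -std::numeric_limits<double>::infinity());
    std::vector<size_t> next(n + 1, n);
    best[n] = 0.0;
    for (size_t i = n; i-- > 0;) {
        // Edges leaving position i: every dictionary word starting at i,
        // plus the single character as a fallback edge.
        for (size_t j = i + 1; j <= n; ++j) {
            const auto it = dict.find(text.substr(i, j - i));
            double weight;
            if (it != dict.end()) {
                weight = it->second;
            } else if (j == i + 1) {
                weight = unknownLogProb;
            } else {
                continue;  // not a word: no edge from i to j
            }
            if (weight + best[j] > best[i]) {
                best[i] = weight + best[j];
                next[i] = j;
            }
        }
    }
    // Follow the recorded edges from the start to read off the words.
    std::vector<std::u16string> words;
    for (size_t i = 0; i < n; i = next[i]) {
        words.push_back(text.substr(i, next[i] - i));
    }
    return words;
}
```

Filling the table from the end of the sentence backwards means every suffix is solved once, so the cost is bounded by the number of DAG edges rather than the number of possible segmentations.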
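 ### TransCode module
 TransCode.cpp/TransCode.h convert between encodings: both utf8 and gbk input are converted to `uint16_t` values, and the reverse conversion is handled here as well.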
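For the utf8 side, the conversion to `uint16_t` and back might look like the sketch below. The function names `utf8ToUnicode` and `unicodeToUtf8` are hypothetical, not taken from TransCode.h. Because the target is 16 bits wide, the sketch handles only 1- to 3-byte sequences and rejects 4-byte ones; it does not check for overlong encodings, and gbk is not covered.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decodes UTF-8 into 16-bit code points. Returns false on truncated or
// malformed input and on 4-byte sequences, which do not fit in uint16_t.
bool utf8ToUnicode(const std::string& in, std::vector<uint16_t>& out) {
    out.clear();
    const size_t n = in.size();
    for (size_t i = 0; i < n;) {
        const unsigned char c0 = static_cast<unsigned char>(in[i]);
        if (c0 < 0x80) {                          // 0xxxxxxx
            out.push_back(c0);
            i += 1;
        } else if ((c0 & 0xE0) == 0xC0) {         // 110xxxxx 10xxxxxx
            if (i + 1 >= n) return false;
            const unsigned char c1 = static_cast<unsigned char>(in[i + 1]);
            if ((c1 & 0xC0) != 0x80) return false;
            out.push_back(static_cast<uint16_t>(((c0 & 0x1F) << 6) | (c1 & 0x3F)));
            i += 2;
        } else if ((c0 & 0xF0) == 0xE0) {         // 1110xxxx 10xxxxxx 10xxxxxx
            if (i + 2 >= n) return false;
            const unsigned char c1 = static_cast<unsigned char>(in[i + 1]);
            const unsigned char c2 = static_cast<unsigned char>(in[i + 2]);
            if ((c1 & 0xC0) != 0x80 || (c2 & 0xC0) != 0x80) return false;
            out.push_back(static_cast<uint16_t>(
                ((c0 & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F)));
            i += 3;
        } else {
            return false;
        }
    }
    return true;
}

// Reverse conversion: encodes 16-bit code points as UTF-8.
std::string unicodeToUtf8(const std::vector<uint16_t>& in) {
    std::string out;
    for (const uint16_t u : in) {
        if (u < 0x80) {
            out += static_cast<char>(u);
        } else if (u < 0x800) {
            out += static_cast<char>(0xC0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}
```

Working on fixed-width units lets the segmenters index characters directly instead of tracking variable-length byte sequences.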
 ### HMMSegment module
 HMMSegment.cpp/HMMSegment.h
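 segment text with a hidden Markov model (HMM). The main idea is to represent the hidden state of each character with one of four states (B, E, M, S).
 The HMM model is provided by `hmm_model.utf8` under dicts/.
 Segmentation uses the Viterbi algorithm.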
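A rough sketch of Viterbi decoding over the four states follows. Everything here is illustrative: `HmmModel`, `makeToyModel`, `viterbi`, and `segmentWithHmm` are hypothetical names, the model is a hand-built toy rather than the contents of the `hmm_model` file, and B, E, M, S are read as Begin, End, Middle, and Single, the usual convention.

```cpp
#include <array>
#include <cassert>
#include <limits>
#include <map>
#include <string>
#include <vector>

enum State { B = 0, E = 1, M = 2, S = 3 };  // Begin, End, Middle, Single
const double kNegInf = -std::numeric_limits<double>::infinity();

// Hypothetical in-memory model holding log-probabilities.
struct HmmModel {
    std::array<double, 4> start;
    std::array<std::array<double, 4>, 4> trans;      // trans[from][to]
    std::array<std::map<char16_t, double>, 4> emit;  // emit[state][character]
    double unseenEmit;                               // fallback for characters not in emit
};

// Toy model: only the transitions that B/E/M/S allow are finite, all equally likely.
HmmModel makeToyModel() {
    HmmModel m;
    m.start = {-1.0, kNegInf, kNegInf, -1.0};  // a sentence starts with B or S
    for (auto& row : m.trans) row.fill(kNegInf);
    m.trans[B][E] = m.trans[B][M] = -1.0;
    m.trans[M][E] = m.trans[M][M] = -1.0;
    m.trans[E][B] = m.trans[E][S] = -1.0;
    m.trans[S][B] = m.trans[S][S] = -1.0;
    m.unseenEmit = -5.0;
    return m;
}

// Returns the most likely state sequence for `obs`.
std::vector<int> viterbi(const std::u16string& obs, const HmmModel& m) {
    const size_t n = obs.size();
    if (n == 0) return {};
    auto emitLog = [&](int s, char16_t c) {
        const auto it = m.emit[s].find(c);
        return it == m.emit[s].end() ? m.unseenEmit : it->second;
    };
    std::vector<std::array<double, 4>> score(n);  // best log-score ending in each state
    std::vector<std::array<int, 4>> back(n);      // predecessor state on that best path
    for (int s = 0; s < 4; ++s) {
        score[0][s] = m.start[s] + emitLog(s, obs[0]);
        back[0][s] = 0;
    }
    for (size_t t = 1; t < n; ++t) {
        for (int s = 0; s < 4; ++s) {
            int arg = 0;
            double best = kNegInf;
            for (int p = 0; p < 4; ++p) {
                const double v = score[t - 1][p] + m.trans[p][s];
                if (v > best) { best = v; arg = p; }
            }
            score[t][s] = best + emitLog(s, obs[t]);
            back[t][s] = arg;
        }
    }
    // A sentence must end on a complete word, so the last state is E or S.
    int cur = score[n - 1][E] >= score[n - 1][S] ? E : S;
    std::vector<int> path(n);
    for (size_t t = n; t-- > 0;) {
        path[t] = cur;
        cur = back[t][cur];
    }
    return path;
}

// Cuts the text after every E or S state.
std::vector<std::u16string> segmentWithHmm(const std::u16string& text, const HmmModel& m) {
    std::vector<std::u16string> words;
    const std::vector<int> path = viterbi(text, m);
    size_t begin = 0;
    for (size_t i = 0; i < path.size(); ++i) {
        if (path[i] == E || path[i] == S) {
            words.push_back(text.substr(begin, i + 1 - begin));
            begin = i + 1;
        }
    }
    return words;
}
```

Because the states depend only on the characters and not on a word list, this kind of decoder can propose words that no dictionary contains.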
 ## Demo
 ### MPSegment's demo
 __This part has been proven in production and has run stably; no bugs have been found so far.__
 ```
 cd ./demo;
 make;
-./segment_demo testlines.utf8
+./segment_demo testlines.gbk
 ```
 Output:
@@ -53,7 +31,7 @@ Output:
 ```
 cd ./demo;
 make;
-./segment_demo testlines.utf8 --modelpath ../dicts/hmm_model.utf8 --algorithm cutHMM
+./segment_demo testlines.gbk --modelpath ../dicts/hmm_model.gbk --algorithm cutHMM
 ```
 Output:
@@ -70,7 +48,7 @@ Output:
 ```
 cd ./demo;
 make;
-./segment_demo testlines.utf8 --algorithm cutMix
+./segment_demo testlines.gbk --algorithm cutMix
 ```
 Output:
@@ -90,57 +68,7 @@ Output:
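 The outputs above are, in order, the results of the MP, HMM, and Mix methods.  
 Mix, the algorithm that combines MP and HMM, clearly gives the best result: it accurately cuts out words that are already in the dictionary and also cuts out out-of-vocabulary words such as "杭研".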
-## Help
+## For everything else, see README.md on the `master` branch
 This project consists mainly of the following directories:
 ### cppcommon 
 Mainly utility functions, such as string operations.    
 Running make produces libcm.a.    
 To use libcm.a, just add the following to your code:  
 ```cpp
 #include "cppcommon/headers.h"  
 using namespace CPPCOMMON;  
 ``` 
 Then pass `-Lcppcommon -lcm` when linking.  
 __For detailed usage, see the code under the demo/ directory.__  
 ### cppjieba
 The core directory, containing the main source code.
 Running make produces libcppjieb.a
 Use it the same way as cppcommon above.
 ### run `./segment_demo` to get help.
 As follows:
 ```
 usage:
        ./segment_demo[options] <filename>
 options:
        --algorithm     Supported encoding methods are [cutDAG, cutHMM, cutMix] for now.
                        If not specified, the default is cutDAG
        --dictpath      If not specified, the default is ../dicts/jieba.dict.utf8
        --modelpath     If not specified, the default is ../dicts/hmm_model.utf8
        --encoding      Supported encoding methods are [gbk, utf-8] for now.
                        If not specified, the default is utf8.
 example:
        ./segment_demo testlines.utf8 --encoding utf-8 --dictpath ../dicts/jieba.dict.utf8
        ./segment_demo testlines.utf8 --modelpath ../dicts/hmm_model.utf8 --algorithm cutHMM
        ./segment_demo testlines.utf8 --modelpath ../dicts/hmm_model.utf8 --algorithm cutMix
        ./segment_demo testlines.gbk --encoding gbk --dictpath ../dicts/jieba.dict.gbk
 ```
 ## Segmentation speed
 ### MixSegment
 Segmentation speed is about 65M / 78sec = 0.83M/sec
 Test environment: `Intel(R) Xeon(R) CPU  E5506  @ 2.13GHz`
 ## Contact