Merge pull request #193 from gumblex/jieba3k
jieba3k: corresponding update for #192
Sun Junyi 2014-10-25 15:29:49 +08:00
commit 4cb1924d09
13 changed files with 601 additions and 225 deletions
@ -2,7 +2,6 @@
1. Performance improvement: the dictionary structure was changed from a Trie to a prefix set, reducing memory usage by 2/3. See https://github.com/fxsjy/jieba/pull/187; by @gumblex
2. Fixed a performance problem in keyword extraction
2014-08-31: version 0.33
1. Support custom stop words; by @fukuball
2. Support custom IDF dictionaries; by @fukuball
README.md
@ -4,24 +4,20 @@ jieba
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
- _Scroll down for English documentation._
Note!
========
The `jieba3k` branch is dedicated to Python 3.x.
Features
========
* Three segmentation modes are supported:
    * Accurate mode, which tries to cut the sentence into the most precise segmentation; suitable for text analysis.
    * Full mode, which scans out every word that could possibly be formed from the sentence; very fast, but it cannot resolve ambiguity.
    * Search-engine mode, which, on top of accurate mode, cuts long words again to improve recall; suitable for search-engine segmentation.
* Traditional Chinese segmentation is supported.
* Custom dictionaries are supported.

Online Demo
=========
http://jiebademo.ap01.aws.af.cm/
@ -31,115 +27,102 @@ http://jiebademo.ap01.aws.af.cm/
(Site source code: https://github.com/fxsjy/jiebademo)
Installation
=======
Python 2.x
-----------
* Fully automatic installation: `easy_install jieba` or `pip install jieba`
* Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ , extract it, then run `python setup.py install`
* Manual installation: place the jieba directory in the current directory or in the site-packages directory
* Import it with `import jieba`
Python 3.x
-----------
* The master branch currently supports Python 2.x only
* The Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k
```shell
git clone https://github.com/fxsjy/jieba.git
git checkout jieba3k
python setup.py install
```
* Or install with pip3: `pip3 install jieba3k`
Algorithm
========
* A directed acyclic graph (DAG) of all possible word combinations is built by efficient word-graph scanning based on a prefix dictionary (see the illustrative sketch below)
* Dynamic programming is used to find the maximum-probability path, i.e. the most likely segmentation based on word frequency
* For unknown words, an HMM model based on the word-forming capability of Chinese characters is used, decoded with the Viterbi algorithm
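A minimal sketch of what the word-graph scan produces. `jieba.get_DAG` is an internal helper (it also appears in the posseg changes of this commit) rather than documented API, so treat this only as an illustration of the DAG:

```python
import jieba

jieba.initialize()
sentence = "我来到北京清华大学"
# Maps each character index to the end indexes of every dictionary word starting there.
print(jieba.get_DAG(sentence))
```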
Main Functions
=======
1) : Segmentation
--------
* The `jieba.cut` method takes three input parameters: the string to be segmented; `cut_all`, which controls whether full mode is used; and `HMM`, which controls whether the HMM model is used
* The `jieba.cut_for_search` method takes two parameters: the string to be segmented, and whether to use the HMM model. It is suitable for building inverted indexes for search engines, because its granularity is finer
* Note: the string to be segmented can be a GBK string, a UTF-8 string or unicode
* Both `jieba.cut` and `jieba.cut_for_search` return an iterable generator: use a for loop to get each word (unicode), or convert the result with `list(jieba.cut(...))`
Code example (segmentation)

```python
#encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode:", "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode:", "/ ".join(seg_list))  # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search-engine mode
print(", ".join(seg_list))
```
Output:

[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Accurate Mode]: 我/ 来到/ 北京/ 清华大学

[New Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦    (here "杭研" is not in the dictionary, but it is recognized by the Viterbi algorithm)

[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
2) : Add a custom dictionary
----------------
* Developers can specify their own custom dictionary so that words missing from the jieba dictionary are included. Although jieba can recognize new words, adding them yourself guarantees higher accuracy
* Usage: `jieba.load_userdict(file_name)`  # file_name is the path of the custom dictionary
* The dictionary format is the same as that of `dict.txt`: one word per line, each line consisting of three space-separated parts: the word, its frequency, and its part of speech (optional)
* Example (see also the sketch after this list):
    * Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
    * Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py
        * Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
        * After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
* "Improve disambiguation with a user-defined dictionary" --- https://github.com/fxsjy/jieba/issues/14
3) : Keyword extraction
-------------
* `jieba.analyse.extract_tags(sentence, topK, withWeight)`  # you need to `import jieba.analyse` first
* `sentence`: the text to extract keywords from
* `topK`: how many of the keywords with the highest TF-IDF weights to return; the default is 20
* `withWeight`: whether to also return the keyword weights; the default is False

Code example (keyword extraction)

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
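A minimal inline sketch in the spirit of the linked test script; the sample sentence is borrowed from the TextRank example further below:

```python
import jieba.analyse

s = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元,增资后,吉林欧亚置业注册资本由7000万元增加到5亿元。"
# Top 10 keywords ranked by TF-IDF weight.
print(",".join(jieba.analyse.extract_tags(s, topK=10)))
```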
The inverse document frequency (IDF) corpus used for keyword extraction can be switched to the path of a custom corpus
@ -153,44 +136,78 @@ Output:
* Custom corpus example: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
* Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py

Example of returning the keyword weights together with the keywords

* Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_with_weight.py
#### Keyword extraction based on the TextRank algorithm

Algorithm paper: [TextRank: Bringing Order into Texts](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)

##### Basic idea:

1. Segment the text from which keywords are to be extracted
2. Build a graph from word co-occurrence relations within a fixed window (5 here; adjust as needed)
3. Compute PageRank over the nodes of the graph (note that it is an undirected, weighted graph)

##### Basic usage:

    jieba.analyse.textrank(raw_text)

##### Sample output:

Sample output from `__main__`:
```
吉林 100.0
欧亚 86.4592606421
置业 55.3262889963
实现 52.0353476663
收入 37.9475518129
增资 35.5042189944
子公司 34.9286032861
全资 30.8154823412
城市 30.6031961172
商业 30.4779050167
```
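According to the `textrank.py` implementation added in this commit, the function returns a list of `(word, weight)` pairs; a minimal sketch of consuming it:

```python
import jieba.analyse

s = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元,增资后,吉林欧亚置业注册资本由7000万元增加到5亿元。"
for word, weight in jieba.analyse.textrank(s):
    print(word, weight)
```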
4) : Part-of-speech tagging
-----------
* Tags the part of speech of every word after segmentation, using tags compatible with ictclas
* Usage example
```pycon
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for w in words:
... print(w.word, w.flag)
...
我 r
爱 v
北京 ns
天安门 ns
```
5) : Parallel segmentation
-----------
* How it works: split the target text by line, distribute the lines to multiple Python processes for parallel segmentation, then merge the results, which gives a considerable speedup
* Based on Python's built-in multiprocessing module; Windows is not supported yet
* Usage:
    * `jieba.enable_parallel(4)`  # enable parallel mode; the argument is the number of processes
    * `jieba.disable_parallel()`  # disable parallel mode
* Example (a condensed sketch follows below): https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
* Result: on a 4-core 3.4 GHz Linux machine, accurate-mode segmentation of the complete works of Jin Yong reached a speed of 1 MB/s, 3.3 times the single-process version.
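A condensed sketch along the lines of test/parallel/test_file.py; the input file name is illustrative:

```python
import time
import jieba

jieba.enable_parallel(4)  # four worker processes

content = open("news.txt", "rb").read().decode("utf-8", "ignore")  # illustrative input file
t1 = time.time()
words = "/ ".join(jieba.cut(content))
print("speed: %.2f characters/second" % (len(content) / (time.time() - t1)))

jieba.disable_parallel()
```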
6) : Tokenize: return words together with their start/end positions in the original text
----------------------------------
* Note: the input only accepts unicode
* Default mode
```python
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
```
```
@ -204,9 +221,9 @@ word 有限公司 start: 6 end:10
* Search mode
```python
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
```
```
@ -219,11 +236,80 @@ word 有限公司 start: 6 end:10
```
7) : ChineseAnalyzer for the Whoosh search engine
--------------------------------------------
* Import: `from jieba.analyse import ChineseAnalyzer`
* Usage example (a condensed sketch follows below): https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
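A condensed sketch in the spirit of test/test_whoosh.py; the index directory `tmp` and the sample document are illustrative:

```python
import os
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser
from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()
schema = Schema(title=ID(stored=True), content=TEXT(stored=True, analyzer=analyzer))

os.makedirs("tmp", exist_ok=True)  # create_in needs an existing directory
ix = create_in("tmp", schema)

writer = ix.writer()
writer.add_document(title="line1", content="我们的交换机出现了故障")
writer.commit()

with ix.searcher() as searcher:
    parser = QueryParser("content", schema=ix.schema)
    for hit in searcher.search(parser.parse("交换机")):
        print(hit.highlights("content"))
```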
8) : Command-line segmentation
-------------------
Usage example: `cat news.txt | python -m jieba > cut_result.txt`
The output of `--help`:
$> python -m jieba --help
usage: python -m jieba [options] filename
Jieba command line interface.
positional arguments:
filename input file
optional arguments:
-h, --help show this help message and exit
-d [DELIM], --delimiter [DELIM]
use DELIM instead of ' / ' for word delimiter; or a
space if it is used without DELIM
-D DICT, --dict DICT use DICT as dictionary
-u USER_DICT, --user-dict USER_DICT
use USER_DICT together with the default dictionary or
DICT (if specified)
-a, --cut-all full pattern cutting
-n, --no-hmm don't use the Hidden Markov Model
-q, --quiet don't print loading messages to stderr
-V, --version show program's version number and exit
If no filename specified, use STDIN instead.
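An illustrative invocation combining several of the options above (the input file name news.txt is just an example):

```shell
# Full-mode segmentation of news.txt, words joined by ' | ', loading messages suppressed.
python -m jieba -a -q -d ' | ' news.txt > cut_result.txt
```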
Change of module initialization: lazy loading (since version 0.28)
-------------------------------------------
jieba uses lazy loading: `import jieba` does not immediately load the dictionary; the dictionary is loaded and the prefix dictionary built only when needed. If you want to initialize jieba manually, you can call:

    import jieba
    jieba.initialize()  # manual initialization (optional)

Versions before 0.28 could not specify the path of the main dictionary; with lazy loading, you can now change it:

    jieba.set_dictionary('data/dict.txt.big')

Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py
Other Dictionaries
========
@ -233,47 +319,55 @@ https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
2. A dictionary with better support for traditional-Chinese segmentation:
https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download whichever dictionary you need and overwrite jieba/dict.txt with it, or use `jieba.set_dictionary('data/dict.txt.big')`.
Implementations in Other Languages
==========
Jieba, Java version
----------------
Author: piaolingxue
Repository: https://github.com/huaban/jieba-analysis

Jieba, C++ version
----------------
Author: Aszxqw
Repository: https://github.com/aszxqw/cppjieba

Jieba, Node.js version
----------------
Author: Aszxqw
Repository: https://github.com/aszxqw/nodejieba

Jieba, Erlang version
----------------
Author: falood
Repository: https://github.com/falood/exjieba
System Integration
========
1. Solr: https://github.com/sing1ee/jieba-solr
Segmentation Speed
=========
* 1.5 MB / Second in Full Mode
* 400 KB / Second in Default Mode
* Test environment: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz; 《围城》.txt
FAQ
=========
1. How is the model data generated? https://github.com/fxsjy/jieba/issues/7
2. What is the license of this library? https://github.com/fxsjy/jieba/issues/2
* For more questions, see: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed

Change Log
==========
https://github.com/fxsjy/jieba/blob/master/Changelog
--------------------
jieba
========
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
@ -281,114 +375,207 @@ jieba
Features
========
* Support three types of segmentation mode:
* 1) Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
* 2) Full Mode gets all the possible words from the sentence. Fast but not accurate.
* 3) Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
Usage
========
* Fully automatic installation: `easy_install jieba` or `pip install jieba`
* Semi-automatic installation: Download http://pypi.python.org/pypi/jieba/ , run `python setup.py install` after extracting.
* Manual installation: place the `jieba` directory in the current directory or python `site-packages` directory.
* `import jieba`.
Algorithm
========
* Based on a prefix dictionary structure to achieve efficient word graph scanning. Build a directed acyclic graph (DAG) for all possible word combinations.
* Use dynamic programming to find the most probable combination based on the word frequency.
* For unknown words, a HMM-based model is used with the Viterbi algorithm.
Main Functions
==============
1) : Cut
--------
* The `jieba.cut` function accepts three input parameters: the first parameter is the string to be cut; the second parameter is `cut_all`, controlling the cut mode; the third parameter is to control whether to use the Hidden Markov Model.
* `jieba.cut` returns a generator, from which you can use a `for` loop to get the segmentation result (in unicode), or `list(jieba.cut( ... ))` to create a list.
* `jieba.cut_for_search` accepts two parameters: the string to be cut and whether to use the Hidden Markov Model. This will cut the sentence into short words suitable for search engines.
**Code example: segmentation**

```python
#encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode:", "/ ".join(seg_list))  # 全模式

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode:", "/ ".join(seg_list))  # 精确模式

seg_list = jieba.cut("他来到了网易杭研大厦")  # 默认是精确模式
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # 搜索引擎模式
print(", ".join(seg_list))
```
Output:
[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Accurate Mode]: 我/ 来到/ 北京/ 清华大学

[Unknown Words Recognize] 他, 来到, 了, 网易, 杭研, 大厦 (In this case, "杭研" is not in the dictionary, but is identified by the Viterbi algorithm)

[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
2) : Add a custom dictionary
----------------------------
* Developers can specify their own custom dictionary to be included in the jieba default dictionary. Jieba is able to identify new words, but adding your own new words can ensure a higher accuracy.
* Usage `jieba.load_userdict(file_name) # file_name is the path of the custom dictionary`
* The dictionary format is the same as that of `dict.txt`: one word per line; each line is divided into three parts separated by a space: the word, the word frequency, and the part of speech (optional)
* Example
云计算 5
李小福 2
创新办 3

[Before] 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /

[After]: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
3) : Keyword Extraction
-----------------------
* `jieba.analyse.extract_tags(sentence, topK, withWeight)`  # needs to first `import jieba.analyse`
* `sentence`: the text to be extracted
* `topK`: return how many keywords with the highest TF/IDF weights. The default value is 20
* `withWeight`: whether to return the TF/IDF weights together with the keywords. The default value is False
Example (keyword extraction)

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
Developers can specify their own custom IDF corpus in jieba keyword extraction
* Usage `jieba.analyse.set_idf_path(file_name) # file_name is a custom corpus path`
* Usage `jieba.analyse.set_idf_path(file_name) # file_name is the path for the custom corpus`
* Custom Corpus Samplehttps://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
* Sample Codehttps://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
Developers can specify their own custom stop words corpus in jieba keyword extraction
* Usage `jieba.analyse.set_stop_words(file_name) # file_name is a custom corpus path`
* Usage `jieba.analyse.set_stop_words(file_name) # file_name is the path for the custom corpus`
* Custom Corpus Samplehttps://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
* Sample Codehttps://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
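A minimal sketch combining the two setters above; both file paths are illustrative and should point at your own corpora:

```python
import jieba.analyse

jieba.analyse.set_idf_path("extra_dict/idf.txt.big")       # custom IDF corpus
jieba.analyse.set_stop_words("extra_dict/stop_words.txt")  # custom stop words

text = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元。"
print(",".join(jieba.analyse.extract_tags(text, topK=10)))
```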
There's also a [TextRank](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) implementation available.

Use: `jieba.analyse.textrank(raw_text)`.

4) : Part of Speech Tagging
-----------
* Tags the POS of each word after segmentation, using labels compatible with ictclas.
* Example:

```pycon
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for w in words:
...     print(w.word, w.flag)
...
我 r
爱 v
北京 ns
天安门 ns
```
5) : Parallel Processing
-----------
* Principle: Split target text by line, assign the lines into multiple Python processes, and then merge the results, which is considerably faster.
* Based on the multiprocessing module of Python.
* Usage:
* `jieba.enable_parallel(4)` # Enable parallel processing. The parameter is the number of processes.
* `jieba.disable_parallel()` # Disable parallel processing.
* Example:
https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
* Result: On a four-core 3.4GHz Linux machine, do accurate word segmentation on Complete Works of Jin Yong, and the speed reaches 1MB/s, which is 3.3 times faster than the single-process version.
6) : Tokenize: return words with position
----------------------------------
* The input must be unicode
* Default mode
```python
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
```
```
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限公司 start: 6 end:10
```
* Search mode
```python
result = jieba.tokenize(u'永和服装饰品有限公司',mode='search')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
```
```
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限 start: 6 end:8
word 公司 start: 8 end:10
word 有限公司 start: 6 end:10
```
7) : ChineseAnalyzer for Whoosh
--------------------------------------------
* `from jieba.analyse import ChineseAnalyzer`
* Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
8) : Command Line Interface
-------------------
$> python -m jieba --help
usage: python -m jieba [options] filename
Jieba command line interface.
positional arguments:
filename input file
optional arguments:
-h, --help show this help message and exit
-d [DELIM], --delimiter [DELIM]
use DELIM instead of ' / ' for word delimiter; or a
space if it is used without DELIM
-D DICT, --dict DICT use DICT as dictionary
-u USER_DICT, --user-dict USER_DICT
use USER_DICT together with the default dictionary or
DICT (if specified)
-a, --cut-all full pattern cutting
-n, --no-hmm don't use the Hidden Markov Model
-q, --quiet don't print loading messages to stderr
-V, --version show program's version number and exit
If no filename specified, use STDIN instead.
Initialization
---------------
By default, Jieba doesn't build the prefix dictionary unless it's necessary. This takes 1-3 seconds the first time, after which it is not initialized again. If you want to initialize Jieba manually, you can call:
import jieba
jieba.initialize() # (optional)
@ -397,6 +584,21 @@ You can also specify the dictionary (not supported before version 0.28) :
jieba.set_dictionary('data/dict.txt.big')
Using Other Dictionaries
========
It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:
1. A smaller dictionary for a smaller memory footprint:
https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
2. There is also a bigger dictionary that has better support for traditional Chinese (繁體):
https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
By default, an in-between dictionary is used, called `dict.txt` and included in the distribution.
In either case, download the file you want, and then call `jieba.set_dictionary('data/dict.txt.big')` or just replace the existing `dict.txt`.
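A minimal sketch of switching dictionaries, assuming `dict.txt.big` has already been downloaded into an `extra_dict/` directory (the path is illustrative):

```python
import jieba

jieba.set_dictionary('extra_dict/dict.txt.big')  # point jieba at the bigger dictionary
jieba.initialize()  # optional: load it now instead of lazily

print("/ ".join(jieba.cut("他来到了网易杭研大厦")))
```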
Segmentation speed
=========
* 1.5 MB / Second in Full Mode
@ -6,7 +6,10 @@ from argparse import ArgumentParser
parser = ArgumentParser(usage="%s -m jieba [options] filename" % sys.executable, description="Jieba command line interface.", epilog="If no filename specified, use STDIN instead.")
parser.add_argument("-d", "--delimiter", metavar="DELIM", default=' / ',
nargs='?', const=' ',
help="use DELIM instead of ' / ' for word delimiter; use a space if it is without DELIM")
help="use DELIM instead of ' / ' for word delimiter; or a space if it is used without DELIM")
parser.add_argument("-D", "--dict", help="use DICT as dictionary")
parser.add_argument("-u", "--user-dict",
help="use USER_DICT together with the default dictionary or DICT (if specified)")
parser.add_argument("-a", "--cut-all",
action="store_true", dest="cutall", default=False,
help="full pattern cutting")
@ -14,23 +17,30 @@ parser.add_argument("-n", "--no-hmm", dest="hmm", action="store_false",
default=True, help="don't use the Hidden Markov Model")
parser.add_argument("-q", "--quiet", action="store_true", default=False,
help="don't print loading messages to stderr")
parser.add_argument("-V", '--version', action='version', version="Jieba " + jieba.__version__)
parser.add_argument("-V", '--version', action='version',
version="Jieba " + jieba.__version__)
parser.add_argument("filename", nargs='?', help="input file")
args = parser.parse_args()
if args.quiet:
jieba.setLogLevel(60)
delim = str(args.delimiter)
cutall = args.cutall
hmm = args.hmm
fp = open(args.filename, 'r') if args.filename else sys.stdin
if args.dict:
jieba.initialize(args.dict)
else:
jieba.initialize()
if args.user_dict:
jieba.load_userdict(args.user_dict)
ln = fp.readline()
while ln:
l = ln.rstrip('\r\n')
print(delim.join(jieba.cut(ln.rstrip('\r\n'), cutall, hmm)))
ln = fp.readline()
fp.close()
@ -5,6 +5,7 @@ try:
from .analyzer import ChineseAnalyzer
except ImportError:
pass
from .textrank import textrank
_curpath = os.path.normpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
abs_path = os.path.join(_curpath, "idf.txt")
@ -58,7 +59,7 @@ def set_stop_words(stop_words_path):
for line in lines:
STOP_WORDS.add(line)
def extract_tags(sentence, topK=20, withWeight=False):
global STOP_WORDS
idf_freq, median_idf = idf_loader.get_idf()
@ -77,6 +78,9 @@ def extract_tags(sentence, topK=20):
tf_idf_list = [(v*idf_freq.get(k,median_idf), k) for k,v in freq]
st_list = sorted(tf_idf_list, reverse=True)
if withWeight:
tags = st_list[:topK]
else:
top_tuples = st_list[:topK]
tags = [a[1] for a in top_tuples]
return tags
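Note that, as implemented above, `withWeight=True` returns the raw `(weight, word)` tuples from `st_list` rather than bare words. A minimal sketch of consuming them (the sample sentence is illustrative):

```python
import jieba.analyse

text = "小明硕士毕业于中国科学院计算所,后在日本京都大学深造"
for weight, word in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print("%s\t%f" % (word, weight))
```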
@ -1,6 +1,6 @@
#encoding=utf-8
from whoosh.analysis import RegexAnalyzer,LowercaseFilter,StopFilter,StemFilter
from whoosh.analysis import Tokenizer,Token
from whoosh.lang.porter import stem
import jieba
jieba/analyse/textrank.py Normal file
@ -0,0 +1,74 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import jieba.posseg as pseg
import collections
import sys
class UndirectWeightedGraph:
d = 0.85
def __init__(self):
self.graph = collections.defaultdict(list)
def addEdge(self, start, end, weight):
# use a tuple (start, end, weight) instead of a Edge object
self.graph[start].append((start, end, weight))
self.graph[end].append((end, start, weight))
def rank(self):
ws = collections.defaultdict(float)
outSum = collections.defaultdict(float)
wsdef = 1.0 / len(self.graph)
for n, out in self.graph.items():
ws[n] = wsdef
outSum[n] = sum((e[2] for e in out), 0.0)
for x in range(10): # 10 iters
for n, inedges in self.graph.items():
s = 0
for e in inedges:
s += e[2] / outSum[e[1]] * ws[e[1]]
ws[n] = (1 - self.d) + self.d * s
(min_rank, max_rank) = (sys.float_info[0], sys.float_info[3])
for w in ws.values():
if w < min_rank:
min_rank = w
elif w > max_rank:
max_rank = w
for n, w in ws.items():
ws[n] = (w - min_rank / 10.0) / (max_rank - min_rank / 10.0) * 100
return ws
def textrank(raw, topk=10):
pos_filt = frozenset(('ns', 'n', 'vn', 'v'))
g = UndirectWeightedGraph()
cm = collections.defaultdict(int)
span = 5
words = [x for x in pseg.cut(raw)]
for i in range(len(words)):
if words[i].flag in pos_filt:
for j in range(i + 1, i + span):
if j >= len(words):
break
if words[j].flag not in pos_filt:
continue
cm[(words[i].word, words[j].word)] += 1
for terms, w in cm.items():
g.addEdge(terms[0], terms[1], w)
nodes_rank = g.rank()
nrs = sorted(nodes_rank.items(), key=lambda x: x[1], reverse=True)
return nrs[:topk]
if __name__ == '__main__':
s = "此外公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元增资后吉林欧亚置业注册资本由7000万元增加到5亿元。吉林欧亚置业主要经营范围为房地产开发及百货零售等业务。目前在建吉林欧亚城市商业综合体项目。2013年实现营业收入0万元实现净利润-139.13万元。"
for x, w in textrank(s):
print(x, w)
@ -14,32 +14,33 @@ PROB_EMIT_P = "prob_emit.p"
CHAR_STATE_TAB_P = "char_state_tab.p"
def load_model(f_name, isJython=True):
_curpath = os.path.normpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
result = {}
with open(f_name, "rb") as f:
for line in f:
line = line.strip()
if not line:
continue
line = line.decode("utf-8")
word, _, tag = line.split(" ")
result[word] = tag
f.closed
if not isJython:
return result
start_p = {}
abs_path = os.path.join(_curpath, PROB_START_P)
with open(abs_path, mode='rb') as f:
start_p = marshal.load(f)
f.closed
trans_p = {}
abs_path = os.path.join(_curpath, PROB_TRANS_P)
with open(abs_path, 'rb') as f:
trans_p = marshal.load(f)
f.closed
emit_p = {}
abs_path = os.path.join(_curpath, PROB_EMIT_P)
with open(abs_path, 'rb') as f:
@ -62,14 +63,14 @@ else:
word_tag_tab = load_model(jieba.get_abs_path_dict(), isJython=False)
def makesure_userdict_loaded(fn):
@wraps(fn)
def wrapped(*args,**kwargs):
if jieba.user_word_tag_tab:
word_tag_tab.update(jieba.user_word_tag_tab)
jieba.user_word_tag_tab = {}
return fn(*args,**kwargs)
return wrapped
class pair(object):
@ -152,7 +153,7 @@ def __cut_DAG_NO_HMM(sentence):
def __cut_DAG(sentence):
DAG = jieba.get_DAG(sentence)
route = {}
jieba.calc(sentence,DAG,0,route=route)
x = 0
@ -3,8 +3,7 @@ MIN_FLOAT = -3.14e100
MIN_INF = float("-inf")
def get_top_states(t_state_v, K=4):
topK = sorted(t_state_v.items(), key=operator.itemgetter(1), reverse=True)[:K]
return [x[0] for x in topK]
def viterbi(obs, states, start_p, trans_p, emit_p):
@ -1,11 +1,11 @@
from distutils.core import setup
setup(name='jieba3k',
      version='0.34',
      description='Chinese Words Segementation Utilities',
      author='Sun, Junyi',
      author_email='ccnusjy@gmail.com',
      url='http://github.com/fxsjy',
      packages=['jieba'],
      package_dir={'jieba':'jieba'},
      package_data={'jieba':['*.*','finalseg/*','analyse/*','posseg/*']}
)
@ -0,0 +1,43 @@
import sys
sys.path.append('../')
import jieba
import jieba.analyse
from optparse import OptionParser
USAGE = "usage: python extract_tags_with_weight.py [file name] -k [top k] -w [with weight=1 or 0]"
parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
parser.add_option("-w", dest="withWeight")
opt, args = parser.parse_args()
if len(args) < 1:
print(USAGE)
sys.exit(1)
file_name = args[0]
if opt.topK is None:
topK = 10
else:
topK = int(opt.topK)
if opt.withWeight is None:
withWeight = False
else:
    if int(opt.withWeight) == 1:
withWeight = True
else:
withWeight = False
content = open(file_name, 'rb').read()
tags = jieba.analyse.extract_tags(content, topK=topK, withWeight=withWeight)
if withWeight is True:
for tag in tags:
print("tag: %s\t\t weight: %f" % (tag[1],tag[0]))
else:
print(",".join(tags))
test/lyric.txt Normal file
@ -0,0 +1,44 @@
我沒有心
我沒有真實的自我
我只有消瘦的臉孔
所謂軟弱
所謂的順從一向是我
的座右銘
而我
沒有那海洋的寬闊
我只要熱情的撫摸
所謂空洞
所謂不安全感是我
的墓誌銘
而你
是否和我一般怯懦
是否和我一般矯作
和我一般囉唆
而你
是否和我一般退縮
是否和我一般肌迫
一般地困惑
我沒有力
我沒有滿腔的熱火
我只有滿肚的如果
所謂勇氣
所謂的認同感是我
隨便說說
而你
是否和我一般怯懦
是否和我一般矯作
是否對你來說
只是一場遊戲
雖然沒有把握
而你
是否和我一般退縮
是否和我一般肌迫
是否對你來說
只是逼不得已
雖然沒有藉口
@ -6,7 +6,7 @@ import jieba.posseg as pseg
def cuttest(test_sent):
result = pseg.cut(test_sent)
for w in result:
print(w.word, "/", w.flag, ", ", end=' ')
print("")
@ -6,7 +6,7 @@ from whoosh.index import create_in
from whoosh.fields import *
from whoosh.qparser import QueryParser
from jieba.analyse import ChineseAnalyzer
analyzer = ChineseAnalyzer()
@ -23,7 +23,7 @@ with open(file_name,"rb") as inf:
for line in inf:
i+=1
writer.add_document(
title="line"+str(i),
title="line"+str(i),
path="/a",
content=line.decode('gbk','ignore')
)
@ -36,6 +36,6 @@ for keyword in ("水果小姐","你","first","中文","交换机","交换"):
print("result of ",keyword)
q = parser.parse(keyword)
results = searcher.search(q)
for hit in results:
print(hit.highlights("content"))
print("="*10)