Mirror of https://github.com/fxsjy/jieba.git
wraps most globals in classes

API changes:
* class jieba.Tokenizer, jieba.posseg.POSTokenizer
* class jieba.analyse.TFIDF, jieba.analyse.TextRank
* global functions are mapped to jieba.(posseg.)dt, the default (POS)Tokenizer
* multiprocessing only works with jieba.(posseg.)dt
* new lcut, lcut_for_search functions that return a list
* jieba.analyse.textrank now returns 20 items by default

Tests:
* added test_lock.py to test multithread locking
* demo.py now contains most of the examples in the README
This commit is contained in:
parent
e359d08964
commit
94840a734c
145
README.md
@ -45,17 +45,19 @@ http://jiebademo.ap01.aws.af.cm/
|
|||||||
|
|
||||||
主要功能
|
主要功能
|
||||||
=======
|
=======
|
||||||
1) :分词
|
1. 分词
|
||||||
--------
|
--------
|
||||||
* `jieba.cut` 方法接受三个输入参数: 需要分词的字符串;cut_all 参数用来控制是否采用全模式;HMM 参数用来控制是否使用 HMM 模型
|
* `jieba.cut` 方法接受三个输入参数: 需要分词的字符串;cut_all 参数用来控制是否采用全模式;HMM 参数用来控制是否使用 HMM 模型
|
||||||
* `jieba.cut_for_search` 方法接受两个参数:需要分词的字符串;是否使用 HMM 模型。该方法适合用于搜索引擎构建倒排索引的分词,粒度比较细
|
* `jieba.cut_for_search` 方法接受两个参数:需要分词的字符串;是否使用 HMM 模型。该方法适合用于搜索引擎构建倒排索引的分词,粒度比较细
|
||||||
* 待分词的字符串可以是 unicode 或 UTF-8 字符串、GBK 字符串。注意:不建议直接输入 GBK 字符串,可能无法预料地错误解码成 UTF-8
|
* 待分词的字符串可以是 unicode 或 UTF-8 字符串、GBK 字符串。注意:不建议直接输入 GBK 字符串,可能无法预料地错误解码成 UTF-8
|
||||||
* `jieba.cut` 以及 `jieba.cut_for_search` 返回的结构都是一个可迭代的 generator,可以使用 for 循环来获得分词后得到的每一个词语(unicode),也可以用 list(jieba.cut(...)) 转化为 list
|
* `jieba.cut` 以及 `jieba.cut_for_search` 返回的结构都是一个可迭代的 generator,可以使用 for 循环来获得分词后得到的每一个词语(unicode),或者用
|
||||||
|
* `jieba.lcut` 以及 `jieba.lcut_for_search` 直接返回 list
|
||||||
|
* `jieba.Tokenizer(dictionary=DEFAULT_DICT)` 新建自定义分词器,可用于同时使用不同词典。`jieba.dt` 为默认分词器,所有全局分词相关函数都是该分词器的映射。
|
||||||
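A minimal sketch of the list-returning helpers and a custom `Tokenizer` added in this commit (the user dictionary path is a placeholder):

```python
# encoding=utf-8
import jieba

# lcut / lcut_for_search return plain lists instead of generators
print(jieba.lcut("我来到北京清华大学"))
print(jieba.lcut_for_search("小明硕士毕业于中国科学院计算所"))

# an independent tokenizer with its own dictionary (path is hypothetical)
# tk = jieba.Tokenizer(dictionary="test/userdict.txt")
# print(tk.lcut("我来到北京清华大学"))
```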
|
|
||||||
代码示例( 分词 )
|
代码示例
|
||||||
|
|
||||||
```python
|
```python
|
||||||
#encoding=utf-8
|
# encoding=utf-8
|
||||||
import jieba
|
import jieba
|
||||||
|
|
||||||
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
|
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
|
||||||
@ -81,7 +83,7 @@ print(", ".join(seg_list))
|
|||||||
|
|
||||||
【搜索引擎模式】: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
|
【搜索引擎模式】: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
|
||||||
|
|
||||||
2) :添加自定义词典
|
2. 添加自定义词典
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
### 载入词典
|
### 载入词典
|
||||||
@ -91,6 +93,8 @@ print(", ".join(seg_list))
|
|||||||
* 词典格式和`dict.txt`一样,一个词占一行;每一行分三部分,一部分为词语,另一部分为词频(可省略),最后为词性(可省略),用空格隔开
|
* 词典格式和`dict.txt`一样,一个词占一行;每一行分三部分,一部分为词语,另一部分为词频(可省略),最后为词性(可省略),用空格隔开
|
||||||
* 词频可省略,使用计算出的能保证分出该词的词频
|
* 词频可省略,使用计算出的能保证分出该词的词频
|
||||||
|
|
||||||
|
* 更改分词器的 tmp_dir 和 cache_file 属性,可指定缓存文件位置,用于受限的文件系统。
|
||||||
|
|
||||||
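A minimal sketch of redirecting the prefix-dict cache for a restricted file system (the directory name is a placeholder):

```python
import jieba

# write the cache into an explicitly writable directory (hypothetical path; must already exist)
jieba.dt.tmp_dir = "/tmp/jieba_cache"
# optionally pin the cache file name instead of the md5-based default
# jieba.dt.cache_file = "my_jieba.cache"

jieba.initialize()  # builds or loads the prefix dict using the settings above
```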
* 范例:
|
* 范例:
|
||||||
|
|
||||||
* 自定义词典:https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
|
* 自定义词典:https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
|
||||||
@ -128,12 +132,18 @@ print(", ".join(seg_list))
|
|||||||
|
|
||||||
* "通过用户自定义词典来增强歧义纠错能力" --- https://github.com/fxsjy/jieba/issues/14
|
* "通过用户自定义词典来增强歧义纠错能力" --- https://github.com/fxsjy/jieba/issues/14
|
||||||
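A minimal sketch of tuning the dictionary at runtime with the functions this commit maps onto `jieba.dt` (the sample words are arbitrary):

```python
import jieba

jieba.add_word('石墨烯')                    # force a new word into the dictionary
jieba.add_word('凱特琳', freq=42, tag='nz')
jieba.del_word('自定义词')                  # same as add_word(word, 0)

# pick a frequency that keeps「台中」in one piece, then apply it
jieba.suggest_freq('台中', tune=True)
print(jieba.lcut('「台中」正确应该不会被切开', HMM=False))
```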
|
|
||||||
3) :关键词提取
|
3. 关键词提取
|
||||||
-------------
|
-------------
|
||||||
* jieba.analyse.extract_tags(sentence,topK,withWeight) #需要先 `import jieba.analyse`
|
### 基于 TF-IDF 算法的关键词抽取
|
||||||
* sentence 为待提取的文本
|
|
||||||
* topK 为返回几个 TF/IDF 权重最大的关键词,默认值为 20
|
`import jieba.analyse`
|
||||||
* withWeight 为是否一并返回关键词权重值,默认值为 False
|
|
||||||
|
* jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
|
||||||
|
* sentence 为待提取的文本
|
||||||
|
* topK 为返回几个 TF/IDF 权重最大的关键词,默认值为 20
|
||||||
|
* withWeight 为是否一并返回关键词权重值,默认值为 False
|
||||||
|
* allowPOS 仅包括指定词性的词,默认值为空,即不筛选
|
||||||
|
* jieba.analyse.TFIDF(idf_path=None) 新建 TFIDF 实例,idf_path 为 IDF 频率文件
|
||||||
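A minimal sketch of the TF-IDF interface described above (the sentence is arbitrary sample text):

```python
# encoding=utf-8
import jieba.analyse

s = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元。"
for word, weight in jieba.analyse.extract_tags(s, topK=5, withWeight=True):
    print(word, weight)

# a separate TFIDF instance may load its own IDF file (path is hypothetical)
# tfidf = jieba.analyse.TFIDF(idf_path="my_idf.txt")
```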
|
|
||||||
代码示例 (关键词提取)
|
代码示例 (关键词提取)
|
||||||
|
|
||||||
@ -155,37 +165,27 @@ https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
|
|||||||
|
|
||||||
* 用法示例:https://github.com/fxsjy/jieba/blob/master/test/extract_tags_with_weight.py
|
* 用法示例:https://github.com/fxsjy/jieba/blob/master/test/extract_tags_with_weight.py
|
||||||
|
|
||||||
#### 基于TextRank算法的关键词抽取实现
|
### 基于 TextRank 算法的关键词抽取
|
||||||
|
|
||||||
|
* jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) 直接使用,接口相同,注意默认过滤词性。
|
||||||
|
* jieba.analyse.TextRank() 新建自定义 TextRank 实例
|
||||||
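A minimal sketch of the TextRank interface (note that it filters POS by default; the sentence is arbitrary sample text):

```python
import jieba.analyse

s = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元。"
# keeps only ('ns', 'n', 'vn', 'v') words unless allowPOS is overridden
print(jieba.analyse.textrank(s, topK=10))

tr = jieba.analyse.TextRank()   # a reusable instance with its own stop words and window size
print(tr.textrank(s, topK=10))
```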
|
|
||||||
算法论文: [TextRank: Bringing Order into Texts](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)
|
算法论文: [TextRank: Bringing Order into Texts](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)
|
||||||
|
|
||||||
##### 基本思想:
|
#### 基本思想:
|
||||||
|
|
||||||
1. 将待抽取关键词的文本进行分词
|
1. 将待抽取关键词的文本进行分词
|
||||||
2. 以固定窗口大小(我选的5,可适当调整),词之间的共现关系,构建图
|
2. 以固定窗口大小(默认为5,通过span属性调整),词之间的共现关系,构建图
|
||||||
3. 计算图中节点的PageRank,注意是无向带权图
|
3. 计算图中节点的PageRank,注意是无向带权图
|
||||||
|
|
||||||
##### 基本使用:
|
#### 使用示例:
|
||||||
jieba.analyse.textrank(raw_text)
|
|
||||||
|
|
||||||
##### 示例结果:
|
见 [test/demo.py](https://github.com/fxsjy/jieba/blob/master/test/demo.py)
|
||||||
来自`__main__`的示例结果:
|
|
||||||
|
|
||||||
```
|
4. 词性标注
|
||||||
吉林 1.0
|
|
||||||
欧亚 0.864834432786
|
|
||||||
置业 0.553465925497
|
|
||||||
实现 0.520660869531
|
|
||||||
收入 0.379699688954
|
|
||||||
增资 0.355086023683
|
|
||||||
子公司 0.349758490263
|
|
||||||
全资 0.308537396283
|
|
||||||
城市 0.306103738053
|
|
||||||
商业 0.304837414946
|
|
||||||
```
|
|
||||||
|
|
||||||
4) : 词性标注
|
|
||||||
-----------
|
-----------
|
||||||
* 标注句子分词后每个词的词性,采用和 ictclas 兼容的标记法
|
* `jieba.posseg.POSTokenizer(tokenizer=None)` 新建自定义分词器,`tokenizer` 参数可指定内部使用的 `jieba.Tokenizer` 分词器。`jieba.posseg.dt` 为默认词性标注分词器。
|
||||||
|
* 标注句子分词后每个词的词性,采用和 ictclas 兼容的标记法。
|
||||||
* 用法示例
|
* 用法示例
|
||||||
|
|
||||||
```pycon
|
```pycon
|
||||||
@ -200,10 +200,10 @@ jieba.analyse.textrank(raw_text)
|
|||||||
天安门 ns
|
天安门 ns
|
||||||
```
|
```
|
||||||
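A further sketch, beyond the example above, of binding a POSTokenizer to a custom Tokenizer (the dictionary path is a placeholder):

```python
import jieba
import jieba.posseg as pseg

# the module-level functions use the default POS tokenizer jieba.posseg.dt
for w in pseg.cut("我爱北京天安门"):
    print(w.word, w.flag)

# a POSTokenizer that reuses its own jieba.Tokenizer (path is hypothetical)
# tk = jieba.Tokenizer(dictionary="test/userdict.txt")
# ptk = jieba.posseg.POSTokenizer(tk)
# print([(w.word, w.flag) for w in ptk.cut("我爱北京天安门")])
```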
|
|
||||||
5) : 并行分词
|
5. 并行分词
|
||||||
-----------
|
-----------
|
||||||
* 原理:将目标文本按行分隔后,把各行文本分配到多个 python 进程并行分词,然后归并结果,从而获得分词速度的可观提升
|
* 原理:将目标文本按行分隔后,把各行文本分配到多个 Python 进程并行分词,然后归并结果,从而获得分词速度的可观提升
|
||||||
* 基于 python 自带的 multiprocessing 模块,目前暂不支持 windows
|
* 基于 python 自带的 multiprocessing 模块,目前暂不支持 Windows
|
||||||
* 用法:
|
* 用法:
|
||||||
* `jieba.enable_parallel(4)` # 开启并行分词模式,参数为并行进程数
|
* `jieba.enable_parallel(4)` # 开启并行分词模式,参数为并行进程数
|
||||||
* `jieba.disable_parallel()` # 关闭并行分词模式
|
* `jieba.disable_parallel()` # 关闭并行分词模式
|
||||||
@ -212,8 +212,9 @@ jieba.analyse.textrank(raw_text)
|
|||||||
|
|
||||||
* 实验结果:在 4 核 3.4GHz Linux 机器上,对金庸全集进行精确分词,获得了 1MB/s 的速度,是单进程版的 3.3 倍。
|
* 实验结果:在 4 核 3.4GHz Linux 机器上,对金庸全集进行精确分词,获得了 1MB/s 的速度,是单进程版的 3.3 倍。
|
||||||
|
|
||||||
|
* **注意**:并行分词仅支持默认分词器 `jieba.dt` 和 `jieba.posseg.dt`。
|
||||||
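A minimal sketch of parallel mode (POSIX only; the multi-line string stands in for a large file):

```python
import jieba

jieba.enable_parallel(4)           # start 4 worker processes
content = "我来到北京清华大学\n小明硕士毕业于中国科学院计算所\n"
print(" / ".join(jieba.cut(content)))
jieba.disable_parallel()           # restore the single-process jieba.dt functions
```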
|
|
||||||
6) : Tokenize:返回词语在原文的起始位置
|
6. Tokenize:返回词语在原文的起止位置
|
||||||
----------------------------------
|
----------------------------------
|
||||||
* 注意,输入参数只接受 unicode
|
* 注意,输入参数只接受 unicode
|
||||||
* 默认模式
|
* 默认模式
|
||||||
@ -235,7 +236,7 @@ word 有限公司 start: 6 end:10
|
|||||||
* 搜索模式
|
* 搜索模式
|
||||||
|
|
||||||
```python
|
```python
|
||||||
result = jieba.tokenize(u'永和服装饰品有限公司',mode='search')
|
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
|
||||||
for tk in result:
|
for tk in result:
|
||||||
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
|
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
|
||||||
```
|
```
|
||||||
@ -250,15 +251,15 @@ word 有限公司 start: 6 end:10
|
|||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
7) : ChineseAnalyzer for Whoosh 搜索引擎
|
7. ChineseAnalyzer for Whoosh 搜索引擎
|
||||||
--------------------------------------------
|
--------------------------------------------
|
||||||
* 引用: `from jieba.analyse import ChineseAnalyzer`
|
* 引用: `from jieba.analyse import ChineseAnalyzer`
|
||||||
* 用法示例:https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
|
* 用法示例:https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
|
||||||
|
|
||||||
8) : 命令行分词
|
8. 命令行分词
|
||||||
-------------------
|
-------------------
|
||||||
|
|
||||||
使用示例:`cat news.txt | python -m jieba > cut_result.txt`
|
使用示例:`python -m jieba news.txt > cut_result.txt`
|
||||||
|
|
||||||
命令行选项(翻译):
|
命令行选项(翻译):
|
||||||
|
|
||||||
@ -310,10 +311,10 @@ word 有限公司 start: 6 end:10
|
|||||||
|
|
||||||
If no filename specified, use STDIN instead.
|
If no filename specified, use STDIN instead.
|
||||||
|
|
||||||
模块初始化机制的改变:lazy load (从0.28版本开始)
|
延迟加载机制
|
||||||
-------------------------------------------
|
------------
|
||||||
|
|
||||||
jieba 采用延迟加载,"import jieba" 不会立即触发词典的加载,一旦有必要才开始加载词典构建前缀字典。如果你想手工初始 jieba,也可以手动初始化。
|
jieba 采用延迟加载,`import jieba` 和 `jieba.Tokenizer()` 不会立即触发词典的加载,一旦有必要才开始加载词典构建前缀字典。如果你想手工初始 jieba,也可以手动初始化。
|
||||||
|
|
||||||
import jieba
|
import jieba
|
||||||
jieba.initialize() # 手动初始化(可选)
|
jieba.initialize() # 手动初始化(可选)
|
||||||
@ -460,12 +461,15 @@ Algorithm
|
|||||||
Main Functions
|
Main Functions
|
||||||
==============
|
==============
|
||||||
|
|
||||||
1) : Cut
|
1. Cut
|
||||||
--------
|
--------
|
||||||
* The `jieba.cut` function accepts three input parameters: the first parameter is the string to be cut; the second parameter is `cut_all`, controlling the cut mode; the third parameter is to control whether to use the Hidden Markov Model.
|
* The `jieba.cut` function accepts three input parameters: the first parameter is the string to be cut; the second parameter is `cut_all`, controlling the cut mode; the third parameter is to control whether to use the Hidden Markov Model.
|
||||||
* `jieba.cut_for_search` accepts two parameters: the string to be cut; whether to use the Hidden Markov Model. This will cut the sentence into short words suitable for search engines.
|
* `jieba.cut_for_search` accepts two parameters: the string to be cut; whether to use the Hidden Markov Model. This will cut the sentence into short words suitable for search engines.
|
||||||
* The input string can be a unicode/str object, or a str/bytes object which is encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
|
* The input string can be a unicode/str object, or a str/bytes object which is encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
|
||||||
* `jieba.cut` and `jieba.cut_for_search` return a generator, from which you can use a `for` loop to get the segmentation result (in unicode), or `list(jieba.cut( ... ))` to create a list.
|
* `jieba.cut` and `jieba.cut_for_search` return a generator, from which you can use a `for` loop to get the segmentation result (in unicode).
|
||||||
|
* `jieba.lcut` and `jieba.lcut_for_search` return a list.
|
||||||
|
* `jieba.Tokenizer(dictionary=DEFAULT_DICT)` creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. `jieba.dt` is the default Tokenizer, to which almost all global functions are mapped.
|
||||||
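A minimal sketch of keeping two Tokenizer instances with separate dictionaries side by side (the second dictionary path is a placeholder):

```python
import jieba

default_tk = jieba.dt          # the Tokenizer behind all module-level functions
print(default_tk.lcut("我来到北京清华大学"))

# an independent Tokenizer with its own dictionary (path is hypothetical)
# custom_tk = jieba.Tokenizer(dictionary="extra_dict/dict.txt.small")
# print(custom_tk.lcut("我来到北京清华大学"))
```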
|
|
||||||
|
|
||||||
**Code example: segmentation**
|
**Code example: segmentation**
|
||||||
|
|
||||||
@ -497,7 +501,7 @@ Output:
|
|||||||
[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
|
[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
|
||||||
|
|
||||||
|
|
||||||
2) : Add a custom dictionary
|
2. Add a custom dictionary
|
||||||
----------------------------
|
----------------------------
|
||||||
|
|
||||||
### Load dictionary
|
### Load dictionary
|
||||||
@ -505,6 +509,9 @@ Output:
|
|||||||
* Developers can specify their own custom dictionary to be included in the jieba default dictionary. Jieba is able to identify new words, but adding your own new words can ensure a higher accuracy.
|
* Developers can specify their own custom dictionary to be included in the jieba default dictionary. Jieba is able to identify new words, but adding your own new words can ensure a higher accuracy.
|
||||||
* Usage: `jieba.load_userdict(file_name) # file_name is the path of the custom dictionary`
|
* Usage: `jieba.load_userdict(file_name) # file_name is the path of the custom dictionary`
|
||||||
* The dictionary format is the same as that of `analyse/idf.txt`: one word per line; each line is divided into two parts, the first is the word itself, the other is the word frequency, separated by a space
|
* The dictionary format is the same as that of `analyse/idf.txt`: one word per line; each line is divided into two parts, the first is the word itself, the other is the word frequency, separated by a space
|
||||||
|
* The word frequency can be omitted, then a calculated value will be used.
|
||||||
|
* Change a Tokenizer's `tmp_dir` and `cache_file` to specify the path of the cache file, for using on a restricted file system.
|
||||||
|
|
||||||
* Example:
|
* Example:
|
||||||
|
|
||||||
云计算 5
|
云计算 5
|
||||||
@ -540,12 +547,16 @@ Example:
|
|||||||
「/台中/」/正确/应该/不会/被/切开
|
「/台中/」/正确/应该/不会/被/切开
|
||||||
```
|
```
|
||||||
|
|
||||||
3) : Keyword Extraction
|
3. Keyword Extraction
|
||||||
-----------------------
|
-----------------------
|
||||||
* `jieba.analyse.extract_tags(sentence,topK,withWeight) # needs to first import jieba.analyse`
|
`import jieba.analyse`
|
||||||
* `sentence`: the text to be extracted
|
|
||||||
* `topK`: return how many keywords with the highest TF/IDF weights. The default value is 20
|
* `jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())`
|
||||||
* `withWeight`: whether return TF/IDF weights with the keywords. The default value is False
|
* `sentence`: the text to be extracted
|
||||||
|
* `topK`: return how many keywords with the highest TF/IDF weights. The default value is 20
|
||||||
|
* `withWeight`: whether return TF/IDF weights with the keywords. The default value is False
|
||||||
|
* `allowPOS`: only include words with the specified POS tags. The default is empty, i.e. no filtering.
|
||||||
|
* `jieba.analyse.TFIDF(idf_path=None)` creates a new TFIDF instance, `idf_path` specifies IDF file path.
|
||||||
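A minimal sketch of pointing the default TF-IDF extractor at custom IDF and stop-word files before extracting (both file paths are placeholders):

```python
import jieba.analyse

# both files are hypothetical; format is one entry per line
# jieba.analyse.set_idf_path("my_idf.txt")
# jieba.analyse.set_stop_words("my_stop_words.txt")

print(jieba.analyse.extract_tags("小明硕士毕业于中国科学院计算所", topK=5))
```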
|
|
||||||
Example (keyword extraction)
|
Example (keyword extraction)
|
||||||
|
|
||||||
@ -565,10 +576,15 @@ Developers can specify their own custom stop words corpus in jieba keyword extra
|
|||||||
|
|
||||||
There's also a [TextRank](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) implementation available.
|
There's also a [TextRank](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) implementation available.
|
||||||
|
|
||||||
Use: `jieba.analyse.textrank(raw_text)`.
|
Use: `jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))`
|
||||||
|
|
||||||
4) : Part of Speech Tagging
|
Note that it filters POS by default.
|
||||||
-----------
|
|
||||||
|
`jieba.analyse.TextRank()` creates a new TextRank instance.
|
||||||
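A minimal sketch of overriding the default POS filter ('nr' is the ictclas tag for person names; the sentence is arbitrary sample text):

```python
import jieba.analyse

s = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元。"
print(jieba.analyse.textrank(s, topK=10, allowPOS=('ns', 'n', 'vn', 'v', 'nr')))
```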
|
|
||||||
|
4. Part of Speech Tagging
|
||||||
|
-------------------------
|
||||||
|
* `jieba.posseg.POSTokenizer(tokenizer=None)` creates a new customized POSTokenizer. `tokenizer` specifies the `jieba.Tokenizer` to use internally. `jieba.posseg.dt` is the default POSTokenizer.
|
||||||
* Tags the POS of each word after segmentation, using labels compatible with ictclas.
|
* Tags the POS of each word after segmentation, using labels compatible with ictclas.
|
||||||
* Example:
|
* Example:
|
||||||
|
|
||||||
@ -584,8 +600,8 @@ Use: `jieba.analyse.textrank(raw_text)`.
|
|||||||
天安门 ns
|
天安门 ns
|
||||||
```
|
```
|
||||||
|
|
||||||
5) : Parallel Processing
|
5. Parallel Processing
|
||||||
-----------
|
----------------------
|
||||||
* Principle: Split target text by line, assign the lines into multiple Python processes, and then merge the results, which is considerably faster.
|
* Principle: Split target text by line, assign the lines into multiple Python processes, and then merge the results, which is considerably faster.
|
||||||
* Based on the multiprocessing module of Python.
|
* Based on the multiprocessing module of Python.
|
||||||
* Usage:
|
* Usage:
|
||||||
@ -597,8 +613,10 @@ Use: `jieba.analyse.textrank(raw_text)`.
|
|||||||
|
|
||||||
* Result: On a four-core 3.4GHz Linux machine, do accurate word segmentation on Complete Works of Jin Yong, and the speed reaches 1MB/s, which is 3.3 times faster than the single-process version.
|
* Result: On a four-core 3.4GHz Linux machine, do accurate word segmentation on Complete Works of Jin Yong, and the speed reaches 1MB/s, which is 3.3 times faster than the single-process version.
|
||||||
|
|
||||||
6) : Tokenize: return words with position
|
* **Note** that parallel processing supports only default tokenizers, `jieba.dt` and `jieba.posseg.dt`.
|
||||||
----------------------------------
|
|
||||||
|
6. Tokenize: return words with position
|
||||||
|
----------------------------------------
|
||||||
* The input must be unicode
|
* The input must be unicode
|
||||||
* Default mode
|
* Default mode
|
||||||
|
|
||||||
@ -634,13 +652,13 @@ word 有限公司 start: 6 end:10
|
|||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
7) : ChineseAnalyzer for Whoosh
|
7. ChineseAnalyzer for Whoosh
|
||||||
--------------------------------------------
|
-------------------------------
|
||||||
* `from jieba.analyse import ChineseAnalyzer`
|
* `from jieba.analyse import ChineseAnalyzer`
|
||||||
* Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
|
* Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
|
||||||
|
|
||||||
8) : Command Line Interface
|
8. Command Line Interface
|
||||||
-------------------
|
--------------------------------
|
||||||
|
|
||||||
$> python -m jieba --help
|
$> python -m jieba --help
|
||||||
usage: python -m jieba [options] filename
|
usage: python -m jieba [options] filename
|
||||||
@ -679,7 +697,8 @@ You can also specify the dictionary (not supported before version 0.28) :
|
|||||||
|
|
||||||
|
|
||||||
Using Other Dictionaries
|
Using Other Dictionaries
|
||||||
========
|
===========================
|
||||||
|
|
||||||
It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:
|
It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:
|
||||||
|
|
||||||
1. A smaller dictionary for a smaller memory footprint:
|
1. A smaller dictionary for a smaller memory footprint:
|
||||||
|
@ -6,47 +6,70 @@ import re
|
|||||||
import os
|
import os
|
||||||
import sys
|
import sys
|
||||||
import time
|
import time
|
||||||
import tempfile
|
|
||||||
import marshal
|
|
||||||
from math import log
|
|
||||||
import threading
|
|
||||||
from functools import wraps
|
|
||||||
import logging
|
import logging
|
||||||
|
import marshal
|
||||||
|
import tempfile
|
||||||
|
import threading
|
||||||
|
from math import log
|
||||||
from hashlib import md5
|
from hashlib import md5
|
||||||
from ._compat import *
|
from ._compat import *
|
||||||
from . import finalseg
|
from . import finalseg
|
||||||
|
|
||||||
DICTIONARY = "dict.txt"
|
if os.name == 'nt':
|
||||||
DICT_LOCK = threading.RLock()
|
from shutil import move as _replace_file
|
||||||
FREQ = {} # to be initialized
|
else:
|
||||||
total = 0
|
_replace_file = os.rename
|
||||||
user_word_tag_tab = {}
|
|
||||||
initialized = False
|
|
||||||
pool = None
|
|
||||||
tmp_dir = None
|
|
||||||
|
|
||||||
_curpath = os.path.normpath(
|
_get_module_path = lambda path: os.path.normpath(os.path.join(os.getcwd(),
|
||||||
os.path.join(os.getcwd(), os.path.dirname(__file__)))
|
os.path.dirname(__file__), path))
|
||||||
|
_get_abs_path = lambda path: os.path.normpath(os.path.join(os.getcwd(), path))
|
||||||
|
|
||||||
|
DEFAULT_DICT = _get_module_path("dict.txt")
|
||||||
|
|
||||||
log_console = logging.StreamHandler(sys.stderr)
|
log_console = logging.StreamHandler(sys.stderr)
|
||||||
logger = logging.getLogger(__name__)
|
default_logger = logging.getLogger(__name__)
|
||||||
logger.setLevel(logging.DEBUG)
|
default_logger.setLevel(logging.DEBUG)
|
||||||
logger.addHandler(log_console)
|
default_logger.addHandler(log_console)
|
||||||
|
|
||||||
|
DICT_WRITING = {}
|
||||||
|
|
||||||
|
pool = None
|
||||||
|
|
||||||
|
re_eng = re.compile('[a-zA-Z0-9]', re.U)
|
||||||
|
|
||||||
|
# \u4E00-\u9FA5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
|
||||||
|
# \r\n|\s : whitespace characters. Will not be handled.
|
||||||
|
re_han_default = re.compile("([\u4E00-\u9FA5a-zA-Z0-9+#&\._]+)", re.U)
|
||||||
|
re_skip_default = re.compile("(\r\n|\s)", re.U)
|
||||||
|
re_han_cut_all = re.compile("([\u4E00-\u9FA5]+)", re.U)
|
||||||
|
re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)
|
||||||
|
|
||||||
def setLogLevel(log_level):
|
def setLogLevel(log_level):
|
||||||
global logger
|
global logger
|
||||||
logger.setLevel(log_level)
|
default_logger.setLevel(log_level)
|
||||||
|
|
||||||
|
class Tokenizer(object):
|
||||||
|
|
||||||
def gen_pfdict(f_name):
|
def __init__(self, dictionary=DEFAULT_DICT):
|
||||||
|
self.lock = threading.RLock()
|
||||||
|
self.dictionary = _get_abs_path(dictionary)
|
||||||
|
self.FREQ = {}
|
||||||
|
self.total = 0
|
||||||
|
self.user_word_tag_tab = {}
|
||||||
|
self.initialized = False
|
||||||
|
self.tmp_dir = None
|
||||||
|
self.cache_file = None
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return '<Tokenizer dictionary=%r>' % self.dictionary
|
||||||
|
|
||||||
|
def gen_pfdict(self, f_name):
|
||||||
lfreq = {}
|
lfreq = {}
|
||||||
ltotal = 0
|
ltotal = 0
|
||||||
with open(f_name, 'rb') as f:
|
with open(f_name, 'rb') as f:
|
||||||
lineno = 0
|
for lineno, line in enumerate(f, 1):
|
||||||
for line in f.read().rstrip().decode('utf-8').splitlines():
|
|
||||||
lineno += 1
|
|
||||||
try:
|
try:
|
||||||
|
line = line.strip().decode('utf-8')
|
||||||
word, freq = line.split(' ')[:2]
|
word, freq = line.split(' ')[:2]
|
||||||
freq = int(freq)
|
freq = int(freq)
|
||||||
lfreq[word] = freq
|
lfreq[word] = freq
|
||||||
@ -55,77 +78,113 @@ def gen_pfdict(f_name):
|
|||||||
wfrag = word[:ch + 1]
|
wfrag = word[:ch + 1]
|
||||||
if wfrag not in lfreq:
|
if wfrag not in lfreq:
|
||||||
lfreq[wfrag] = 0
|
lfreq[wfrag] = 0
|
||||||
except ValueError as e:
|
except ValueError:
|
||||||
logger.debug('%s at line %s %s' % (f_name, lineno, line))
|
raise ValueError(
|
||||||
raise e
|
'invalid dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
|
||||||
return lfreq, ltotal
|
return lfreq, ltotal
|
||||||
|
|
||||||
|
def initialize(self, dictionary=None):
|
||||||
|
if dictionary:
|
||||||
|
abs_path = _get_abs_path(dictionary)
|
||||||
|
if self.dictionary == abs_path and self.initialized:
|
||||||
|
return
|
||||||
|
else:
|
||||||
|
self.dictionary = abs_path
|
||||||
|
self.initialized = False
|
||||||
|
else:
|
||||||
|
abs_path = self.dictionary
|
||||||
|
|
||||||
def initialize(dictionary=None):
|
with self.lock:
|
||||||
global FREQ, total, initialized, DICTIONARY, DICT_LOCK, tmp_dir
|
try:
|
||||||
if not dictionary:
|
with DICT_WRITING[abs_path]:
|
||||||
dictionary = DICTIONARY
|
pass
|
||||||
with DICT_LOCK:
|
except KeyError:
|
||||||
if initialized:
|
pass
|
||||||
|
if self.initialized:
|
||||||
return
|
return
|
||||||
|
|
||||||
abs_path = os.path.join(_curpath, dictionary)
|
default_logger.debug("Building prefix dict from %s ..." % abs_path)
|
||||||
logger.debug("Building prefix dict from %s ..." % abs_path)
|
|
||||||
t1 = time.time()
|
t1 = time.time()
|
||||||
|
if self.cache_file:
|
||||||
|
cache_file = self.cache_file
|
||||||
# default dictionary
|
# default dictionary
|
||||||
if abs_path == os.path.join(_curpath, "dict.txt"):
|
elif abs_path == DEFAULT_DICT:
|
||||||
cache_file = os.path.join(tmp_dir if tmp_dir else tempfile.gettempdir(),"jieba.cache")
|
cache_file = "jieba.cache"
|
||||||
else: # custom dictionary
|
else: # custom dictionary
|
||||||
cache_file = os.path.join(tmp_dir if tmp_dir else tempfile.gettempdir(),"jieba.u%s.cache" % md5(
|
cache_file = "jieba.u%s.cache" % md5(
|
||||||
abs_path.encode('utf-8', 'replace')).hexdigest())
|
abs_path.encode('utf-8', 'replace')).hexdigest()
|
||||||
|
cache_file = os.path.join(
|
||||||
|
self.tmp_dir or tempfile.gettempdir(), cache_file)
|
||||||
|
|
||||||
load_from_cache_fail = True
|
load_from_cache_fail = True
|
||||||
if os.path.isfile(cache_file) and os.path.getmtime(cache_file) > os.path.getmtime(abs_path):
|
if os.path.isfile(cache_file) and os.path.getmtime(cache_file) > os.path.getmtime(abs_path):
|
||||||
logger.debug("Loading model from cache %s" % cache_file)
|
default_logger.debug(
|
||||||
|
"Loading model from cache %s" % cache_file)
|
||||||
try:
|
try:
|
||||||
with open(cache_file, 'rb') as cf:
|
with open(cache_file, 'rb') as cf:
|
||||||
FREQ, total = marshal.load(cf)
|
self.FREQ, self.total = marshal.load(cf)
|
||||||
load_from_cache_fail = False
|
load_from_cache_fail = False
|
||||||
except Exception:
|
except Exception:
|
||||||
load_from_cache_fail = True
|
load_from_cache_fail = True
|
||||||
|
|
||||||
if load_from_cache_fail:
|
if load_from_cache_fail:
|
||||||
FREQ, total = gen_pfdict(abs_path)
|
wlock = DICT_WRITING.get(abs_path, threading.RLock())
|
||||||
logger.debug("Dumping model to file cache %s" % cache_file)
|
DICT_WRITING[abs_path] = wlock
|
||||||
|
with wlock:
|
||||||
|
self.FREQ, self.total = self.gen_pfdict(abs_path)
|
||||||
|
default_logger.debug(
|
||||||
|
"Dumping model to file cache %s" % cache_file)
|
||||||
try:
|
try:
|
||||||
fd, fpath = tempfile.mkstemp()
|
fd, fpath = tempfile.mkstemp()
|
||||||
with os.fdopen(fd, 'wb') as temp_cache_file:
|
with os.fdopen(fd, 'wb') as temp_cache_file:
|
||||||
marshal.dump((FREQ, total), temp_cache_file)
|
marshal.dump(
|
||||||
if os.name == 'nt':
|
(self.FREQ, self.total), temp_cache_file)
|
||||||
from shutil import move as replace_file
|
_replace_file(fpath, cache_file)
|
||||||
else:
|
|
||||||
replace_file = os.rename
|
|
||||||
replace_file(fpath, cache_file)
|
|
||||||
except Exception:
|
except Exception:
|
||||||
logger.exception("Dump cache file failed.")
|
default_logger.exception("Dump cache file failed.")
|
||||||
|
|
||||||
initialized = True
|
try:
|
||||||
|
del DICT_WRITING[abs_path]
|
||||||
|
except KeyError:
|
||||||
|
pass
|
||||||
|
|
||||||
logger.debug("Loading model cost %s seconds." % (time.time() - t1))
|
self.initialized = True
|
||||||
logger.debug("Prefix dict has been built succesfully.")
|
default_logger.debug(
|
||||||
|
"Loading model cost %.3f seconds." % (time.time() - t1))
|
||||||
|
default_logger.debug("Prefix dict has been built succesfully.")
|
||||||
|
|
||||||
|
def check_initialized(self):
|
||||||
|
if not self.initialized:
|
||||||
|
self.initialize()
|
||||||
|
|
||||||
def require_initialized(fn):
|
def calc(self, sentence, DAG, route):
|
||||||
|
N = len(sentence)
|
||||||
|
route[N] = (0, 0)
|
||||||
|
logtotal = log(self.total)
|
||||||
|
for idx in xrange(N - 1, -1, -1):
|
||||||
|
route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
|
||||||
|
logtotal + route[x + 1][0], x) for x in DAG[idx])
|
||||||
|
|
||||||
@wraps(fn)
|
def get_DAG(self, sentence):
|
||||||
def wrapped(*args, **kwargs):
|
self.check_initialized()
|
||||||
global initialized
|
DAG = {}
|
||||||
if initialized:
|
N = len(sentence)
|
||||||
return fn(*args, **kwargs)
|
for k in xrange(N):
|
||||||
else:
|
tmplist = []
|
||||||
initialize(DICTIONARY)
|
i = k
|
||||||
return fn(*args, **kwargs)
|
frag = sentence[k]
|
||||||
|
while i < N and frag in self.FREQ:
|
||||||
|
if self.FREQ[frag]:
|
||||||
|
tmplist.append(i)
|
||||||
|
i += 1
|
||||||
|
frag = sentence[k:i + 1]
|
||||||
|
if not tmplist:
|
||||||
|
tmplist.append(k)
|
||||||
|
DAG[k] = tmplist
|
||||||
|
return DAG
|
||||||
|
|
||||||
return wrapped
|
def __cut_all(self, sentence):
|
||||||
|
dag = self.get_DAG(sentence)
|
||||||
|
|
||||||
def __cut_all(sentence):
|
|
||||||
dag = get_DAG(sentence)
|
|
||||||
old_j = -1
|
old_j = -1
|
||||||
for k, L in iteritems(dag):
|
for k, L in iteritems(dag):
|
||||||
if len(L) == 1 and k > old_j:
|
if len(L) == 1 and k > old_j:
|
||||||
@ -137,42 +196,10 @@ def __cut_all(sentence):
|
|||||||
yield sentence[k:j + 1]
|
yield sentence[k:j + 1]
|
||||||
old_j = j
|
old_j = j
|
||||||
|
|
||||||
|
def __cut_DAG_NO_HMM(self, sentence):
|
||||||
def calc(sentence, DAG, route):
|
DAG = self.get_DAG(sentence)
|
||||||
N = len(sentence)
|
|
||||||
route[N] = (0, 0)
|
|
||||||
logtotal = log(total)
|
|
||||||
for idx in xrange(N - 1, -1, -1):
|
|
||||||
route[idx] = max((log(FREQ.get(sentence[idx:x + 1]) or 1) -
|
|
||||||
logtotal + route[x + 1][0], x) for x in DAG[idx])
|
|
||||||
|
|
||||||
|
|
||||||
@require_initialized
|
|
||||||
def get_DAG(sentence):
|
|
||||||
global FREQ
|
|
||||||
DAG = {}
|
|
||||||
N = len(sentence)
|
|
||||||
for k in xrange(N):
|
|
||||||
tmplist = []
|
|
||||||
i = k
|
|
||||||
frag = sentence[k]
|
|
||||||
while i < N and frag in FREQ:
|
|
||||||
if FREQ[frag]:
|
|
||||||
tmplist.append(i)
|
|
||||||
i += 1
|
|
||||||
frag = sentence[k:i + 1]
|
|
||||||
if not tmplist:
|
|
||||||
tmplist.append(k)
|
|
||||||
DAG[k] = tmplist
|
|
||||||
return DAG
|
|
||||||
|
|
||||||
re_eng = re.compile('[a-zA-Z0-9]', re.U)
|
|
||||||
|
|
||||||
|
|
||||||
def __cut_DAG_NO_HMM(sentence):
|
|
||||||
DAG = get_DAG(sentence)
|
|
||||||
route = {}
|
route = {}
|
||||||
calc(sentence, DAG, route)
|
self.calc(sentence, DAG, route)
|
||||||
x = 0
|
x = 0
|
||||||
N = len(sentence)
|
N = len(sentence)
|
||||||
buf = ''
|
buf = ''
|
||||||
@ -192,11 +219,10 @@ def __cut_DAG_NO_HMM(sentence):
|
|||||||
yield buf
|
yield buf
|
||||||
buf = ''
|
buf = ''
|
||||||
|
|
||||||
|
def __cut_DAG(self, sentence):
|
||||||
def __cut_DAG(sentence):
|
DAG = self.get_DAG(sentence)
|
||||||
DAG = get_DAG(sentence)
|
|
||||||
route = {}
|
route = {}
|
||||||
calc(sentence, DAG, route=route)
|
self.calc(sentence, DAG, route)
|
||||||
x = 0
|
x = 0
|
||||||
buf = ''
|
buf = ''
|
||||||
N = len(sentence)
|
N = len(sentence)
|
||||||
@ -211,7 +237,7 @@ def __cut_DAG(sentence):
|
|||||||
yield buf
|
yield buf
|
||||||
buf = ''
|
buf = ''
|
||||||
else:
|
else:
|
||||||
if not FREQ.get(buf):
|
if not self.FREQ.get(buf):
|
||||||
recognized = finalseg.cut(buf)
|
recognized = finalseg.cut(buf)
|
||||||
for t in recognized:
|
for t in recognized:
|
||||||
yield t
|
yield t
|
||||||
@ -225,7 +251,7 @@ def __cut_DAG(sentence):
|
|||||||
if buf:
|
if buf:
|
||||||
if len(buf) == 1:
|
if len(buf) == 1:
|
||||||
yield buf
|
yield buf
|
||||||
elif not FREQ.get(buf):
|
elif not self.FREQ.get(buf):
|
||||||
recognized = finalseg.cut(buf)
|
recognized = finalseg.cut(buf)
|
||||||
for t in recognized:
|
for t in recognized:
|
||||||
yield t
|
yield t
|
||||||
@ -233,13 +259,7 @@ def __cut_DAG(sentence):
|
|||||||
for elem in buf:
|
for elem in buf:
|
||||||
yield elem
|
yield elem
|
||||||
|
|
||||||
re_han_default = re.compile("([\u4E00-\u9FA5a-zA-Z0-9+#&\._]+)", re.U)
|
def cut(self, sentence, cut_all=False, HMM=True):
|
||||||
re_skip_default = re.compile("(\r\n|\s)", re.U)
|
|
||||||
re_han_cut_all = re.compile("([\u4E00-\u9FA5]+)", re.U)
|
|
||||||
re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)
|
|
||||||
|
|
||||||
|
|
||||||
def cut(sentence, cut_all=False, HMM=True):
|
|
||||||
'''
|
'''
|
||||||
The main function that segments an entire sentence that contains
|
The main function that segments an entire sentence that contains
|
||||||
Chinese characters into seperated words.
|
Chinese characters into seperated words.
|
||||||
@ -251,22 +271,19 @@ def cut(sentence, cut_all=False, HMM=True):
|
|||||||
'''
|
'''
|
||||||
sentence = strdecode(sentence)
|
sentence = strdecode(sentence)
|
||||||
|
|
||||||
# \u4E00-\u9FA5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
|
|
||||||
# \r\n|\s : whitespace characters. Will not be handled.
|
|
||||||
|
|
||||||
if cut_all:
|
if cut_all:
|
||||||
re_han = re_han_cut_all
|
re_han = re_han_cut_all
|
||||||
re_skip = re_skip_cut_all
|
re_skip = re_skip_cut_all
|
||||||
else:
|
else:
|
||||||
re_han = re_han_default
|
re_han = re_han_default
|
||||||
re_skip = re_skip_default
|
re_skip = re_skip_default
|
||||||
blocks = re_han.split(sentence)
|
|
||||||
if cut_all:
|
if cut_all:
|
||||||
cut_block = __cut_all
|
cut_block = self.__cut_all
|
||||||
elif HMM:
|
elif HMM:
|
||||||
cut_block = __cut_DAG
|
cut_block = self.__cut_DAG
|
||||||
else:
|
else:
|
||||||
cut_block = __cut_DAG_NO_HMM
|
cut_block = self.__cut_DAG_NO_HMM
|
||||||
|
blocks = re_han.split(sentence)
|
||||||
for blk in blocks:
|
for blk in blocks:
|
||||||
if not blk:
|
if not blk:
|
||||||
continue
|
continue
|
||||||
@ -284,12 +301,11 @@ def cut(sentence, cut_all=False, HMM=True):
|
|||||||
else:
|
else:
|
||||||
yield x
|
yield x
|
||||||
|
|
||||||
|
def cut_for_search(self, sentence, HMM=True):
|
||||||
def cut_for_search(sentence, HMM=True):
|
|
||||||
"""
|
"""
|
||||||
Finer segmentation for search engines.
|
Finer segmentation for search engines.
|
||||||
"""
|
"""
|
||||||
words = cut(sentence, HMM=HMM)
|
words = self.cut(sentence, HMM=HMM)
|
||||||
for w in words:
|
for w in words:
|
||||||
if len(w) > 2:
|
if len(w) > 2:
|
||||||
for i in xrange(len(w) - 1):
|
for i in xrange(len(w) - 1):
|
||||||
@ -303,9 +319,28 @@ def cut_for_search(sentence, HMM=True):
|
|||||||
yield gram3
|
yield gram3
|
||||||
yield w
|
yield w
|
||||||
|
|
||||||
|
def lcut(self, *args, **kwargs):
|
||||||
|
return list(self.cut(*args, **kwargs))
|
||||||
|
|
||||||
@require_initialized
|
def lcut_for_search(self, *args, **kwargs):
|
||||||
def load_userdict(f):
|
return list(self.cut_for_search(*args, **kwargs))
|
||||||
|
|
||||||
|
_lcut = lcut
|
||||||
|
_lcut_for_search = lcut_for_search
|
||||||
|
|
||||||
|
def _lcut_no_hmm(self, sentence):
|
||||||
|
return self.lcut(sentence, False, False)
|
||||||
|
|
||||||
|
def _lcut_all(self, sentence):
|
||||||
|
return self.lcut(sentence, True)
|
||||||
|
|
||||||
|
def _lcut_for_search_no_hmm(self, sentence):
|
||||||
|
return self.lcut_for_search(sentence, False)
|
||||||
|
|
||||||
|
def get_abs_path_dict(self):
|
||||||
|
return _get_abs_path(self.dictionary)
|
||||||
|
|
||||||
|
def load_userdict(self, f):
|
||||||
'''
|
'''
|
||||||
Load personalized dict to improve detect rate.
|
Load personalized dict to improve detect rate.
|
||||||
|
|
||||||
@ -318,56 +353,50 @@ def load_userdict(f):
|
|||||||
...
|
...
|
||||||
Word type may be ignored
|
Word type may be ignored
|
||||||
'''
|
'''
|
||||||
|
self.check_initialized()
|
||||||
if isinstance(f, string_types):
|
if isinstance(f, string_types):
|
||||||
f = open(f, 'rb')
|
f = open(f, 'rb')
|
||||||
content = f.read().decode('utf-8').lstrip('\ufeff')
|
for lineno, ln in enumerate(f, 1):
|
||||||
line_no = 0
|
|
||||||
for line in content.splitlines():
|
|
||||||
try:
|
try:
|
||||||
line_no += 1
|
line = ln.strip().decode('utf-8').lstrip('\ufeff')
|
||||||
line = line.strip()
|
|
||||||
if not line:
|
if not line:
|
||||||
continue
|
continue
|
||||||
tup = line.split(" ")
|
tup = line.split(" ")
|
||||||
add_word(*tup)
|
self.add_word(*tup)
|
||||||
except Exception as e:
|
except Exception:
|
||||||
logger.debug('%s at line %s %s' % (f.name, line_no, line))
|
raise ValueError(
|
||||||
raise e
|
'invalid dictionary entry in %s at Line %s: %s' % (
|
||||||
|
f.name, lineno, line))
|
||||||
|
|
||||||
|
def add_word(self, word, freq=None, tag=None):
|
||||||
@require_initialized
|
|
||||||
def add_word(word, freq=None, tag=None):
|
|
||||||
"""
|
"""
|
||||||
Add a word to dictionary.
|
Add a word to dictionary.
|
||||||
|
|
||||||
freq and tag can be omitted, freq defaults to be a calculated value
|
freq and tag can be omitted, freq defaults to be a calculated value
|
||||||
that ensures the word can be cut out.
|
that ensures the word can be cut out.
|
||||||
"""
|
"""
|
||||||
global FREQ, total, user_word_tag_tab
|
self.check_initialized()
|
||||||
word = strdecode(word)
|
word = strdecode(word)
|
||||||
if freq is None:
|
if freq is None:
|
||||||
freq = suggest_freq(word, False)
|
freq = self.suggest_freq(word, False)
|
||||||
else:
|
else:
|
||||||
freq = int(freq)
|
freq = int(freq)
|
||||||
FREQ[word] = freq
|
self.FREQ[word] = freq
|
||||||
total += freq
|
self.total += freq
|
||||||
if tag is not None:
|
if tag is not None:
|
||||||
user_word_tag_tab[word] = tag
|
self.user_word_tag_tab[word] = tag
|
||||||
for ch in xrange(len(word)):
|
for ch in xrange(len(word)):
|
||||||
wfrag = word[:ch + 1]
|
wfrag = word[:ch + 1]
|
||||||
if wfrag not in FREQ:
|
if wfrag not in self.FREQ:
|
||||||
FREQ[wfrag] = 0
|
self.FREQ[wfrag] = 0
|
||||||
|
|
||||||
|
def del_word(self, word):
|
||||||
def del_word(word):
|
|
||||||
"""
|
"""
|
||||||
Convenient function for deleting a word.
|
Convenient function for deleting a word.
|
||||||
"""
|
"""
|
||||||
add_word(word, 0)
|
self.add_word(word, 0)
|
||||||
|
|
||||||
|
def suggest_freq(self, segment, tune=False):
|
||||||
@require_initialized
|
|
||||||
def suggest_freq(segment, tune=False):
|
|
||||||
"""
|
"""
|
||||||
Suggest word frequency to force the characters in a word to be
|
Suggest word frequency to force the characters in a word to be
|
||||||
joined or splitted.
|
joined or splitted.
|
||||||
@ -380,101 +409,25 @@ def suggest_freq(segment, tune=False):
|
|||||||
Note that HMM may affect the final result. If the result doesn't change,
|
Note that HMM may affect the final result. If the result doesn't change,
|
||||||
set HMM=False.
|
set HMM=False.
|
||||||
"""
|
"""
|
||||||
ftotal = float(total)
|
self.check_initialized()
|
||||||
|
ftotal = float(self.total)
|
||||||
freq = 1
|
freq = 1
|
||||||
if isinstance(segment, string_types):
|
if isinstance(segment, string_types):
|
||||||
word = segment
|
word = segment
|
||||||
for seg in cut(word, HMM=False):
|
for seg in self.cut(word, HMM=False):
|
||||||
freq *= FREQ.get(seg, 1) / ftotal
|
freq *= self.FREQ.get(seg, 1) / ftotal
|
||||||
freq = max(int(freq*total) + 1, FREQ.get(word, 1))
|
freq = max(int(freq * self.total) + 1, self.FREQ.get(word, 1))
|
||||||
else:
|
else:
|
||||||
segment = tuple(map(strdecode, segment))
|
segment = tuple(map(strdecode, segment))
|
||||||
word = ''.join(segment)
|
word = ''.join(segment)
|
||||||
for seg in segment:
|
for seg in segment:
|
||||||
freq *= FREQ.get(seg, 1) / ftotal
|
freq *= self.FREQ.get(seg, 1) / ftotal
|
||||||
freq = min(int(freq*total), FREQ.get(word, 0))
|
freq = min(int(freq * self.total), self.FREQ.get(word, 0))
|
||||||
if tune:
|
if tune:
|
||||||
add_word(word, freq)
|
add_word(word, freq)
|
||||||
return freq
|
return freq
|
||||||
|
|
||||||
|
def tokenize(self, unicode_sentence, mode="default", HMM=True):
|
||||||
__ref_cut = cut
|
|
||||||
__ref_cut_for_search = cut_for_search
|
|
||||||
|
|
||||||
|
|
||||||
def __lcut(sentence):
|
|
||||||
return list(__ref_cut(sentence, False))
|
|
||||||
|
|
||||||
|
|
||||||
def __lcut_no_hmm(sentence):
|
|
||||||
return list(__ref_cut(sentence, False, False))
|
|
||||||
|
|
||||||
|
|
||||||
def __lcut_all(sentence):
|
|
||||||
return list(__ref_cut(sentence, True))
|
|
||||||
|
|
||||||
|
|
||||||
def __lcut_for_search(sentence):
|
|
||||||
return list(__ref_cut_for_search(sentence))
|
|
||||||
|
|
||||||
|
|
||||||
@require_initialized
|
|
||||||
def enable_parallel(processnum=None):
|
|
||||||
global pool, cut, cut_for_search
|
|
||||||
if os.name == 'nt':
|
|
||||||
raise Exception("jieba: parallel mode only supports posix system")
|
|
||||||
from multiprocessing import Pool, cpu_count
|
|
||||||
if processnum is None:
|
|
||||||
processnum = cpu_count()
|
|
||||||
pool = Pool(processnum)
|
|
||||||
|
|
||||||
def pcut(sentence, cut_all=False, HMM=True):
|
|
||||||
parts = strdecode(sentence).splitlines(True)
|
|
||||||
if cut_all:
|
|
||||||
result = pool.map(__lcut_all, parts)
|
|
||||||
elif HMM:
|
|
||||||
result = pool.map(__lcut, parts)
|
|
||||||
else:
|
|
||||||
result = pool.map(__lcut_no_hmm, parts)
|
|
||||||
for r in result:
|
|
||||||
for w in r:
|
|
||||||
yield w
|
|
||||||
|
|
||||||
def pcut_for_search(sentence):
|
|
||||||
parts = strdecode(sentence).splitlines(True)
|
|
||||||
result = pool.map(__lcut_for_search, parts)
|
|
||||||
for r in result:
|
|
||||||
for w in r:
|
|
||||||
yield w
|
|
||||||
|
|
||||||
cut = pcut
|
|
||||||
cut_for_search = pcut_for_search
|
|
||||||
|
|
||||||
|
|
||||||
def disable_parallel():
|
|
||||||
global pool, cut, cut_for_search
|
|
||||||
if pool:
|
|
||||||
pool.close()
|
|
||||||
pool = None
|
|
||||||
cut = __ref_cut
|
|
||||||
cut_for_search = __ref_cut_for_search
|
|
||||||
|
|
||||||
|
|
||||||
def set_dictionary(dictionary_path):
|
|
||||||
global initialized, DICTIONARY
|
|
||||||
with DICT_LOCK:
|
|
||||||
abs_path = os.path.normpath(os.path.join(os.getcwd(), dictionary_path))
|
|
||||||
if not os.path.isfile(abs_path):
|
|
||||||
raise Exception("jieba: file does not exist: " + abs_path)
|
|
||||||
DICTIONARY = abs_path
|
|
||||||
initialized = False
|
|
||||||
|
|
||||||
|
|
||||||
def get_abs_path_dict():
|
|
||||||
return os.path.join(_curpath, DICTIONARY)
|
|
||||||
|
|
||||||
|
|
||||||
def tokenize(unicode_sentence, mode="default", HMM=True):
|
|
||||||
"""
|
"""
|
||||||
Tokenize a sentence and yields tuples of (word, start, end)
|
Tokenize a sentence and yields tuples of (word, start, end)
|
||||||
|
|
||||||
@ -484,25 +437,133 @@ def tokenize(unicode_sentence, mode="default", HMM=True):
|
|||||||
- HMM: whether to use the Hidden Markov Model.
|
- HMM: whether to use the Hidden Markov Model.
|
||||||
"""
|
"""
|
||||||
if not isinstance(unicode_sentence, text_type):
|
if not isinstance(unicode_sentence, text_type):
|
||||||
raise Exception("jieba: the input parameter should be unicode.")
|
raise ValueError("jieba: the input parameter should be unicode.")
|
||||||
start = 0
|
start = 0
|
||||||
if mode == 'default':
|
if mode == 'default':
|
||||||
for w in cut(unicode_sentence, HMM=HMM):
|
for w in self.cut(unicode_sentence, HMM=HMM):
|
||||||
width = len(w)
|
width = len(w)
|
||||||
yield (w, start, start + width)
|
yield (w, start, start + width)
|
||||||
start += width
|
start += width
|
||||||
else:
|
else:
|
||||||
for w in cut(unicode_sentence, HMM=HMM):
|
for w in self.cut(unicode_sentence, HMM=HMM):
|
||||||
width = len(w)
|
width = len(w)
|
||||||
if len(w) > 2:
|
if len(w) > 2:
|
||||||
for i in xrange(len(w) - 1):
|
for i in xrange(len(w) - 1):
|
||||||
gram2 = w[i:i + 2]
|
gram2 = w[i:i + 2]
|
||||||
if FREQ.get(gram2):
|
if self.FREQ.get(gram2):
|
||||||
yield (gram2, start + i, start + i + 2)
|
yield (gram2, start + i, start + i + 2)
|
||||||
if len(w) > 3:
|
if len(w) > 3:
|
||||||
for i in xrange(len(w) - 2):
|
for i in xrange(len(w) - 2):
|
||||||
gram3 = w[i:i + 3]
|
gram3 = w[i:i + 3]
|
||||||
if FREQ.get(gram3):
|
if self.FREQ.get(gram3):
|
||||||
yield (gram3, start + i, start + i + 3)
|
yield (gram3, start + i, start + i + 3)
|
||||||
yield (w, start, start + width)
|
yield (w, start, start + width)
|
||||||
start += width
|
start += width
|
||||||
|
|
||||||
|
def set_dictionary(self, dictionary_path):
|
||||||
|
with self.lock:
|
||||||
|
abs_path = _get_abs_path(dictionary_path)
|
||||||
|
if not os.path.isfile(abs_path):
|
||||||
|
raise Exception("jieba: file does not exist: " + abs_path)
|
||||||
|
self.dictionary = abs_path
|
||||||
|
self.initialized = False
|
||||||
|
|
||||||
|
|
||||||
|
# default Tokenizer instance
|
||||||
|
|
||||||
|
dt = Tokenizer()
|
||||||
|
|
||||||
|
# global functions
|
||||||
|
|
||||||
|
FREQ = dt.FREQ
|
||||||
|
add_word = dt.add_word
|
||||||
|
calc = dt.calc
|
||||||
|
cut = dt.cut
|
||||||
|
lcut = dt.lcut
|
||||||
|
cut_for_search = dt.cut_for_search
|
||||||
|
lcut_for_search = dt.lcut_for_search
|
||||||
|
del_word = dt.del_word
|
||||||
|
get_DAG = dt.get_DAG
|
||||||
|
get_abs_path_dict = dt.get_abs_path_dict
|
||||||
|
initialize = dt.initialize
|
||||||
|
load_userdict = dt.load_userdict
|
||||||
|
set_dictionary = dt.set_dictionary
|
||||||
|
suggest_freq = dt.suggest_freq
|
||||||
|
tokenize = dt.tokenize
|
||||||
|
user_word_tag_tab = dt.user_word_tag_tab
|
||||||
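A small sanity check (not part of the diff) of what "global functions are mapped to jieba.dt" means after this change:

```python
import jieba

# module-level cut is simply the bound method of the default Tokenizer
print(jieba.cut.__self__ is jieba.dt)      # True
print(jieba.lcut("我来到北京清华大学"))
```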
|
|
||||||
|
|
||||||
|
def _lcut_all(s):
|
||||||
|
return dt._lcut_all(s)
|
||||||
|
|
||||||
|
|
||||||
|
def _lcut(s):
|
||||||
|
return dt._lcut(s)
|
||||||
|
|
||||||
|
|
||||||
|
def _lcut_all(s):
|
||||||
|
return dt._lcut_all(s)
|
||||||
|
|
||||||
|
|
||||||
|
def _lcut_for_search(s):
|
||||||
|
return dt._lcut_for_search(s)
|
||||||
|
|
||||||
|
|
||||||
|
def _lcut_for_search_no_hmm(s):
|
||||||
|
return dt._lcut_for_search_no_hmm(s)
|
||||||
|
|
||||||
|
|
||||||
|
def _pcut(sentence, cut_all=False, HMM=True):
|
||||||
|
parts = strdecode(sentence).splitlines(True)
|
||||||
|
if cut_all:
|
||||||
|
result = pool.map(_lcut_all, parts)
|
||||||
|
elif HMM:
|
||||||
|
result = pool.map(_lcut, parts)
|
||||||
|
else:
|
||||||
|
result = pool.map(_lcut_no_hmm, parts)
|
||||||
|
for r in result:
|
||||||
|
for w in r:
|
||||||
|
yield w
|
||||||
|
|
||||||
|
|
||||||
|
def _pcut_for_search(sentence, HMM=True):
|
||||||
|
parts = strdecode(sentence).splitlines(True)
|
||||||
|
if HMM:
|
||||||
|
result = pool.map(_lcut_for_search, parts)
|
||||||
|
else:
|
||||||
|
result = pool.map(_lcut_for_search_no_hmm, parts)
|
||||||
|
for r in result:
|
||||||
|
for w in r:
|
||||||
|
yield w
|
||||||
|
|
||||||
|
|
||||||
|
def enable_parallel(processnum=None):
|
||||||
|
"""
|
||||||
|
Change the module's `cut` and `cut_for_search` functions to the
|
||||||
|
parallel version.
|
||||||
|
|
||||||
|
Note that this only works using dt, custom Tokenizer
|
||||||
|
instances are not supported.
|
||||||
|
"""
|
||||||
|
global pool, dt, cut, cut_for_search
|
||||||
|
from multiprocessing import cpu_count
|
||||||
|
if os.name == 'nt':
|
||||||
|
raise NotImplementedError(
|
||||||
|
"jieba: parallel mode only supports posix system")
|
||||||
|
else:
|
||||||
|
from multiprocessing import Pool
|
||||||
|
dt.check_initialized()
|
||||||
|
if processnum is None:
|
||||||
|
processnum = cpu_count()
|
||||||
|
pool = Pool(processnum)
|
||||||
|
cut = _pcut
|
||||||
|
cut_for_search = _pcut_for_search
|
||||||
|
|
||||||
|
|
||||||
|
def disable_parallel():
|
||||||
|
global pool, dt, cut, cut_for_search
|
||||||
|
if pool:
|
||||||
|
pool.close()
|
||||||
|
pool = None
|
||||||
|
cut = dt.cut
|
||||||
|
cut_for_search = dt.cut_for_search
|
||||||
|
@ -1,103 +1,18 @@
|
|||||||
#encoding=utf-8
|
|
||||||
from __future__ import absolute_import
|
from __future__ import absolute_import
|
||||||
import jieba
|
from .tfidf import TFIDF
|
||||||
import jieba.posseg
|
from .textrank import TextRank
|
||||||
import os
|
|
||||||
from operator import itemgetter
|
|
||||||
from .textrank import textrank
|
|
||||||
try:
|
try:
|
||||||
from .analyzer import ChineseAnalyzer
|
from .analyzer import ChineseAnalyzer
|
||||||
except ImportError:
|
except ImportError:
|
||||||
pass
|
pass
|
||||||
|
|
||||||
_curpath = os.path.normpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
|
default_tfidf = TFIDF()
|
||||||
abs_path = os.path.join(_curpath, "idf.txt")
|
default_textrank = TextRank()
|
||||||
|
|
||||||
STOP_WORDS = set((
|
extract_tags = tfidf = default_tfidf.extract_tags
|
||||||
"the","of","is","and","to","in","that","we","for","an","are",
|
set_idf_path = default_tfidf.set_idf_path
|
||||||
"by","be","as","on","with","can","if","from","which","you","it",
|
textrank = default_textrank.extract_tags
|
||||||
"this","then","at","have","all","not","one","has","or","that"
|
|
||||||
))
|
|
||||||
|
|
||||||
class IDFLoader:
|
|
||||||
def __init__(self):
|
|
||||||
self.path = ""
|
|
||||||
self.idf_freq = {}
|
|
||||||
self.median_idf = 0.0
|
|
||||||
|
|
||||||
def set_new_path(self, new_idf_path):
|
|
||||||
if self.path != new_idf_path:
|
|
||||||
content = open(new_idf_path, 'rb').read().decode('utf-8')
|
|
||||||
idf_freq = {}
|
|
||||||
lines = content.rstrip('\n').split('\n')
|
|
||||||
for line in lines:
|
|
||||||
word, freq = line.split(' ')
|
|
||||||
idf_freq[word] = float(freq)
|
|
||||||
median_idf = sorted(idf_freq.values())[len(idf_freq)//2]
|
|
||||||
self.idf_freq = idf_freq
|
|
||||||
self.median_idf = median_idf
|
|
||||||
self.path = new_idf_path
|
|
||||||
|
|
||||||
def get_idf(self):
|
|
||||||
return self.idf_freq, self.median_idf
|
|
||||||
|
|
||||||
idf_loader = IDFLoader()
|
|
||||||
idf_loader.set_new_path(abs_path)
|
|
||||||
|
|
||||||
def set_idf_path(idf_path):
|
|
||||||
new_abs_path = os.path.normpath(os.path.join(os.getcwd(), idf_path))
|
|
||||||
if not os.path.exists(new_abs_path):
|
|
||||||
raise Exception("jieba: path does not exist: " + new_abs_path)
|
|
||||||
idf_loader.set_new_path(new_abs_path)
|
|
||||||
|
|
||||||
def set_stop_words(stop_words_path):
|
def set_stop_words(stop_words_path):
|
||||||
global STOP_WORDS
|
default_tfidf.set_stop_words(stop_words_path)
|
||||||
abs_path = os.path.normpath(os.path.join(os.getcwd(), stop_words_path))
|
default_textrank.set_stop_words(stop_words_path)
|
||||||
if not os.path.exists(abs_path):
|
|
||||||
raise Exception("jieba: path does not exist: " + abs_path)
|
|
||||||
content = open(abs_path,'rb').read().decode('utf-8')
|
|
||||||
lines = content.replace("\r", "").split('\n')
|
|
||||||
for line in lines:
|
|
||||||
STOP_WORDS.add(line)
|
|
||||||
|
|
||||||
def extract_tags(sentence, topK=20, withWeight=False, allowPOS=[]):
|
|
||||||
"""
|
|
||||||
Extract keywords from sentence using TF-IDF algorithm.
|
|
||||||
Parameter:
|
|
||||||
- topK: return how many top keywords. `None` for all possible words.
|
|
||||||
- withWeight: if True, return a list of (word, weight);
|
|
||||||
if False, return a list of words.
|
|
||||||
- allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr'].
|
|
||||||
if the POS of w is not in this list,it will be filtered.
|
|
||||||
"""
|
|
||||||
global STOP_WORDS, idf_loader
|
|
||||||
|
|
||||||
idf_freq, median_idf = idf_loader.get_idf()
|
|
||||||
|
|
||||||
if allowPOS:
|
|
||||||
allowPOS = frozenset(allowPOS)
|
|
||||||
words = jieba.posseg.cut(sentence)
|
|
||||||
else:
|
|
||||||
words = jieba.cut(sentence)
|
|
||||||
freq = {}
|
|
||||||
for w in words:
|
|
||||||
if allowPOS:
|
|
||||||
if w.flag not in allowPOS:
|
|
||||||
continue
|
|
||||||
else:
|
|
||||||
w = w.word
|
|
||||||
if len(w.strip()) < 2 or w.lower() in STOP_WORDS:
|
|
||||||
continue
|
|
||||||
freq[w] = freq.get(w, 0.0) + 1.0
|
|
||||||
total = sum(freq.values())
|
|
||||||
for k in freq:
|
|
||||||
freq[k] *= idf_freq.get(k, median_idf) / total
|
|
||||||
|
|
||||||
if withWeight:
|
|
||||||
tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
|
|
||||||
else:
|
|
||||||
tags = sorted(freq, key=freq.__getitem__, reverse=True)
|
|
||||||
if topK:
|
|
||||||
return tags[:topK]
|
|
||||||
else:
|
|
||||||
return tags
|
|
||||||
|
@ -1,7 +1,7 @@
|
|||||||
#encoding=utf-8
|
# encoding=utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
from whoosh.analysis import RegexAnalyzer,LowercaseFilter,StopFilter,StemFilter
|
from whoosh.analysis import RegexAnalyzer, LowercaseFilter, StopFilter, StemFilter
|
||||||
from whoosh.analysis import Tokenizer,Token
|
from whoosh.analysis import Tokenizer, Token
|
||||||
from whoosh.lang.porter import stem
|
from whoosh.lang.porter import stem
|
||||||
|
|
||||||
import jieba
|
import jieba
|
||||||
@ -15,12 +15,14 @@ STOP_WORDS = frozenset(('a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'can',
|
|||||||
|
|
||||||
accepted_chars = re.compile(r"[\u4E00-\u9FA5]+")
|
accepted_chars = re.compile(r"[\u4E00-\u9FA5]+")
|
||||||
|
|
||||||
|
|
||||||
class ChineseTokenizer(Tokenizer):
|
class ChineseTokenizer(Tokenizer):
|
||||||
|
|
||||||
def __call__(self, text, **kargs):
|
def __call__(self, text, **kargs):
|
||||||
words = jieba.tokenize(text, mode="search")
|
words = jieba.tokenize(text, mode="search")
|
||||||
token = Token()
|
token = Token()
|
||||||
for (w,start_pos,stop_pos) in words:
|
for (w, start_pos, stop_pos) in words:
|
||||||
if not accepted_chars.match(w) and len(w)<=1:
|
if not accepted_chars.match(w) and len(w) <= 1:
|
||||||
continue
|
continue
|
||||||
token.original = token.text = w
|
token.original = token.text = w
|
||||||
token.pos = start_pos
|
token.pos = start_pos
|
||||||
@ -28,7 +30,8 @@ class ChineseTokenizer(Tokenizer):
|
|||||||
token.endchar = stop_pos
|
token.endchar = stop_pos
|
||||||
yield token
|
yield token
|
||||||
|
|
||||||
|
|
||||||
def ChineseAnalyzer(stoplist=STOP_WORDS, minsize=1, stemfn=stem, cachesize=50000):
|
def ChineseAnalyzer(stoplist=STOP_WORDS, minsize=1, stemfn=stem, cachesize=50000):
|
||||||
return (ChineseTokenizer() | LowercaseFilter() |
|
return (ChineseTokenizer() | LowercaseFilter() |
|
||||||
StopFilter(stoplist=stoplist,minsize=minsize) |
|
StopFilter(stoplist=stoplist, minsize=minsize) |
|
||||||
StemFilter(stemfn=stemfn, ignore=None,cachesize=cachesize))
|
StemFilter(stemfn=stemfn, ignore=None, cachesize=cachesize))
|
||||||
|
jieba/analyse/textrank.py

@@ -3,9 +3,10 @@
 from __future__ import absolute_import, unicode_literals
 import sys
-import collections
 from operator import itemgetter
-import jieba.posseg as pseg
+from collections import defaultdict
+import jieba.posseg
+from .tfidf import KeywordExtractor
 from .._compat import *
 
 
@@ -13,7 +14,7 @@ class UndirectWeightedGraph:
     d = 0.85
 
     def __init__(self):
-        self.graph = collections.defaultdict(list)
+        self.graph = defaultdict(list)
 
     def addEdge(self, start, end, weight):
         # use a tuple (start, end, weight) instead of a Edge object
@@ -21,8 +22,8 @@ class UndirectWeightedGraph:
         self.graph[end].append((end, start, weight))
 
     def rank(self):
-        ws = collections.defaultdict(float)
-        outSum = collections.defaultdict(float)
+        ws = defaultdict(float)
+        outSum = defaultdict(float)
 
         wsdef = 1.0 / (len(self.graph) or 1.0)
         for n, out in self.graph.items():
@@ -53,7 +54,19 @@ class UndirectWeightedGraph:
         return ws
 
 
-def textrank(sentence, topK=10, withWeight=False, allowPOS=['ns', 'n', 'vn', 'v']):
+class TextRank(KeywordExtractor):
+
+    def __init__(self):
+        self.tokenizer = self.postokenizer = jieba.posseg.dt
+        self.stop_words = self.STOP_WORDS.copy()
+        self.pos_filt = frozenset(('ns', 'n', 'vn', 'v'))
+        self.span = 5
+
+    def pairfilter(self, wp):
+        return (wp.flag in self.pos_filt and len(wp.word.strip()) >= 2
+                and wp.word.lower() not in self.stop_words)
+
+    def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')):
         """
         Extract keywords from sentence using TextRank algorithm.
         Parameter:
@@ -61,21 +74,20 @@ def textrank(sentence, topK=10, withWeight=False, allowPOS=['ns', 'n', 'vn', 'v'
         - withWeight: if True, return a list of (word, weight);
                       if False, return a list of words.
         - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
-                    if the POS of w is not in this list,it will be filtered.
+                    if the POS of w is not in this list, it will be filtered.
         """
-    pos_filt = frozenset(allowPOS)
-    g = UndirectWeightedGraph()
-    cm = collections.defaultdict(int)
-    span = 5
-    words = list(pseg.cut(sentence))
-    for i in xrange(len(words)):
-        if words[i].flag in pos_filt:
-            for j in xrange(i + 1, i + span):
-                if j >= len(words):
-                    break
-                if words[j].flag not in pos_filt:
-                    continue
-                cm[(words[i].word, words[j].word)] += 1
+        self.pos_filt = frozenset(allowPOS)
+        g = UndirectWeightedGraph()
+        cm = defaultdict(int)
+        words = tuple(self.tokenizer.cut(sentence))
+        for i, wp in enumerate(words):
+            if self.pairfilter(wp):
+                for j in xrange(i + 1, i + self.span):
+                    if j >= len(words):
+                        break
+                    if not self.pairfilter(words[j]):
+                        continue
+                    cm[(wp.word, words[j].word)] += 1
 
         for terms, w in cm.items():
             g.addEdge(terms[0], terms[1], w)
@@ -89,7 +101,4 @@ def textrank(sentence, topK=10, withWeight=False, allowPOS=['ns', 'n', 'vn', 'v'
         else:
             return tags
 
-if __name__ == '__main__':
-    s = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元,增资后,吉林欧亚置业注册资本由7000万元增加到5亿元。吉林欧亚置业主要经营范围为房地产开发及百货零售等业务。目前在建吉林欧亚城市商业综合体项目。2013年,实现营业收入0万元,实现净利润-139.13万元。"
-    for x, w in textrank(s, withWeight=True):
-        print('%s %s' % (x, w))
+    extract_tags = textrank
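A minimal usage sketch of the class-based extractor above. It assumes, as the commit message states, that `jieba.analyse` exposes the `TextRank` class and maps the module-level `textrank` function onto a default instance; the sample sentence is abridged from the demo and purely illustrative.

```python
# encoding=utf-8
# Sketch only: exercising the TextRank extractor refactored above.
import jieba.analyse

s = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元。"  # abridged sample text

# Module-level function, now backed by a default TextRank instance and
# returning 20 keywords by default.
for word, weight in jieba.analyse.textrank(s, withWeight=True):
    print('%s %s' % (word, weight))

# A dedicated instance; allowPOS is passed through to frozenset(), matching
# the textrank() signature shown in the diff.
tr = jieba.analyse.TextRank()
print(tr.textrank(s, topK=5, allowPOS=('ns', 'n')))
```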
jieba/analyse/tfidf.py  (new executable file, 111 lines)
@@ -0,0 +1,111 @@

# encoding=utf-8
from __future__ import absolute_import
import os
import jieba
import jieba.posseg
from operator import itemgetter

_get_module_path = lambda path: os.path.normpath(os.path.join(os.getcwd(),
                                                 os.path.dirname(__file__), path))
_get_abs_path = jieba._get_abs_path

DEFAULT_IDF = _get_module_path("idf.txt")


class KeywordExtractor(object):

    STOP_WORDS = set((
        "the", "of", "is", "and", "to", "in", "that", "we", "for", "an", "are",
        "by", "be", "as", "on", "with", "can", "if", "from", "which", "you", "it",
        "this", "then", "at", "have", "all", "not", "one", "has", "or", "that"
    ))

    def set_stop_words(self, stop_words_path):
        abs_path = _get_abs_path(stop_words_path)
        if not os.path.isfile(abs_path):
            raise Exception("jieba: file does not exist: " + abs_path)
        content = open(abs_path, 'rb').read().decode('utf-8')
        for line in content.splitlines():
            self.stop_words.add(line)

    def extract_tags(self, *args, **kwargs):
        raise NotImplementedError


class IDFLoader(object):

    def __init__(self, idf_path=None):
        self.path = ""
        self.idf_freq = {}
        self.median_idf = 0.0
        if idf_path:
            self.set_new_path(idf_path)

    def set_new_path(self, new_idf_path):
        if self.path != new_idf_path:
            self.path = new_idf_path
            content = open(new_idf_path, 'rb').read().decode('utf-8')
            self.idf_freq = {}
            for line in content.splitlines():
                word, freq = line.strip().split(' ')
                self.idf_freq[word] = float(freq)
            self.median_idf = sorted(
                self.idf_freq.values())[len(self.idf_freq) // 2]

    def get_idf(self):
        return self.idf_freq, self.median_idf


class TFIDF(KeywordExtractor):

    def __init__(self, idf_path=None):
        self.tokenizer = jieba.dt
        self.postokenizer = jieba.posseg.dt
        self.stop_words = self.STOP_WORDS.copy()
        self.idf_loader = IDFLoader(idf_path or DEFAULT_IDF)
        self.idf_freq, self.median_idf = self.idf_loader.get_idf()

    def set_idf_path(self, idf_path):
        new_abs_path = _get_abs_path(idf_path)
        if not os.path.isfile(new_abs_path):
            raise Exception("jieba: file does not exist: " + new_abs_path)
        self.idf_loader.set_new_path(new_abs_path)
        self.idf_freq, self.median_idf = self.idf_loader.get_idf()

    def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=()):
        """
        Extract keywords from sentence using TF-IDF algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr'].
                        if the POS of w is not in this list,it will be filtered.
        """
        if allowPOS:
            allowPOS = frozenset(allowPOS)
            words = self.postokenizer.cut(sentence)
        else:
            words = self.tokenizer.cut(sentence)
        freq = {}
        for w in words:
            if allowPOS:
                if w.flag not in allowPOS:
                    continue
                else:
                    w = w.word
            if len(w.strip()) < 2 or w.lower() in self.stop_words:
                continue
            freq[w] = freq.get(w, 0.0) + 1.0
        total = sum(freq.values())
        for k in freq:
            freq[k] *= self.idf_freq.get(k, self.median_idf) / total

        if withWeight:
            tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
        else:
            tags = sorted(freq, key=freq.__getitem__, reverse=True)
        if topK:
            return tags[:topK]
        else:
            return tags
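A hedged sketch of how the new `TFIDF` class might be driven, using only the methods defined in the file above; the custom IDF and stop-word paths in the comments are placeholders, not files guaranteed to ship with the package.

```python
# encoding=utf-8
# Sketch only: the class-based TF-IDF keyword extractor added in this commit.
from jieba.analyse.tfidf import TFIDF  # also exposed as jieba.analyse.TFIDF per the commit message

tfidf = TFIDF()  # loads the bundled idf.txt by default

s = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元。"  # abridged sample text

# Top keywords with weights; passing allowPOS switches to the POS tokenizer.
for word, weight in tfidf.extract_tags(s, topK=10, withWeight=True):
    print('%s %s' % (word, weight))
print(tfidf.extract_tags(s, topK=5, allowPOS=('ns', 'n', 'vn', 'v')))

# Placeholder paths: each instance can carry its own IDF corpus and stop words.
# tfidf.set_idf_path('path/to/idf.big.txt')
# tfidf.set_stop_words('path/to/stop_words.txt')
```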
jieba/posseg/__init__.py

@@ -1,10 +1,9 @@
 from __future__ import absolute_import, unicode_literals
-import re
 import os
-import jieba
+import re
 import sys
+import jieba
 import marshal
-from functools import wraps
 from .._compat import *
 from .viterbi import viterbi
 
@@ -24,23 +23,10 @@ re_num = re.compile("[\.0-9]+")
 re_eng1 = re.compile('^[a-zA-Z0-9]$', re.U)
 
 
-def load_model(f_name, isJython=True):
+def load_model(f_name):
     _curpath = os.path.normpath(
         os.path.join(os.getcwd(), os.path.dirname(__file__)))
+    # For Jython
-    result = {}
-    with open(f_name, "rb") as f:
-        for line in f:
-            line = line.strip()
-            if not line:
-                continue
-            line = line.decode("utf-8")
-            word, _, tag = line.split(" ")
-            result[word] = tag
-
-    if not isJython:
-        return result
 
     start_p = {}
     abs_path = os.path.join(_curpath, PROB_START_P)
     with open(abs_path, 'rb') as f:
@@ -64,29 +50,15 @@ def load_model(f_name, isJython=True):
 
     return state, start_p, trans_p, emit_p, result
 
 
 if sys.platform.startswith("java"):
-    char_state_tab_P, start_P, trans_P, emit_P, word_tag_tab = load_model(
-        jieba.get_abs_path_dict())
+    char_state_tab_P, start_P, trans_P, emit_P, word_tag_tab = load_model()
 else:
     from .char_state_tab import P as char_state_tab_P
     from .prob_start import P as start_P
     from .prob_trans import P as trans_P
     from .prob_emit import P as emit_P
 
-    word_tag_tab = load_model(jieba.get_abs_path_dict(), isJython=False)
-
-
-def makesure_userdict_loaded(fn):
-
-    @wraps(fn)
-    def wrapped(*args, **kwargs):
-        if jieba.user_word_tag_tab:
-            word_tag_tab.update(jieba.user_word_tag_tab)
-            jieba.user_word_tag_tab = {}
-        return fn(*args, **kwargs)
-    return wrapped
-
 
 class pair(object):
 
@@ -110,7 +82,45 @@ class pair(object):
         return self.__unicode__().encode(arg)
 
 
-def __cut(sentence):
+class POSTokenizer(object):
+
+    def __init__(self, tokenizer=None):
+        self.tokenizer = tokenizer or jieba.Tokenizer()
+        self.load_word_tag(self.tokenizer.get_abs_path_dict())
+
+    def __repr__(self):
+        return '<POSTokenizer tokenizer=%r>' % self.tokenizer
+
+    def __getattr__(self, name):
+        if name in ('cut_for_search', 'lcut_for_search', 'tokenize'):
+            # may be possible?
+            raise NotImplementedError
+        return getattr(self.tokenizer, name)
+
+    def initialize(self, dictionary=None):
+        self.tokenizer.initialize(dictionary)
+        self.load_word_tag(self.tokenizer.get_abs_path_dict())
+
+    def load_word_tag(self, f_name):
+        self.word_tag_tab = {}
+        with open(f_name, "rb") as f:
+            for lineno, line in enumerate(f, 1):
+                try:
+                    line = line.strip().decode("utf-8")
+                    if not line:
+                        continue
+                    word, _, tag = line.split(" ")
+                    self.word_tag_tab[word] = tag
+                except Exception:
+                    raise ValueError(
+                        'invalid POS dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
+
+    def makesure_userdict_loaded(self):
+        if self.tokenizer.user_word_tag_tab:
+            self.word_tag_tab.update(self.tokenizer.user_word_tag_tab)
+            self.tokenizer.user_word_tag_tab = {}
+
+    def __cut(self, sentence):
         prob, pos_list = viterbi(
             sentence, char_state_tab_P, start_P, trans_P, emit_P)
         begin, nexti = 0, 0
@@ -128,12 +138,11 @@ def __cut(sentence):
         if nexti < len(sentence):
             yield pair(sentence[nexti:], pos_list[nexti][1])
 
-def __cut_detail(sentence):
+    def __cut_detail(self, sentence):
         blocks = re_han_detail.split(sentence)
         for blk in blocks:
             if re_han_detail.match(blk):
-                for word in __cut(blk):
+                for word in self.__cut(blk):
                     yield word
             else:
                 tmp = re_skip_detail.split(blk)
@@ -146,11 +155,10 @@ def __cut_detail(sentence):
             else:
                 yield pair(x, 'x')
 
-def __cut_DAG_NO_HMM(sentence):
-    DAG = jieba.get_DAG(sentence)
+    def __cut_DAG_NO_HMM(self, sentence):
+        DAG = self.tokenizer.get_DAG(sentence)
         route = {}
-    jieba.calc(sentence, DAG, route)
+        self.tokenizer.calc(sentence, DAG, route)
         x = 0
         N = len(sentence)
         buf = ''
@@ -164,18 +172,17 @@ def __cut_DAG_NO_HMM(sentence):
             if buf:
                 yield pair(buf, 'eng')
                 buf = ''
-            yield pair(l_word, word_tag_tab.get(l_word, 'x'))
+            yield pair(l_word, self.word_tag_tab.get(l_word, 'x'))
             x = y
         if buf:
             yield pair(buf, 'eng')
             buf = ''
 
-def __cut_DAG(sentence):
-    DAG = jieba.get_DAG(sentence)
+    def __cut_DAG(self, sentence):
+        DAG = self.tokenizer.get_DAG(sentence)
         route = {}
 
-    jieba.calc(sentence, DAG, route)
+        self.tokenizer.calc(sentence, DAG, route)
 
         x = 0
         buf = ''
@@ -188,41 +195,41 @@ def __cut_DAG(sentence):
             else:
                 if buf:
                     if len(buf) == 1:
-                        yield pair(buf, word_tag_tab.get(buf, 'x'))
-                    elif not jieba.FREQ.get(buf):
-                        recognized = __cut_detail(buf)
+                        yield pair(buf, self.word_tag_tab.get(buf, 'x'))
+                    elif not self.tokenizer.FREQ.get(buf):
+                        recognized = self.__cut_detail(buf)
                         for t in recognized:
                             yield t
                     else:
                         for elem in buf:
-                            yield pair(elem, word_tag_tab.get(elem, 'x'))
+                            yield pair(elem, self.word_tag_tab.get(elem, 'x'))
                     buf = ''
-                yield pair(l_word, word_tag_tab.get(l_word, 'x'))
+                yield pair(l_word, self.word_tag_tab.get(l_word, 'x'))
             x = y
 
         if buf:
             if len(buf) == 1:
-                yield pair(buf, word_tag_tab.get(buf, 'x'))
-            elif not jieba.FREQ.get(buf):
-                recognized = __cut_detail(buf)
+                yield pair(buf, self.word_tag_tab.get(buf, 'x'))
+            elif not self.tokenizer.FREQ.get(buf):
+                recognized = self.__cut_detail(buf)
                 for t in recognized:
                     yield t
             else:
                 for elem in buf:
-                    yield pair(elem, word_tag_tab.get(elem, 'x'))
+                    yield pair(elem, self.word_tag_tab.get(elem, 'x'))
 
-def __cut_internal(sentence, HMM=True):
+    def __cut_internal(self, sentence, HMM=True):
+        self.makesure_userdict_loaded()
         sentence = strdecode(sentence)
         blocks = re_han_internal.split(sentence)
         if HMM:
-        __cut_blk = __cut_DAG
+            cut_blk = self.__cut_DAG
         else:
-        __cut_blk = __cut_DAG_NO_HMM
+            cut_blk = self.__cut_DAG_NO_HMM
 
         for blk in blocks:
             if re_han_internal.match(blk):
-                for word in __cut_blk(blk):
+                for word in cut_blk(blk):
                     yield word
             else:
                 tmp = re_skip_internal.split(blk)
@@ -238,26 +245,57 @@ def __cut_internal(sentence, HMM=True):
                     else:
                         yield pair(xx, 'x')
 
+    def _lcut_internal(self, sentence):
+        return list(self.__cut_internal(sentence))
 
-def __lcut_internal(sentence):
-    return list(__cut_internal(sentence))
+    def _lcut_internal_no_hmm(self, sentence):
+        return list(self.__cut_internal(sentence, False))
 
+    def cut(self, sentence, HMM=True):
+        for w in self.__cut_internal(sentence, HMM=HMM):
+            yield w
+
+    def lcut(self, *args, **kwargs):
+        return list(self.cut(*args, **kwargs))
+
+
+# default Tokenizer instance
+dt = POSTokenizer(jieba.dt)
+
+# global functions
+initialize = dt.initialize
+
 
-def __lcut_internal_no_hmm(sentence):
-    return list(__cut_internal(sentence, False))
+def _lcut_internal(s):
+    return dt._lcut_internal(s)
 
 
+def _lcut_internal_no_hmm(s):
+    return dt._lcut_internal_no_hmm(s)
+
+
-@makesure_userdict_loaded
 def cut(sentence, HMM=True):
+    """
+    Global `cut` function that supports parallel processing.
+
+    Note that this only works using dt, custom POSTokenizer
+    instances are not supported.
+    """
+    global dt
     if jieba.pool is None:
-        for w in __cut_internal(sentence, HMM=HMM):
+        for w in dt.cut(sentence, HMM=HMM):
             yield w
     else:
         parts = strdecode(sentence).splitlines(True)
         if HMM:
-            result = jieba.pool.map(__lcut_internal, parts)
+            result = jieba.pool.map(_lcut_internal, parts)
         else:
-            result = jieba.pool.map(__lcut_internal_no_hmm, parts)
+            result = jieba.pool.map(_lcut_internal_no_hmm, parts)
         for r in result:
            for w in r:
                yield w
+
+
+def lcut(sentence, HMM=True):
+    return list(cut(sentence, HMM))
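A short sketch of the per-instance API introduced above: the module-level `cut`/`lcut` delegate to the default `jieba.posseg.dt`, while an independent `POSTokenizer` can wrap its own `jieba.Tokenizer`. The small-dictionary path is the one used by the test below and assumes it is reachable from the working directory.

```python
# encoding=utf-8
# Sketch only: default vs. per-instance POS tagging after this refactor.
import jieba
import jieba.posseg

# Module-level functions delegate to the default instance jieba.posseg.dt.
for w in jieba.posseg.cut("我爱北京天安门"):
    print('%s %s' % (w.word, w.flag))

# An independent POSTokenizer wrapping its own Tokenizer / dictionary
# (path borrowed from test/test_lock.py; adjust to your layout).
postokr = jieba.posseg.POSTokenizer(jieba.Tokenizer('../extra_dict/dict.txt.small'))
print(' '.join('%s/%s' % (w.word, w.flag) for w in postokr.lcut("我爱北京天安门")))
```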
test/demo.py  (66 changed lines)

@@ -4,6 +4,12 @@ import sys
 sys.path.append("../")
 
 import jieba
+import jieba.posseg
+import jieba.analyse
+
+print('='*40)
+print('1. 分词')
+print('-'*40)
 
 seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
 print("Full Mode: " + "/ ".join(seg_list))  # 全模式
@@ -16,3 +22,63 @@ print(", ".join(seg_list))
 
 seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # 搜索引擎模式
 print(", ".join(seg_list))
+
+print('='*40)
+print('2. 添加自定义词典/调整词典')
+print('-'*40)
+
+print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
+#如果/放到/post/中将/出错/。
+print(jieba.suggest_freq(('中', '将'), True))
+#494
+print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
+#如果/放到/post/中/将/出错/。
+print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
+#「/台/中/」/正确/应该/不会/被/切开
+print(jieba.suggest_freq('台中', True))
+#69
+print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
+#「/台中/」/正确/应该/不会/被/切开
+
+print('='*40)
+print('3. 关键词提取')
+print('-'*40)
+print(' TF-IDF')
+print('-'*40)
+
+s = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元,增资后,吉林欧亚置业注册资本由7000万元增加到5亿元。吉林欧亚置业主要经营范围为房地产开发及百货零售等业务。目前在建吉林欧亚城市商业综合体项目。2013年,实现营业收入0万元,实现净利润-139.13万元。"
+for x, w in jieba.analyse.extract_tags(s, withWeight=True):
+    print('%s %s' % (x, w))
+
+print('-'*40)
+print(' TextRank')
+print('-'*40)
+
+for x, w in jieba.analyse.textrank(s, withWeight=True):
+    print('%s %s' % (x, w))
+
+print('='*40)
+print('4. 词性标注')
+print('-'*40)
+
+words = jieba.posseg.cut("我爱北京天安门")
+for w in words:
+    print('%s %s' % (w.word, w.flag))
+
+print('='*40)
+print('6. Tokenize: 返回词语在原文的起止位置')
+print('-'*40)
+print(' 默认模式')
+print('-'*40)
+
+result = jieba.tokenize('永和服装饰品有限公司')
+for tk in result:
+    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
+
+print('-'*40)
+print(' 搜索模式')
+print('-'*40)
+
+result = jieba.tokenize('永和服装饰品有限公司', mode='search')
+for tk in result:
+    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
test/test_lock.py  (new file, 42 lines)
@@ -0,0 +1,42 @@

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import jieba
import threading

def inittokenizer(tokenizer, group):
    print('===> Thread %s:%s started' % (group, threading.current_thread().ident))
    tokenizer.initialize()
    print('<=== Thread %s:%s finished' % (group, threading.current_thread().ident))

tokrs1 = [jieba.Tokenizer() for n in range(5)]
tokrs2 = [jieba.Tokenizer('../extra_dict/dict.txt.small') for n in range(5)]

thr1 = [threading.Thread(target=inittokenizer, args=(tokr, 1)) for tokr in tokrs1]
thr2 = [threading.Thread(target=inittokenizer, args=(tokr, 2)) for tokr in tokrs2]
for thr in thr1:
    thr.start()
for thr in thr2:
    thr.start()
for thr in thr1:
    thr.join()
for thr in thr2:
    thr.join()

del tokrs1, tokrs2

print('='*40)

tokr1 = jieba.Tokenizer()
tokr2 = jieba.Tokenizer('../extra_dict/dict.txt.small')

thr1 = [threading.Thread(target=inittokenizer, args=(tokr1, 1)) for n in range(5)]
thr2 = [threading.Thread(target=inittokenizer, args=(tokr2, 2)) for n in range(5)]
for thr in thr1:
    thr.start()
for thr in thr2:
    thr.start()
for thr in thr1:
    thr.join()
for thr in thr2:
    thr.join()