wraps most globals in classes

API changes:
* class jieba.Tokenizer, jieba.posseg.POSTokenizer
* class jieba.analyse.TFIDF, jieba.analyse.TextRank
* global functions are mapped to jieba.(posseg.)dt, the default (POS)Tokenizer
* multiprocessing only works with jieba.(posseg.)dt
* new lcut, lcut_for_search functions that return a list
* jieba.analyse.textrank now returns 20 items by default
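For illustration, a minimal sketch of the new object-based API summarized above (the alternative dictionary path mirrors the one used in test/test_lock.py and is otherwise arbitrary):

```python
# Sketch of the class-based API: global functions are thin wrappers around default instances.
import jieba
import jieba.posseg
import jieba.analyse

# Global functions map to the default Tokenizer, jieba.dt.
print(jieba.lcut("我来到北京清华大学"))               # lcut returns a list instead of a generator
print(jieba.lcut_for_search("小明硕士毕业于中国科学院计算所"))

# Independent tokenizers can carry their own dictionaries.
tok = jieba.Tokenizer()                               # same dictionary as jieba.dt
small = jieba.Tokenizer('extra_dict/dict.txt.small')  # alternative dictionary path
print(tok.lcut("我来到北京清华大学"))

# POS tagging and keyword extraction follow the same pattern.
pos_tok = jieba.posseg.POSTokenizer(tok)              # wraps an existing Tokenizer
tfidf = jieba.analyse.TFIDF()                         # idf_path=None uses the bundled idf.txt
textrank = jieba.analyse.TextRank()
```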

Tests:
* added test_lock.py to test multithreaded locking
* demo.py now contains most of the examples in the README
Dingyuan Wang 2015-05-09 21:29:05 +08:00
parent e359d08964
commit 94840a734c
9 changed files with 1079 additions and 815 deletions

README.md

@ -45,17 +45,19 @@ http://jiebademo.ap01.aws.af.cm/
Main Functions
=======
1) Word Segmentation
1. Word Segmentation
--------
* The `jieba.cut` method accepts three input parameters: the string to be segmented; the `cut_all` parameter, which controls whether full mode is used; and the `HMM` parameter, which controls whether the HMM model is used
* The `jieba.cut_for_search` method accepts two parameters: the string to be segmented, and whether to use the HMM model. This method gives a finer-grained segmentation suitable for building inverted indexes in search engines
* The string to be segmented can be a unicode string, or a UTF-8 or GBK encoded string. Note: passing a GBK string directly is not recommended, because it may be unpredictably misdecoded as UTF-8
* `jieba.cut` and `jieba.cut_for_search` both return an iterable generator; you can use a for loop to get every word (unicode) of the segmentation, or use list(jieba.cut(...)) to convert it into a list
* `jieba.cut` and `jieba.cut_for_search` both return an iterable generator; you can use a for loop to get every word (unicode) of the segmentation, or use
* `jieba.lcut` and `jieba.lcut_for_search`, which return a list directly
* `jieba.Tokenizer(dictionary=DEFAULT_DICT)` creates a new customized tokenizer, which makes it possible to use different dictionaries at the same time. `jieba.dt` is the default tokenizer, and all global segmentation functions are mappings of this tokenizer's methods.
Code example (word segmentation)
Code example
```python
#encoding=utf-8
# encoding=utf-8
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
@ -81,7 +83,7 @@ print(", ".join(seg_list))
【搜索引擎模式】: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
2) Add a custom dictionary
2. Add a custom dictionary
----------------
### Load a dictionary
@ -91,6 +93,8 @@ print(", ".join(seg_list))
* The dictionary format is the same as that of `dict.txt`: one word per line; each line has three parts: the word, the word frequency (optional), and the POS tag (optional), separated by spaces
* The word frequency can be omitted; a frequency that guarantees the word can be segmented out is calculated automatically
* Change a tokenizer's tmp_dir and cache_file attributes to specify the location of the cache file, for use on restricted file systems.
* Example:
* Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
@ -128,12 +132,18 @@ print(", ".join(seg_list))
* "通过用户自定义词典来增强歧义纠错能力" --- https://github.com/fxsjy/jieba/issues/14
3) Keyword Extraction
3. Keyword Extraction
-------------
* jieba.analyse.extract_tags(sentence,topK,withWeight) # requires `import jieba.analyse` first
* sentence: the text to extract keywords from
* topK: the number of keywords with the highest TF/IDF weights to return; the default is 20
* withWeight: whether to also return the keyword weights; the default is False
### Keyword extraction based on the TF-IDF algorithm
`import jieba.analyse`
* jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
* sentence: the text to extract keywords from
* topK: the number of keywords with the highest TF/IDF weights to return; the default is 20
* withWeight: whether to also return the keyword weights; the default is False
* allowPOS: only include words with the specified POS tags; the default is empty, i.e. no filtering
* jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path is the IDF frequency file
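For illustration, a minimal sketch of the extract_tags call described above (the sample sentence is the one used in test/demo.py):

```python
import jieba.analyse

s = "此外公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元增资后吉林欧亚置业注册资本由7000万元增加到5亿元。"
# Default: top 20 keywords by TF-IDF weight; here we take 5 and filter to nouns/verbs.
print(jieba.analyse.extract_tags(s, topK=5, allowPOS=('ns', 'n', 'vn', 'v')))
# With withWeight=True each item becomes a (keyword, weight) pair.
print(jieba.analyse.extract_tags(s, topK=5, withWeight=True))
```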
Code example (keyword extraction)
@ -155,37 +165,27 @@ https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
* Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_with_weight.py
#### Keyword extraction implementation based on the TextRank algorithm
### Keyword extraction based on the TextRank algorithm
* jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) can be used directly; the interface is the same as extract_tags, but note that it filters POS tags by default.
* jieba.analyse.TextRank() creates a new custom TextRank instance
Algorithm paper: [TextRank: Bringing Order into Texts](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)
##### Basic idea:
#### Basic idea:
1. Segment the text from which keywords are to be extracted
2. Build a graph from the co-occurrence relations between words within a fixed window size (I chose 5; it can be adjusted)
2. Build a graph from the co-occurrence relations between words within a fixed window size (default 5, adjustable via the span attribute)
3. Compute the PageRank of the nodes in the graph; note that the graph is undirected and weighted
##### Basic usage:
jieba.analyse.textrank(raw_text)
#### Usage example:
##### Example result:
Example result from `__main__`:
See [test/demo.py](https://github.com/fxsjy/jieba/blob/master/test/demo.py)
```
吉林 1.0
欧亚 0.864834432786
置业 0.553465925497
实现 0.520660869531
收入 0.379699688954
增资 0.355086023683
子公司 0.349758490263
全资 0.308537396283
城市 0.306103738053
商业 0.304837414946
```
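For illustration, a minimal sketch of the textrank interface described above (same sample sentence as test/demo.py; its output weights are not reproduced here):

```python
import jieba.analyse

s = "此外公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元增资后吉林欧亚置业注册资本由7000万元增加到5亿元。"
# Same interface as extract_tags; note the default POS filter ('ns', 'n', 'vn', 'v').
for word, weight in jieba.analyse.textrank(s, topK=10, withWeight=True):
    print('%s %s' % (word, weight))
```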
4) : Part-of-Speech Tagging
4. Part-of-Speech Tagging
-----------
* Tags the POS of each word after segmenting the sentence, using a tag set compatible with ictclas
* `jieba.posseg.POSTokenizer(tokenizer=None)` creates a new customized tokenizer; the `tokenizer` parameter specifies the internal `jieba.Tokenizer` to use. `jieba.posseg.dt` is the default POS tagging tokenizer.
* Tags the POS of each word after segmenting the sentence, using a tag set compatible with ictclas.
* Usage example
```pycon
@ -200,10 +200,10 @@ jieba.analyse.textrank(raw_text)
天安门 ns
```
5) : Parallel Segmentation
5. Parallel Segmentation
-----------
* Principle: split the target text by line, assign the lines to multiple python processes to be segmented in parallel, then merge the results, which gives a considerable speedup
* Based on the multiprocessing module bundled with python; windows is currently not supported
* Principle: split the target text by line, assign the lines to multiple Python processes to be segmented in parallel, then merge the results, which gives a considerable speedup
* Based on the multiprocessing module bundled with python; Windows is currently not supported
* Usage:
* `jieba.enable_parallel(4)` # enable parallel segmentation mode; the argument is the number of parallel processes
* `jieba.disable_parallel()` # disable parallel segmentation mode
@ -212,8 +212,9 @@ jieba.analyse.textrank(raw_text)
* Experimental result: on a 4-core 3.4GHz Linux machine, accurate segmentation of the complete works of Jin Yong reached a speed of 1MB/s, 3.3 times that of the single-process version.
* **Note**: parallel segmentation only supports the default tokenizers `jieba.dt` and `jieba.posseg.dt`
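For illustration, a minimal sketch of parallel mode as described above (the input file name is a placeholder):

```python
import sys
import time
import jieba

jieba.enable_parallel(4)   # 4 worker processes; only jieba.dt / jieba.posseg.dt are supported

content = open(sys.argv[1], 'rb').read().decode('utf-8')   # placeholder: any large UTF-8 text file
t1 = time.time()
words = '/ '.join(jieba.cut(content))
print('speed: %s characters/second' % (len(content) / (time.time() - t1)))

jieba.disable_parallel()
```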
6) : Tokenize: return the start positions of words in the original text
6. Tokenize: return the start and end positions of words in the original text
----------------------------------
* Note that the input only accepts unicode
* Default mode
@ -235,7 +236,7 @@ word 有限公司 start: 6 end:10
* Search mode
```python
result = jieba.tokenize(u'永和服装饰品有限公司',mode='search')
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
```
@ -250,15 +251,15 @@ word 有限公司 start: 6 end:10
```
7) : ChineseAnalyzer for the Whoosh search engine
7. ChineseAnalyzer for the Whoosh search engine
--------------------------------------------
* Import: `from jieba.analyse import ChineseAnalyzer`
* Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
8) : Command-Line Segmentation
8. Command-Line Segmentation
-------------------
Usage example: `cat news.txt | python -m jieba > cut_result.txt`
Usage example: `python -m jieba news.txt > cut_result.txt`
Command-line options (translated):
@ -310,10 +311,10 @@ word 有限公司 start: 6 end:10
If no filename specified, use STDIN instead.
Change of the module initialization mechanism: lazy load (since version 0.28)
-------------------------------------------
Lazy Loading
------------
jieba uses lazy loading: "import jieba" does not immediately trigger the loading of the dictionary; the dictionary is loaded and the prefix dictionary built only when it becomes necessary. If you want to initialize jieba by hand, you can also do so manually.
jieba uses lazy loading: `import jieba` and `jieba.Tokenizer()` do not immediately trigger the loading of the dictionary; the dictionary is loaded and the prefix dictionary built only when it becomes necessary. If you want to initialize jieba by hand, you can also do so manually.
import jieba
jieba.initialize() # 手动初始化(可选)
@ -460,12 +461,15 @@ Algorithm
Main Functions
==============
1) : Cut
1. Cut
--------
* The `jieba.cut` function accepts three input parameters: the first parameter is the string to be cut; the second parameter is `cut_all`, controlling the cut mode; the third parameter is to control whether to use the Hidden Markov Model.
* `jieba.cut_for_search` accepts two parameters: the string to be cut; whether to use the Hidden Markov Model. This will cut the sentence into short words suitable for search engines.
* The input string can be a unicode/str object, or a str/bytes object encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
* `jieba.cut` and `jieba.cut_for_search` return a generator, from which you can use a `for` loop to get the segmentation result (in unicode), or `list(jieba.cut( ... ))` to create a list.
* `jieba.cut` and `jieba.cut_for_search` return a generator, from which you can use a `for` loop to get the segmentation result (in unicode).
* `jieba.lcut` and `jieba.lcut_for_search` return a list.
* `jieba.Tokenizer(dictionary=DEFAULT_DICT)` creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. `jieba.dt` is the default Tokenizer, to which almost all global functions are mapped.
**Code example: segmentation**
@ -497,7 +501,7 @@ Output:
[Search Engine Mode] 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
2) : Add a custom dictionary
2. Add a custom dictionary
----------------------------
### Load dictionary
@ -505,6 +509,9 @@ Output:
* Developers can specify their own custom dictionary to be included in the jieba default dictionary. Jieba is able to identify new words, but adding your own new words can ensure a higher accuracy.
* Usage `jieba.load_userdict(file_name) # file_name is the path of the custom dictionary`
* The dictionary format is the same as that of `analyse/idf.txt`: one word per line; each line is divided into two parts, the first is the word itself, the other is the word frequency, separated by a space
* The word frequency can be omitted; a calculated value will be used instead.
* Change a Tokenizer's `tmp_dir` and `cache_file` to specify the path of the cache file, for using on a restricted file system.
* Example
云计算 5
@ -540,12 +547,16 @@ Example:
「/台中/」/正确/应该/不会/被/切开
```
3) : Keyword Extraction
3. Keyword Extraction
-----------------------
* `jieba.analyse.extract_tags(sentence,topK,withWeight) # needs to first import jieba.analyse`
* `sentence`: the text to be extracted
* `topK`: return how many keywords with the highest TF/IDF weights. The default value is 20
* `withWeight`: whether to return TF/IDF weights together with the keywords. The default value is False
`import jieba.analyse`
* `jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())`
* `sentence`: the text to be extracted
* `topK`: return how many keywords with the highest TF/IDF weights. The default value is 20
* `withWeight`: whether to return TF/IDF weights together with the keywords. The default value is False
* `allowPOS`: only include words with the specified POS tags. An empty tuple means no filtering.
* `jieba.analyse.TFIDF(idf_path=None)` creates a new TFIDF instance, `idf_path` specifies IDF file path.
Example (keyword extraction)
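For illustration, a hedged sketch of using a dedicated TFIDF instance instead of the module-level function (the custom IDF file path is a placeholder):

```python
import jieba.analyse

# Module-level call, served by the default instance:
print(jieba.analyse.extract_tags("小明硕士毕业于中国科学院计算所,后在日本京都大学深造", topK=5))

# A separate TFIDF instance with its own IDF corpus (path is a placeholder):
my_tfidf = jieba.analyse.TFIDF(idf_path='my_idf.txt')
print(my_tfidf.extract_tags("小明硕士毕业于中国科学院计算所,后在日本京都大学深造", topK=5, withWeight=True))
```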
@ -565,10 +576,15 @@ Developers can specify their own custom stop words corpus in jieba keyword extra
There's also a [TextRank](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) implementation available.
Use: `jieba.analyse.textrank(raw_text)`.
Use: `jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))`
4) : Part of Speech Tagging
-----------
Note that it filters POS by default.
`jieba.analyse.TextRank()` creates a new TextRank instance.
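For illustration, a hedged sketch of a custom TextRank instance (span=5 is just the documented default window size):

```python
import jieba.analyse

tr = jieba.analyse.TextRank()
tr.span = 5   # co-occurrence window size; 5 is the default
print(tr.textrank("小明硕士毕业于中国科学院计算所,后在日本京都大学深造",
                  topK=10, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')))
```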
4. Part of Speech Tagging
-------------------------
* `jieba.posseg.POSTokenizer(tokenizer=None)` creates a new customized Tokenizer. `tokenizer` specifies the `jieba.Tokenizer` to use internally. `jieba.posseg.dt` is the default POSTokenizer.
* Tags the POS of each word after segmentation, using labels compatible with ictclas.
* Example:
@ -584,8 +600,8 @@ Use: `jieba.analyse.textrank(raw_text)`.
天安门 ns
```
5) : Parallel Processing
-----------
5. Parallel Processing
----------------------
* Principle: Split the target text by line, assign the lines to multiple Python processes, and then merge the results, which is considerably faster.
* Based on the multiprocessing module of Python.
* Usage:
@ -597,8 +613,10 @@ Use: `jieba.analyse.textrank(raw_text)`.
* Result: On a four-core 3.4GHz Linux machine, do accurate word segmentation on Complete Works of Jin Yong, and the speed reaches 1MB/s, which is 3.3 times faster than the single-process version.
6) : Tokenize: return words with position
----------------------------------
* **Note** that parallel processing supports only default tokenizers, `jieba.dt` and `jieba.posseg.dt`.
6. Tokenize: return words with position
----------------------------------------
* The input must be unicode
* Default mode
@ -634,13 +652,13 @@ word 有限公司 start: 6 end:10
```
7) : ChineseAnalyzer for Whoosh
--------------------------------------------
7. ChineseAnalyzer for Whoosh
-------------------------------
* `from jieba.analyse import ChineseAnalyzer`
* Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
8) : Command Line Interface
-------------------
8. Command Line Interface
--------------------------------
$> python -m jieba --help
usage: python -m jieba [options] filename
@ -679,7 +697,8 @@ You can also specify the dictionary (not supported before version 0.28) :
Using Other Dictionaries
========
===========================
It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:
1. A smaller dictionary for a smaller memory footprint:

File diff suppressed because it is too large.

jieba/analyse/__init__.py

@ -1,103 +1,18 @@
#encoding=utf-8
from __future__ import absolute_import
import jieba
import jieba.posseg
import os
from operator import itemgetter
from .textrank import textrank
from .tfidf import TFIDF
from .textrank import TextRank
try:
from .analyzer import ChineseAnalyzer
except ImportError:
pass
_curpath = os.path.normpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
abs_path = os.path.join(_curpath, "idf.txt")
default_tfidf = TFIDF()
default_textrank = TextRank()
STOP_WORDS = set((
"the","of","is","and","to","in","that","we","for","an","are",
"by","be","as","on","with","can","if","from","which","you","it",
"this","then","at","have","all","not","one","has","or","that"
))
class IDFLoader:
def __init__(self):
self.path = ""
self.idf_freq = {}
self.median_idf = 0.0
def set_new_path(self, new_idf_path):
if self.path != new_idf_path:
content = open(new_idf_path, 'rb').read().decode('utf-8')
idf_freq = {}
lines = content.rstrip('\n').split('\n')
for line in lines:
word, freq = line.split(' ')
idf_freq[word] = float(freq)
median_idf = sorted(idf_freq.values())[len(idf_freq)//2]
self.idf_freq = idf_freq
self.median_idf = median_idf
self.path = new_idf_path
def get_idf(self):
return self.idf_freq, self.median_idf
idf_loader = IDFLoader()
idf_loader.set_new_path(abs_path)
def set_idf_path(idf_path):
new_abs_path = os.path.normpath(os.path.join(os.getcwd(), idf_path))
if not os.path.exists(new_abs_path):
raise Exception("jieba: path does not exist: " + new_abs_path)
idf_loader.set_new_path(new_abs_path)
extract_tags = tfidf = default_tfidf.extract_tags
set_idf_path = default_tfidf.set_idf_path
textrank = default_textrank.extract_tags
def set_stop_words(stop_words_path):
global STOP_WORDS
abs_path = os.path.normpath(os.path.join(os.getcwd(), stop_words_path))
if not os.path.exists(abs_path):
raise Exception("jieba: path does not exist: " + abs_path)
content = open(abs_path,'rb').read().decode('utf-8')
lines = content.replace("\r", "").split('\n')
for line in lines:
STOP_WORDS.add(line)
def extract_tags(sentence, topK=20, withWeight=False, allowPOS=[]):
"""
Extract keywords from sentence using TF-IDF algorithm.
Parameter:
- topK: return how many top keywords. `None` for all possible words.
- withWeight: if True, return a list of (word, weight);
if False, return a list of words.
- allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr'].
if the POS of w is not in this list,it will be filtered.
"""
global STOP_WORDS, idf_loader
idf_freq, median_idf = idf_loader.get_idf()
if allowPOS:
allowPOS = frozenset(allowPOS)
words = jieba.posseg.cut(sentence)
else:
words = jieba.cut(sentence)
freq = {}
for w in words:
if allowPOS:
if w.flag not in allowPOS:
continue
else:
w = w.word
if len(w.strip()) < 2 or w.lower() in STOP_WORDS:
continue
freq[w] = freq.get(w, 0.0) + 1.0
total = sum(freq.values())
for k in freq:
freq[k] *= idf_freq.get(k, median_idf) / total
if withWeight:
tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
else:
tags = sorted(freq, key=freq.__getitem__, reverse=True)
if topK:
return tags[:topK]
else:
return tags
default_tfidf.set_stop_words(stop_words_path)
default_textrank.set_stop_words(stop_words_path)

jieba/analyse/analyzer.py

@ -1,7 +1,7 @@
#encoding=utf-8
# encoding=utf-8
from __future__ import unicode_literals
from whoosh.analysis import RegexAnalyzer,LowercaseFilter,StopFilter,StemFilter
from whoosh.analysis import Tokenizer,Token
from whoosh.analysis import RegexAnalyzer, LowercaseFilter, StopFilter, StemFilter
from whoosh.analysis import Tokenizer, Token
from whoosh.lang.porter import stem
import jieba
@ -15,12 +15,14 @@ STOP_WORDS = frozenset(('a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'can',
accepted_chars = re.compile(r"[\u4E00-\u9FA5]+")
class ChineseTokenizer(Tokenizer):
def __call__(self, text, **kargs):
words = jieba.tokenize(text, mode="search")
token = Token()
for (w,start_pos,stop_pos) in words:
if not accepted_chars.match(w) and len(w)<=1:
for (w, start_pos, stop_pos) in words:
if not accepted_chars.match(w) and len(w) <= 1:
continue
token.original = token.text = w
token.pos = start_pos
@ -28,7 +30,8 @@ class ChineseTokenizer(Tokenizer):
token.endchar = stop_pos
yield token
def ChineseAnalyzer(stoplist=STOP_WORDS, minsize=1, stemfn=stem, cachesize=50000):
return (ChineseTokenizer() | LowercaseFilter() |
StopFilter(stoplist=stoplist,minsize=minsize) |
StemFilter(stemfn=stemfn, ignore=None,cachesize=cachesize))
StopFilter(stoplist=stoplist, minsize=minsize) |
StemFilter(stemfn=stemfn, ignore=None, cachesize=cachesize))

jieba/analyse/textrank.py

@ -3,9 +3,10 @@
from __future__ import absolute_import, unicode_literals
import sys
import collections
from operator import itemgetter
import jieba.posseg as pseg
from collections import defaultdict
import jieba.posseg
from .tfidf import KeywordExtractor
from .._compat import *
@ -13,7 +14,7 @@ class UndirectWeightedGraph:
d = 0.85
def __init__(self):
self.graph = collections.defaultdict(list)
self.graph = defaultdict(list)
def addEdge(self, start, end, weight):
# use a tuple (start, end, weight) instead of a Edge object
@ -21,8 +22,8 @@ class UndirectWeightedGraph:
self.graph[end].append((end, start, weight))
def rank(self):
ws = collections.defaultdict(float)
outSum = collections.defaultdict(float)
ws = defaultdict(float)
outSum = defaultdict(float)
wsdef = 1.0 / (len(self.graph) or 1.0)
for n, out in self.graph.items():
@ -53,43 +54,51 @@ class UndirectWeightedGraph:
return ws
def textrank(sentence, topK=10, withWeight=False, allowPOS=['ns', 'n', 'vn', 'v']):
"""
Extract keywords from sentence using TextRank algorithm.
Parameter:
- topK: return how many top keywords. `None` for all possible words.
- withWeight: if True, return a list of (word, weight);
if False, return a list of words.
- allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
if the POS of w is not in this list,it will be filtered.
"""
pos_filt = frozenset(allowPOS)
g = UndirectWeightedGraph()
cm = collections.defaultdict(int)
span = 5
words = list(pseg.cut(sentence))
for i in xrange(len(words)):
if words[i].flag in pos_filt:
for j in xrange(i + 1, i + span):
if j >= len(words):
break
if words[j].flag not in pos_filt:
continue
cm[(words[i].word, words[j].word)] += 1
class TextRank(KeywordExtractor):
for terms, w in cm.items():
g.addEdge(terms[0], terms[1], w)
nodes_rank = g.rank()
if withWeight:
tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
else:
tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)
if topK:
return tags[:topK]
else:
return tags
def __init__(self):
self.tokenizer = self.postokenizer = jieba.posseg.dt
self.stop_words = self.STOP_WORDS.copy()
self.pos_filt = frozenset(('ns', 'n', 'vn', 'v'))
self.span = 5
if __name__ == '__main__':
s = "此外公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元增资后吉林欧亚置业注册资本由7000万元增加到5亿元。吉林欧亚置业主要经营范围为房地产开发及百货零售等业务。目前在建吉林欧亚城市商业综合体项目。2013年实现营业收入0万元实现净利润-139.13万元。"
for x, w in textrank(s, withWeight=True):
print('%s %s' % (x, w))
def pairfilter(self, wp):
return (wp.flag in self.pos_filt and len(wp.word.strip()) >= 2
and wp.word.lower() not in self.stop_words)
def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')):
"""
Extract keywords from sentence using TextRank algorithm.
Parameter:
- topK: return how many top keywords. `None` for all possible words.
- withWeight: if True, return a list of (word, weight);
if False, return a list of words.
- allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
if the POS of w is not in this list, it will be filtered.
"""
self.pos_filt = frozenset(allowPOS)
g = UndirectWeightedGraph()
cm = defaultdict(int)
words = tuple(self.tokenizer.cut(sentence))
for i, wp in enumerate(words):
if self.pairfilter(wp):
for j in xrange(i + 1, i + self.span):
if j >= len(words):
break
if not self.pairfilter(words[j]):
continue
cm[(wp.word, words[j].word)] += 1
for terms, w in cm.items():
g.addEdge(terms[0], terms[1], w)
nodes_rank = g.rank()
if withWeight:
tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
else:
tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)
if topK:
return tags[:topK]
else:
return tags
extract_tags = textrank

jieba/analyse/tfidf.py Executable file

@ -0,0 +1,111 @@
# encoding=utf-8
from __future__ import absolute_import
import os
import jieba
import jieba.posseg
from operator import itemgetter
_get_module_path = lambda path: os.path.normpath(os.path.join(os.getcwd(),
os.path.dirname(__file__), path))
_get_abs_path = jieba._get_abs_path
DEFAULT_IDF = _get_module_path("idf.txt")
class KeywordExtractor(object):
STOP_WORDS = set((
"the", "of", "is", "and", "to", "in", "that", "we", "for", "an", "are",
"by", "be", "as", "on", "with", "can", "if", "from", "which", "you", "it",
"this", "then", "at", "have", "all", "not", "one", "has", "or", "that"
))
def set_stop_words(self, stop_words_path):
abs_path = _get_abs_path(stop_words_path)
if not os.path.isfile(abs_path):
raise Exception("jieba: file does not exist: " + abs_path)
content = open(abs_path, 'rb').read().decode('utf-8')
for line in content.splitlines():
self.stop_words.add(line)
def extract_tags(self, *args, **kwargs):
raise NotImplementedError
class IDFLoader(object):
def __init__(self, idf_path=None):
self.path = ""
self.idf_freq = {}
self.median_idf = 0.0
if idf_path:
self.set_new_path(idf_path)
def set_new_path(self, new_idf_path):
if self.path != new_idf_path:
self.path = new_idf_path
content = open(new_idf_path, 'rb').read().decode('utf-8')
self.idf_freq = {}
for line in content.splitlines():
word, freq = line.strip().split(' ')
self.idf_freq[word] = float(freq)
self.median_idf = sorted(
self.idf_freq.values())[len(self.idf_freq) // 2]
def get_idf(self):
return self.idf_freq, self.median_idf
class TFIDF(KeywordExtractor):
def __init__(self, idf_path=None):
self.tokenizer = jieba.dt
self.postokenizer = jieba.posseg.dt
self.stop_words = self.STOP_WORDS.copy()
self.idf_loader = IDFLoader(idf_path or DEFAULT_IDF)
self.idf_freq, self.median_idf = self.idf_loader.get_idf()
def set_idf_path(self, idf_path):
new_abs_path = _get_abs_path(idf_path)
if not os.path.isfile(new_abs_path):
raise Exception("jieba: file does not exist: " + new_abs_path)
self.idf_loader.set_new_path(new_abs_path)
self.idf_freq, self.median_idf = self.idf_loader.get_idf()
def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=()):
"""
Extract keywords from sentence using TF-IDF algorithm.
Parameter:
- topK: return how many top keywords. `None` for all possible words.
- withWeight: if True, return a list of (word, weight);
if False, return a list of words.
- allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr'].
if the POS of w is not in this list,it will be filtered.
"""
if allowPOS:
allowPOS = frozenset(allowPOS)
words = self.postokenizer.cut(sentence)
else:
words = self.tokenizer.cut(sentence)
freq = {}
for w in words:
if allowPOS:
if w.flag not in allowPOS:
continue
else:
w = w.word
if len(w.strip()) < 2 or w.lower() in self.stop_words:
continue
freq[w] = freq.get(w, 0.0) + 1.0
total = sum(freq.values())
for k in freq:
freq[k] *= self.idf_freq.get(k, self.median_idf) / total
if withWeight:
tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
else:
tags = sorted(freq, key=freq.__getitem__, reverse=True)
if topK:
return tags[:topK]
else:
return tags

jieba/posseg/__init__.py

@ -1,10 +1,9 @@
from __future__ import absolute_import, unicode_literals
import re
import os
import jieba
import re
import sys
import jieba
import marshal
from functools import wraps
from .._compat import *
from .viterbi import viterbi
@ -24,23 +23,10 @@ re_num = re.compile("[\.0-9]+")
re_eng1 = re.compile('^[a-zA-Z0-9]$', re.U)
def load_model(f_name, isJython=True):
def load_model(f_name):
_curpath = os.path.normpath(
os.path.join(os.getcwd(), os.path.dirname(__file__)))
result = {}
with open(f_name, "rb") as f:
for line in f:
line = line.strip()
if not line:
continue
line = line.decode("utf-8")
word, _, tag = line.split(" ")
result[word] = tag
if not isJython:
return result
# For Jython
start_p = {}
abs_path = os.path.join(_curpath, PROB_START_P)
with open(abs_path, 'rb') as f:
@ -64,29 +50,15 @@ def load_model(f_name, isJython=True):
return state, start_p, trans_p, emit_p, result
if sys.platform.startswith("java"):
char_state_tab_P, start_P, trans_P, emit_P, word_tag_tab = load_model(
jieba.get_abs_path_dict())
char_state_tab_P, start_P, trans_P, emit_P, word_tag_tab = load_model()
else:
from .char_state_tab import P as char_state_tab_P
from .prob_start import P as start_P
from .prob_trans import P as trans_P
from .prob_emit import P as emit_P
word_tag_tab = load_model(jieba.get_abs_path_dict(), isJython=False)
def makesure_userdict_loaded(fn):
@wraps(fn)
def wrapped(*args, **kwargs):
if jieba.user_word_tag_tab:
word_tag_tab.update(jieba.user_word_tag_tab)
jieba.user_word_tag_tab = {}
return fn(*args, **kwargs)
return wrapped
class pair(object):
@ -110,154 +82,220 @@ class pair(object):
return self.__unicode__().encode(arg)
def __cut(sentence):
prob, pos_list = viterbi(
sentence, char_state_tab_P, start_P, trans_P, emit_P)
begin, nexti = 0, 0
class POSTokenizer(object):
for i, char in enumerate(sentence):
pos = pos_list[i][0]
if pos == 'B':
begin = i
elif pos == 'E':
yield pair(sentence[begin:i + 1], pos_list[i][1])
nexti = i + 1
elif pos == 'S':
yield pair(char, pos_list[i][1])
nexti = i + 1
if nexti < len(sentence):
yield pair(sentence[nexti:], pos_list[nexti][1])
def __init__(self, tokenizer=None):
self.tokenizer = tokenizer or jieba.Tokenizer()
self.load_word_tag(self.tokenizer.get_abs_path_dict())
def __repr__(self):
return '<POSTokenizer tokenizer=%r>' % self.tokenizer
def __cut_detail(sentence):
blocks = re_han_detail.split(sentence)
for blk in blocks:
if re_han_detail.match(blk):
for word in __cut(blk):
yield word
else:
tmp = re_skip_detail.split(blk)
for x in tmp:
if x:
if re_num.match(x):
yield pair(x, 'm')
elif re_eng.match(x):
yield pair(x, 'eng')
else:
yield pair(x, 'x')
def __getattr__(self, name):
if name in ('cut_for_search', 'lcut_for_search', 'tokenize'):
# may be possible?
raise NotImplementedError
return getattr(self.tokenizer, name)
def initialize(self, dictionary=None):
self.tokenizer.initialize(dictionary)
self.load_word_tag(self.tokenizer.get_abs_path_dict())
def __cut_DAG_NO_HMM(sentence):
DAG = jieba.get_DAG(sentence)
route = {}
jieba.calc(sentence, DAG, route)
x = 0
N = len(sentence)
buf = ''
while x < N:
y = route[x][1] + 1
l_word = sentence[x:y]
if re_eng1.match(l_word):
buf += l_word
x = y
else:
if buf:
yield pair(buf, 'eng')
buf = ''
yield pair(l_word, word_tag_tab.get(l_word, 'x'))
x = y
if buf:
yield pair(buf, 'eng')
buf = ''
def load_word_tag(self, f_name):
self.word_tag_tab = {}
with open(f_name, "rb") as f:
for lineno, line in enumerate(f, 1):
try:
line = line.strip().decode("utf-8")
if not line:
continue
word, _, tag = line.split(" ")
self.word_tag_tab[word] = tag
except Exception:
raise ValueError(
'invalid POS dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
def makesure_userdict_loaded(self):
if self.tokenizer.user_word_tag_tab:
self.word_tag_tab.update(self.tokenizer.user_word_tag_tab)
self.tokenizer.user_word_tag_tab = {}
def __cut_DAG(sentence):
DAG = jieba.get_DAG(sentence)
route = {}
def __cut(self, sentence):
prob, pos_list = viterbi(
sentence, char_state_tab_P, start_P, trans_P, emit_P)
begin, nexti = 0, 0
jieba.calc(sentence, DAG, route)
for i, char in enumerate(sentence):
pos = pos_list[i][0]
if pos == 'B':
begin = i
elif pos == 'E':
yield pair(sentence[begin:i + 1], pos_list[i][1])
nexti = i + 1
elif pos == 'S':
yield pair(char, pos_list[i][1])
nexti = i + 1
if nexti < len(sentence):
yield pair(sentence[nexti:], pos_list[nexti][1])
x = 0
buf = ''
N = len(sentence)
while x < N:
y = route[x][1] + 1
l_word = sentence[x:y]
if y - x == 1:
buf += l_word
else:
if buf:
if len(buf) == 1:
yield pair(buf, word_tag_tab.get(buf, 'x'))
elif not jieba.FREQ.get(buf):
recognized = __cut_detail(buf)
for t in recognized:
yield t
else:
for elem in buf:
yield pair(elem, word_tag_tab.get(elem, 'x'))
buf = ''
yield pair(l_word, word_tag_tab.get(l_word, 'x'))
x = y
if buf:
if len(buf) == 1:
yield pair(buf, word_tag_tab.get(buf, 'x'))
elif not jieba.FREQ.get(buf):
recognized = __cut_detail(buf)
for t in recognized:
yield t
else:
for elem in buf:
yield pair(elem, word_tag_tab.get(elem, 'x'))
def __cut_internal(sentence, HMM=True):
sentence = strdecode(sentence)
blocks = re_han_internal.split(sentence)
if HMM:
__cut_blk = __cut_DAG
else:
__cut_blk = __cut_DAG_NO_HMM
for blk in blocks:
if re_han_internal.match(blk):
for word in __cut_blk(blk):
yield word
else:
tmp = re_skip_internal.split(blk)
for x in tmp:
if re_skip_internal.match(x):
yield pair(x, 'x')
else:
for xx in x:
if re_num.match(xx):
yield pair(xx, 'm')
def __cut_detail(self, sentence):
blocks = re_han_detail.split(sentence)
for blk in blocks:
if re_han_detail.match(blk):
for word in self.__cut(blk):
yield word
else:
tmp = re_skip_detail.split(blk)
for x in tmp:
if x:
if re_num.match(x):
yield pair(x, 'm')
elif re_eng.match(x):
yield pair(xx, 'eng')
yield pair(x, 'eng')
else:
yield pair(xx, 'x')
yield pair(x, 'x')
def __cut_DAG_NO_HMM(self, sentence):
DAG = self.tokenizer.get_DAG(sentence)
route = {}
self.tokenizer.calc(sentence, DAG, route)
x = 0
N = len(sentence)
buf = ''
while x < N:
y = route[x][1] + 1
l_word = sentence[x:y]
if re_eng1.match(l_word):
buf += l_word
x = y
else:
if buf:
yield pair(buf, 'eng')
buf = ''
yield pair(l_word, self.word_tag_tab.get(l_word, 'x'))
x = y
if buf:
yield pair(buf, 'eng')
buf = ''
def __cut_DAG(self, sentence):
DAG = self.tokenizer.get_DAG(sentence)
route = {}
self.tokenizer.calc(sentence, DAG, route)
x = 0
buf = ''
N = len(sentence)
while x < N:
y = route[x][1] + 1
l_word = sentence[x:y]
if y - x == 1:
buf += l_word
else:
if buf:
if len(buf) == 1:
yield pair(buf, self.word_tag_tab.get(buf, 'x'))
elif not self.tokenizer.FREQ.get(buf):
recognized = self.__cut_detail(buf)
for t in recognized:
yield t
else:
for elem in buf:
yield pair(elem, self.word_tag_tab.get(elem, 'x'))
buf = ''
yield pair(l_word, self.word_tag_tab.get(l_word, 'x'))
x = y
if buf:
if len(buf) == 1:
yield pair(buf, self.word_tag_tab.get(buf, 'x'))
elif not self.tokenizer.FREQ.get(buf):
recognized = self.__cut_detail(buf)
for t in recognized:
yield t
else:
for elem in buf:
yield pair(elem, self.word_tag_tab.get(elem, 'x'))
def __cut_internal(self, sentence, HMM=True):
self.makesure_userdict_loaded()
sentence = strdecode(sentence)
blocks = re_han_internal.split(sentence)
if HMM:
cut_blk = self.__cut_DAG
else:
cut_blk = self.__cut_DAG_NO_HMM
for blk in blocks:
if re_han_internal.match(blk):
for word in cut_blk(blk):
yield word
else:
tmp = re_skip_internal.split(blk)
for x in tmp:
if re_skip_internal.match(x):
yield pair(x, 'x')
else:
for xx in x:
if re_num.match(xx):
yield pair(xx, 'm')
elif re_eng.match(x):
yield pair(xx, 'eng')
else:
yield pair(xx, 'x')
def _lcut_internal(self, sentence):
return list(self.__cut_internal(sentence))
def _lcut_internal_no_hmm(self, sentence):
return list(self.__cut_internal(sentence, False))
def cut(self, sentence, HMM=True):
for w in self.__cut_internal(sentence, HMM=HMM):
yield w
def lcut(self, *args, **kwargs):
return list(self.cut(*args, **kwargs))
# default Tokenizer instance
dt = POSTokenizer(jieba.dt)
# global functions
initialize = dt.initialize
def __lcut_internal(sentence):
return list(__cut_internal(sentence))
def _lcut_internal(s):
return dt._lcut_internal(s)
def __lcut_internal_no_hmm(sentence):
return list(__cut_internal(sentence, False))
def _lcut_internal_no_hmm(s):
return dt._lcut_internal_no_hmm(s)
@makesure_userdict_loaded
def cut(sentence, HMM=True):
"""
Global `cut` function that supports parallel processing.
Note that this only works using dt, custom POSTokenizer
instances are not supported.
"""
global dt
if jieba.pool is None:
for w in __cut_internal(sentence, HMM=HMM):
for w in dt.cut(sentence, HMM=HMM):
yield w
else:
parts = strdecode(sentence).splitlines(True)
if HMM:
result = jieba.pool.map(__lcut_internal, parts)
result = jieba.pool.map(_lcut_internal, parts)
else:
result = jieba.pool.map(__lcut_internal_no_hmm, parts)
result = jieba.pool.map(_lcut_internal_no_hmm, parts)
for r in result:
for w in r:
yield w
def lcut(sentence, HMM=True):
return list(cut(sentence, HMM))

test/demo.py

@ -4,6 +4,12 @@ import sys
sys.path.append("../")
import jieba
import jieba.posseg
import jieba.analyse
print('='*40)
print('1. 分词')
print('-'*40)
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list)) # 全模式
@ -16,3 +22,63 @@ print(", ".join(seg_list))
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") # 搜索引擎模式
print(", ".join(seg_list))
print('='*40)
print('2. 添加自定义词典/调整词典')
print('-'*40)
print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
#如果/放到/post/中将/出错/。
print(jieba.suggest_freq(('中', '将'), True))
#494
print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
#如果/放到/post/中/将/出错/。
print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
#「/台/中/」/正确/应该/不会/被/切开
print(jieba.suggest_freq('台中', True))
#69
print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
#「/台中/」/正确/应该/不会/被/切开
print('='*40)
print('3. 关键词提取')
print('-'*40)
print(' TF-IDF')
print('-'*40)
s = "此外公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元增资后吉林欧亚置业注册资本由7000万元增加到5亿元。吉林欧亚置业主要经营范围为房地产开发及百货零售等业务。目前在建吉林欧亚城市商业综合体项目。2013年实现营业收入0万元实现净利润-139.13万元。"
for x, w in jieba.analyse.extract_tags(s, withWeight=True):
print('%s %s' % (x, w))
print('-'*40)
print(' TextRank')
print('-'*40)
for x, w in jieba.analyse.textrank(s, withWeight=True):
print('%s %s' % (x, w))
print('='*40)
print('4. 词性标注')
print('-'*40)
words = jieba.posseg.cut("我爱北京天安门")
for w in words:
print('%s %s' % (w.word, w.flag))
print('='*40)
print('6. Tokenize: 返回词语在原文的起止位置')
print('-'*40)
print(' 默认模式')
print('-'*40)
result = jieba.tokenize('永和服装饰品有限公司')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
print('-'*40)
print(' 搜索模式')
print('-'*40)
result = jieba.tokenize('永和服装饰品有限公司', mode='search')
for tk in result:
print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))

test/test_lock.py Normal file

@ -0,0 +1,42 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import jieba
import threading
def inittokenizer(tokenizer, group):
print('===> Thread %s:%s started' % (group, threading.current_thread().ident))
tokenizer.initialize()
print('<=== Thread %s:%s finished' % (group, threading.current_thread().ident))
tokrs1 = [jieba.Tokenizer() for n in range(5)]
tokrs2 = [jieba.Tokenizer('../extra_dict/dict.txt.small') for n in range(5)]
thr1 = [threading.Thread(target=inittokenizer, args=(tokr, 1)) for tokr in tokrs1]
thr2 = [threading.Thread(target=inittokenizer, args=(tokr, 2)) for tokr in tokrs2]
for thr in thr1:
thr.start()
for thr in thr2:
thr.start()
for thr in thr1:
thr.join()
for thr in thr2:
thr.join()
del tokrs1, tokrs2
print('='*40)
tokr1 = jieba.Tokenizer()
tokr2 = jieba.Tokenizer('../extra_dict/dict.txt.small')
thr1 = [threading.Thread(target=inittokenizer, args=(tokr1, 1)) for n in range(5)]
thr2 = [threading.Thread(target=inittokenizer, args=(tokr2, 2)) for n in range(5)]
for thr in thr1:
thr.start()
for thr in thr2:
thr.start()
for thr in thr1:
thr.join()
for thr in thr2:
thr.join()