diff --git a/README.md b/README.md index fd31119..8c0d764 100644 --- a/README.md +++ b/README.md @@ -45,17 +45,19 @@ http://jiebademo.ap01.aws.af.cm/ 主要功能 ======= -1) :分词 +1. 分词 -------- * `jieba.cut` 方法接受三个输入参数: 需要分词的字符串;cut_all 参数用来控制是否采用全模式;HMM 参数用来控制是否使用 HMM 模型 * `jieba.cut_for_search` 方法接受两个参数:需要分词的字符串;是否使用 HMM 模型。该方法适合用于搜索引擎构建倒排索引的分词,粒度比较细 * 待分词的字符串可以是 unicode 或 UTF-8 字符串、GBK 字符串。注意:不建议直接输入 GBK 字符串,可能无法预料地错误解码成 UTF-8 -* `jieba.cut` 以及 `jieba.cut_for_search` 返回的结构都是一个可迭代的 generator,可以使用 for 循环来获得分词后得到的每一个词语(unicode),也可以用 list(jieba.cut(...)) 转化为 list +* `jieba.cut` 以及 `jieba.cut_for_search` 返回的结构都是一个可迭代的 generator,可以使用 for 循环来获得分词后得到的每一个词语(unicode),或者用 +* `jieba.lcut` 以及 `jieba.lcut_for_search` 直接返回 list +* `jieba.Tokenizer(dictionary=DEFAULT_DICT)` 新建自定义分词器,可用于同时使用不同词典。`jieba.dt` 为默认分词器,所有全局分词相关函数都是该分词器的映射。 -代码示例( 分词 ) +代码示例 ```python -#encoding=utf-8 +# encoding=utf-8 import jieba seg_list = jieba.cut("我来到北京清华大学", cut_all=True) @@ -81,7 +83,7 @@ print(", ".join(seg_list)) 【搜索引擎模式】: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造 -2) :添加自定义词典 +2. 添加自定义词典 ---------------- ### 载入词典 @@ -91,6 +93,8 @@ print(", ".join(seg_list)) * 词典格式和`dict.txt`一样,一个词占一行;每一行分三部分,一部分为词语,另一部分为词频(可省略),最后为词性(可省略),用空格隔开 * 词频可省略,使用计算出的能保证分出该词的词频 +* 更改分词器的 tmp_dir 和 cache_file 属性,可指定缓存文件位置,用于受限的文件系统。 + * 范例: * 自定义词典:https://github.com/fxsjy/jieba/blob/master/test/userdict.txt @@ -128,12 +132,18 @@ print(", ".join(seg_list)) * "通过用户自定义词典来增强歧义纠错能力" --- https://github.com/fxsjy/jieba/issues/14 -3) :关键词提取 +3. 关键词提取 ------------- -* jieba.analyse.extract_tags(sentence,topK,withWeight) #需要先 `import jieba.analyse` -* sentence 为待提取的文本 -* topK 为返回几个 TF/IDF 权重最大的关键词,默认值为 20 -* withWeight 为是否一并返回关键词权重值,默认值为 False +### 基于 TF-IDF 算法的关键词抽取 + +`import jieba.analyse` + +* jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=()) + * sentence 为待提取的文本 + * topK 为返回几个 TF/IDF 权重最大的关键词,默认值为 20 + * withWeight 为是否一并返回关键词权重值,默认值为 False + * allowPOS 仅包括指定词性的词,默认值为空,即不筛选 +* jieba.analyse.TFIDF(idf_path=None) 新建 TFIDF 实例,idf_path 为 IDF 频率文件 代码示例 (关键词提取) @@ -155,37 +165,27 @@ https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py * 用法示例:https://github.com/fxsjy/jieba/blob/master/test/extract_tags_with_weight.py -#### 基于TextRank算法的关键词抽取实现 +### 基于 TextRank 算法的关键词抽取 + +* jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) 直接使用,接口相同,注意默认过滤词性。 +* jieba.analyse.TextRank() 新建自定义 TextRank 实例 + 算法论文: [TextRank: Bringing Order into Texts](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -##### 基本思想: +#### 基本思想: 1. 将待抽取关键词的文本进行分词 -2. 以固定窗口大小(我选的5,可适当调整),词之间的共现关系,构建图 +2. 以固定窗口大小(默认为5,通过span属性调整),词之间的共现关系,构建图 3. 计算图中节点的PageRank,注意是无向带权图 -##### 基本使用: -jieba.analyse.textrank(raw_text) +#### 使用示例: -##### 示例结果: -来自`__main__`的示例结果: +见 [test/demo.py](https://github.com/fxsjy/jieba/blob/master/test/demo.py) -``` -吉林 1.0 -欧亚 0.864834432786 -置业 0.553465925497 -实现 0.520660869531 -收入 0.379699688954 -增资 0.355086023683 -子公司 0.349758490263 -全资 0.308537396283 -城市 0.306103738053 -商业 0.304837414946 -``` - -4) : 词性标注 +4. 词性标注 ----------- -* 标注句子分词后每个词的词性,采用和 ictclas 兼容的标记法 +* `jieba.posseg.POSTokenizer(tokenizer=None)` 新建自定义分词器,`tokenizer` 参数可指定内部使用的 `jieba.Tokenizer` 分词器。`jieba.posseg.dt` 为默认词性标注分词器。 +* 标注句子分词后每个词的词性,采用和 ictclas 兼容的标记法。 * 用法示例 ```pycon @@ -200,10 +200,10 @@ jieba.analyse.textrank(raw_text) 天安门 ns ``` -5) : 并行分词 +5. 
并行分词 ----------- -* 原理:将目标文本按行分隔后,把各行文本分配到多个 python 进程并行分词,然后归并结果,从而获得分词速度的可观提升 -* 基于 python 自带的 multiprocessing 模块,目前暂不支持 windows +* 原理:将目标文本按行分隔后,把各行文本分配到多个 Python 进程并行分词,然后归并结果,从而获得分词速度的可观提升 +* 基于 python 自带的 multiprocessing 模块,目前暂不支持 Windows * 用法: * `jieba.enable_parallel(4)` # 开启并行分词模式,参数为并行进程数 * `jieba.disable_parallel()` # 关闭并行分词模式 @@ -212,8 +212,9 @@ jieba.analyse.textrank(raw_text) * 实验结果:在 4 核 3.4GHz Linux 机器上,对金庸全集进行精确分词,获得了 1MB/s 的速度,是单进程版的 3.3 倍。 +* **注意**:并行分词仅支持默认分词器 `jieba.dt` 和 `jieba.posseg.dt`。 -6) : Tokenize:返回词语在原文的起始位置 +6. Tokenize:返回词语在原文的起止位置 ---------------------------------- * 注意,输入参数只接受 unicode * 默认模式 @@ -235,7 +236,7 @@ word 有限公司 start: 6 end:10 * 搜索模式 ```python -result = jieba.tokenize(u'永和服装饰品有限公司',mode='search') +result = jieba.tokenize(u'永和服装饰品有限公司', mode='search') for tk in result: print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2])) ``` @@ -250,15 +251,15 @@ word 有限公司 start: 6 end:10 ``` -7) : ChineseAnalyzer for Whoosh 搜索引擎 +7. ChineseAnalyzer for Whoosh 搜索引擎 -------------------------------------------- * 引用: `from jieba.analyse import ChineseAnalyzer` * 用法示例:https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py -8) : 命令行分词 +8. 命令行分词 ------------------- -使用示例:`cat news.txt | python -m jieba > cut_result.txt` +使用示例:`python -m jieba news.txt > cut_result.txt` 命令行选项(翻译): @@ -310,10 +311,10 @@ word 有限公司 start: 6 end:10 If no filename specified, use STDIN instead. -模块初始化机制的改变:lazy load (从0.28版本开始) -------------------------------------------- +延迟加载机制 +------------ -jieba 采用延迟加载,"import jieba" 不会立即触发词典的加载,一旦有必要才开始加载词典构建前缀字典。如果你想手工初始 jieba,也可以手动初始化。 +jieba 采用延迟加载,`import jieba` 和 `jieba.Tokenizer()` 不会立即触发词典的加载,一旦有必要才开始加载词典构建前缀字典。如果你想手工初始 jieba,也可以手动初始化。 import jieba jieba.initialize() # 手动初始化(可选) @@ -460,12 +461,15 @@ Algorithm Main Functions ============== -1) : Cut +1. Cut -------- * The `jieba.cut` function accepts three input parameters: the first parameter is the string to be cut; the second parameter is `cut_all`, controlling the cut mode; the third parameter is to control whether to use the Hidden Markov Model. * `jieba.cut_for_search` accepts two parameter: the string to be cut; whether to use the Hidden Markov Model. This will cut the sentence into short words suitable for search engines. * The input string can be an unicode/str object, or a str/bytes object which is encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectly decoded as UTF-8. -* `jieba.cut` and `jieba.cut_for_search` returns an generator, from which you can use a `for` loop to get the segmentation result (in unicode), or `list(jieba.cut( ... ))` to create a list. +* `jieba.cut` and `jieba.cut_for_search` returns an generator, from which you can use a `for` loop to get the segmentation result (in unicode). +* `jieba.lcut` and `jieba.lcut_for_search` returns a list. +* `jieba.Tokenizer(dictionary=DEFAULT_DICT)` creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. `jieba.dt` is the default Tokenizer, to which almost all global functions are mapped. + **Code example: segmentation** @@ -497,7 +501,7 @@ Output: [Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造 -2) : Add a custom dictionary +2. Add a custom dictionary ---------------------------- ### Load dictionary @@ -505,6 +509,9 @@ Output: * Developers can specify their own custom dictionary to be included in the jieba default dictionary. 
Jieba is able to identify new words, but adding your own new words can ensure a higher accuracy. * Usage: `jieba.load_userdict(file_name) # file_name is the path of the custom dictionary` * The dictionary format is the same as that of `analyse/idf.txt`: one word per line; each line is divided into two parts, the first is the word itself, the other is the word frequency, separated by a space +* The word frequency can be omitted, then a calculated value will be used. +* Change a Tokenizer's `tmp_dir` and `cache_file` to specify the path of the cache file, for using on a restricted file system. + * Example: 云计算 5 @@ -540,12 +547,16 @@ Example: 「/台中/」/正确/应该/不会/被/切开 ``` -3) : Keyword Extraction +3. Keyword Extraction ----------------------- -* `jieba.analyse.extract_tags(sentence,topK,withWeight) # needs to first import jieba.analyse` -* `sentence`: the text to be extracted -* `topK`: return how many keywords with the highest TF/IDF weights. The default value is 20 -* `withWeight`: whether return TF/IDF weights with the keywords. The default value is False +`import jieba.analyse` + +* `jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())` + * `sentence`: the text to be extracted + * `topK`: return how many keywords with the highest TF/IDF weights. The default value is 20 + * `withWeight`: whether return TF/IDF weights with the keywords. The default value is False + * `allowPOS`: filter words with which POSs are included. Empty for no filtering. +* `jieba.analyse.TFIDF(idf_path=None)` creates a new TFIDF instance, `idf_path` specifies IDF file path. Example (keyword extraction) @@ -565,10 +576,15 @@ Developers can specify their own custom stop words corpus in jieba keyword extra There's also a [TextRank](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) implementation available. -Use: `jieba.analyse.textrank(raw_text)`. +Use: `jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))` -4) : Part of Speech Tagging ------------ +Note that it filters POS by default. + +`jieba.analyse.TextRank()` creates a new TextRank instance. + +4. Part of Speech Tagging +------------------------- +* `jieba.posseg.POSTokenizer(tokenizer=None)` creates a new customized Tokenizer. `tokenizer` specifies the jieba.Tokenizer to internally use. `jieba.posseg.dt` is the default POSTokenizer. * Tags the POS of each word after segmentation, using labels compatible with ictclas. * Example: @@ -584,8 +600,8 @@ Use: `jieba.analyse.textrank(raw_text)`. 天安门 ns ``` -5) : Parallel Processing ------------ +5. Parallel Processing +---------------------- * Principle: Split target text by line, assign the lines into multiple Python processes, and then merge the results, which is considerably faster. * Based on the multiprocessing module of Python. * Usage: @@ -597,8 +613,10 @@ Use: `jieba.analyse.textrank(raw_text)`. * Result: On a four-core 3.4GHz Linux machine, do accurate word segmentation on Complete Works of Jin Yong, and the speed reaches 1MB/s, which is 3.3 times faster than the single-process version. -6) : Tokenize: return words with position ----------------------------------- +* **Note** that parallel processing supports only default tokenizers, `jieba.dt` and `jieba.posseg.dt`. + +6. Tokenize: return words with position +---------------------------------------- * The input must be unicode * Default mode @@ -634,13 +652,13 @@ word 有限公司 start: 6 end:10 ``` -7) : ChineseAnalyzer for Whoosh --------------------------------------------- +7. 
ChineseAnalyzer for Whoosh +------------------------------- * `from jieba.analyse import ChineseAnalyzer` * Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py -8) : Command Line Interface -------------------- +8. Command Line Interface +-------------------------------- $> python -m jieba --help usage: python -m jieba [options] filename @@ -679,7 +697,8 @@ You can also specify the dictionary (not supported before version 0.28) : Using Other Dictionaries -======== +=========================== + It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download: 1. A smaller dictionary for a smaller memory footprint: diff --git a/jieba/__init__.py b/jieba/__init__.py index d64d16c..a00ae52 100644 --- a/jieba/__init__.py +++ b/jieba/__init__.py @@ -6,503 +6,564 @@ import re import os import sys import time -import tempfile -import marshal -from math import log -import threading -from functools import wraps import logging +import marshal +import tempfile +import threading +from math import log from hashlib import md5 from ._compat import * from . import finalseg -DICTIONARY = "dict.txt" -DICT_LOCK = threading.RLock() -FREQ = {} # to be initialized -total = 0 -user_word_tag_tab = {} -initialized = False -pool = None -tmp_dir = None +if os.name == 'nt': + from shutil import move as _replace_file +else: + _replace_file = os.rename -_curpath = os.path.normpath( - os.path.join(os.getcwd(), os.path.dirname(__file__))) +_get_module_path = lambda path: os.path.normpath(os.path.join(os.getcwd(), + os.path.dirname(__file__), path)) +_get_abs_path = lambda path: os.path.normpath(os.path.join(os.getcwd(), path)) + +DEFAULT_DICT = _get_module_path("dict.txt") log_console = logging.StreamHandler(sys.stderr) -logger = logging.getLogger(__name__) -logger.setLevel(logging.DEBUG) -logger.addHandler(log_console) +default_logger = logging.getLogger(__name__) +default_logger.setLevel(logging.DEBUG) +default_logger.addHandler(log_console) +DICT_WRITING = {} -def setLogLevel(log_level): - global logger - logger.setLevel(log_level) - - -def gen_pfdict(f_name): - lfreq = {} - ltotal = 0 - with open(f_name, 'rb') as f: - lineno = 0 - for line in f.read().rstrip().decode('utf-8').splitlines(): - lineno += 1 - try: - word, freq = line.split(' ')[:2] - freq = int(freq) - lfreq[word] = freq - ltotal += freq - for ch in xrange(len(word)): - wfrag = word[:ch + 1] - if wfrag not in lfreq: - lfreq[wfrag] = 0 - except ValueError as e: - logger.debug('%s at line %s %s' % (f_name, lineno, line)) - raise e - return lfreq, ltotal - - -def initialize(dictionary=None): - global FREQ, total, initialized, DICTIONARY, DICT_LOCK, tmp_dir - if not dictionary: - dictionary = DICTIONARY - with DICT_LOCK: - if initialized: - return - - abs_path = os.path.join(_curpath, dictionary) - logger.debug("Building prefix dict from %s ..." 
% abs_path) - t1 = time.time() - # default dictionary - if abs_path == os.path.join(_curpath, "dict.txt"): - cache_file = os.path.join(tmp_dir if tmp_dir else tempfile.gettempdir(),"jieba.cache") - else: # custom dictionary - cache_file = os.path.join(tmp_dir if tmp_dir else tempfile.gettempdir(),"jieba.u%s.cache" % md5( - abs_path.encode('utf-8', 'replace')).hexdigest()) - - load_from_cache_fail = True - if os.path.isfile(cache_file) and os.path.getmtime(cache_file) > os.path.getmtime(abs_path): - logger.debug("Loading model from cache %s" % cache_file) - try: - with open(cache_file, 'rb') as cf: - FREQ, total = marshal.load(cf) - load_from_cache_fail = False - except Exception: - load_from_cache_fail = True - - if load_from_cache_fail: - FREQ, total = gen_pfdict(abs_path) - logger.debug("Dumping model to file cache %s" % cache_file) - try: - fd, fpath = tempfile.mkstemp() - with os.fdopen(fd, 'wb') as temp_cache_file: - marshal.dump((FREQ, total), temp_cache_file) - if os.name == 'nt': - from shutil import move as replace_file - else: - replace_file = os.rename - replace_file(fpath, cache_file) - except Exception: - logger.exception("Dump cache file failed.") - - initialized = True - - logger.debug("Loading model cost %s seconds." % (time.time() - t1)) - logger.debug("Prefix dict has been built succesfully.") - - -def require_initialized(fn): - - @wraps(fn) - def wrapped(*args, **kwargs): - global initialized - if initialized: - return fn(*args, **kwargs) - else: - initialize(DICTIONARY) - return fn(*args, **kwargs) - - return wrapped - - -def __cut_all(sentence): - dag = get_DAG(sentence) - old_j = -1 - for k, L in iteritems(dag): - if len(L) == 1 and k > old_j: - yield sentence[k:L[0] + 1] - old_j = L[0] - else: - for j in L: - if j > k: - yield sentence[k:j + 1] - old_j = j - - -def calc(sentence, DAG, route): - N = len(sentence) - route[N] = (0, 0) - logtotal = log(total) - for idx in xrange(N - 1, -1, -1): - route[idx] = max((log(FREQ.get(sentence[idx:x + 1]) or 1) - - logtotal + route[x + 1][0], x) for x in DAG[idx]) - - -@require_initialized -def get_DAG(sentence): - global FREQ - DAG = {} - N = len(sentence) - for k in xrange(N): - tmplist = [] - i = k - frag = sentence[k] - while i < N and frag in FREQ: - if FREQ[frag]: - tmplist.append(i) - i += 1 - frag = sentence[k:i + 1] - if not tmplist: - tmplist.append(k) - DAG[k] = tmplist - return DAG +pool = None re_eng = re.compile('[a-zA-Z0-9]', re.U) - -def __cut_DAG_NO_HMM(sentence): - DAG = get_DAG(sentence) - route = {} - calc(sentence, DAG, route) - x = 0 - N = len(sentence) - buf = '' - while x < N: - y = route[x][1] + 1 - l_word = sentence[x:y] - if re_eng.match(l_word) and len(l_word) == 1: - buf += l_word - x = y - else: - if buf: - yield buf - buf = '' - yield l_word - x = y - if buf: - yield buf - buf = '' - - -def __cut_DAG(sentence): - DAG = get_DAG(sentence) - route = {} - calc(sentence, DAG, route=route) - x = 0 - buf = '' - N = len(sentence) - while x < N: - y = route[x][1] + 1 - l_word = sentence[x:y] - if y - x == 1: - buf += l_word - else: - if buf: - if len(buf) == 1: - yield buf - buf = '' - else: - if not FREQ.get(buf): - recognized = finalseg.cut(buf) - for t in recognized: - yield t - else: - for elem in buf: - yield elem - buf = '' - yield l_word - x = y - - if buf: - if len(buf) == 1: - yield buf - elif not FREQ.get(buf): - recognized = finalseg.cut(buf) - for t in recognized: - yield t - else: - for elem in buf: - yield elem - +# \u4E00-\u9FA5a-zA-Z0-9+#&\._ : All non-space characters. 
Will be handled with re_han +# \r\n|\s : whitespace characters. Will not be handled. re_han_default = re.compile("([\u4E00-\u9FA5a-zA-Z0-9+#&\._]+)", re.U) re_skip_default = re.compile("(\r\n|\s)", re.U) re_han_cut_all = re.compile("([\u4E00-\u9FA5]+)", re.U) re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U) +def setLogLevel(log_level): + global logger + default_logger.setLevel(log_level) -def cut(sentence, cut_all=False, HMM=True): - ''' - The main function that segments an entire sentence that contains - Chinese characters into seperated words. +class Tokenizer(object): - Parameter: - - sentence: The str(unicode) to be segmented. - - cut_all: Model type. True for full pattern, False for accurate pattern. - - HMM: Whether to use the Hidden Markov Model. - ''' - sentence = strdecode(sentence) + def __init__(self, dictionary=DEFAULT_DICT): + self.lock = threading.RLock() + self.dictionary = _get_abs_path(dictionary) + self.FREQ = {} + self.total = 0 + self.user_word_tag_tab = {} + self.initialized = False + self.tmp_dir = None + self.cache_file = None - # \u4E00-\u9FA5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han - # \r\n|\s : whitespace characters. Will not be handled. + def __repr__(self): + return '' % self.dictionary - if cut_all: - re_han = re_han_cut_all - re_skip = re_skip_cut_all - else: - re_han = re_han_default - re_skip = re_skip_default - blocks = re_han.split(sentence) - if cut_all: - cut_block = __cut_all - elif HMM: - cut_block = __cut_DAG - else: - cut_block = __cut_DAG_NO_HMM - for blk in blocks: - if not blk: - continue - if re_han.match(blk): - for word in cut_block(blk): - yield word + def gen_pfdict(self, f_name): + lfreq = {} + ltotal = 0 + with open(f_name, 'rb') as f: + for lineno, line in enumerate(f, 1): + try: + line = line.strip().decode('utf-8') + word, freq = line.split(' ')[:2] + freq = int(freq) + lfreq[word] = freq + ltotal += freq + for ch in xrange(len(word)): + wfrag = word[:ch + 1] + if wfrag not in lfreq: + lfreq[wfrag] = 0 + except ValueError: + raise ValueError( + 'invalid dictionary entry in %s at Line %s: %s' % (f_name, lineno, line)) + return lfreq, ltotal + + def initialize(self, dictionary=None): + if dictionary: + abs_path = _get_abs_path(dictionary) + if self.dictionary == abs_path and self.initialized: + return + else: + self.dictionary = abs_path + self.initialized = False else: - tmp = re_skip.split(blk) - for x in tmp: - if re_skip.match(x): - yield x - elif not cut_all: - for xx in x: - yield xx - else: - yield x + abs_path = self.dictionary + with self.lock: + try: + with DICT_WRITING[abs_path]: + pass + except KeyError: + pass + if self.initialized: + return -def cut_for_search(sentence, HMM=True): - """ - Finer segmentation for search engines. - """ - words = cut(sentence, HMM=HMM) - for w in words: - if len(w) > 2: - for i in xrange(len(w) - 1): - gram2 = w[i:i + 2] - if FREQ.get(gram2): - yield gram2 - if len(w) > 3: - for i in xrange(len(w) - 2): - gram3 = w[i:i + 3] - if FREQ.get(gram3): - yield gram3 - yield w + default_logger.debug("Building prefix dict from %s ..." 
% abs_path) + t1 = time.time() + if self.cache_file: + cache_file = self.cache_file + # default dictionary + elif abs_path == DEFAULT_DICT: + cache_file = "jieba.cache" + else: # custom dictionary + cache_file = "jieba.u%s.cache" % md5( + abs_path.encode('utf-8', 'replace')).hexdigest() + cache_file = os.path.join( + self.tmp_dir or tempfile.gettempdir(), cache_file) + load_from_cache_fail = True + if os.path.isfile(cache_file) and os.path.getmtime(cache_file) > os.path.getmtime(abs_path): + default_logger.debug( + "Loading model from cache %s" % cache_file) + try: + with open(cache_file, 'rb') as cf: + self.FREQ, self.total = marshal.load(cf) + load_from_cache_fail = False + except Exception: + load_from_cache_fail = True -@require_initialized -def load_userdict(f): - ''' - Load personalized dict to improve detect rate. + if load_from_cache_fail: + wlock = DICT_WRITING.get(abs_path, threading.RLock()) + DICT_WRITING[abs_path] = wlock + with wlock: + self.FREQ, self.total = self.gen_pfdict(abs_path) + default_logger.debug( + "Dumping model to file cache %s" % cache_file) + try: + fd, fpath = tempfile.mkstemp() + with os.fdopen(fd, 'wb') as temp_cache_file: + marshal.dump( + (self.FREQ, self.total), temp_cache_file) + _replace_file(fpath, cache_file) + except Exception: + default_logger.exception("Dump cache file failed.") - Parameter: - - f : A plain text file contains words and their ocurrences. + try: + del DICT_WRITING[abs_path] + except KeyError: + pass - Structure of dict file: - word1 freq1 word_type1 - word2 freq2 word_type2 - ... - Word type may be ignored - ''' - if isinstance(f, string_types): - f = open(f, 'rb') - content = f.read().decode('utf-8').lstrip('\ufeff') - line_no = 0 - for line in content.splitlines(): - try: - line_no += 1 - line = line.strip() - if not line: - continue - tup = line.split(" ") - add_word(*tup) - except Exception as e: - logger.debug('%s at line %s %s' % (f.name, line_no, line)) - raise e + self.initialized = True + default_logger.debug( + "Loading model cost %.3f seconds." % (time.time() - t1)) + default_logger.debug("Prefix dict has been built succesfully.") + def check_initialized(self): + if not self.initialized: + self.initialize() -@require_initialized -def add_word(word, freq=None, tag=None): - """ - Add a word to dictionary. + def calc(self, sentence, DAG, route): + N = len(sentence) + route[N] = (0, 0) + logtotal = log(self.total) + for idx in xrange(N - 1, -1, -1): + route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) - + logtotal + route[x + 1][0], x) for x in DAG[idx]) - freq and tag can be omitted, freq defaults to be a calculated value - that ensures the word can be cut out. 
- """ - global FREQ, total, user_word_tag_tab - word = strdecode(word) - if freq is None: - freq = suggest_freq(word, False) - else: - freq = int(freq) - FREQ[word] = freq - total += freq - if tag is not None: - user_word_tag_tab[word] = tag - for ch in xrange(len(word)): - wfrag = word[:ch + 1] - if wfrag not in FREQ: - FREQ[wfrag] = 0 + def get_DAG(self, sentence): + self.check_initialized() + DAG = {} + N = len(sentence) + for k in xrange(N): + tmplist = [] + i = k + frag = sentence[k] + while i < N and frag in self.FREQ: + if self.FREQ[frag]: + tmplist.append(i) + i += 1 + frag = sentence[k:i + 1] + if not tmplist: + tmplist.append(k) + DAG[k] = tmplist + return DAG + def __cut_all(self, sentence): + dag = self.get_DAG(sentence) + old_j = -1 + for k, L in iteritems(dag): + if len(L) == 1 and k > old_j: + yield sentence[k:L[0] + 1] + old_j = L[0] + else: + for j in L: + if j > k: + yield sentence[k:j + 1] + old_j = j -def del_word(word): - """ - Convenient function for deleting a word. - """ - add_word(word, 0) + def __cut_DAG_NO_HMM(self, sentence): + DAG = self.get_DAG(sentence) + route = {} + self.calc(sentence, DAG, route) + x = 0 + N = len(sentence) + buf = '' + while x < N: + y = route[x][1] + 1 + l_word = sentence[x:y] + if re_eng.match(l_word) and len(l_word) == 1: + buf += l_word + x = y + else: + if buf: + yield buf + buf = '' + yield l_word + x = y + if buf: + yield buf + buf = '' + def __cut_DAG(self, sentence): + DAG = self.get_DAG(sentence) + route = {} + self.calc(sentence, DAG, route) + x = 0 + buf = '' + N = len(sentence) + while x < N: + y = route[x][1] + 1 + l_word = sentence[x:y] + if y - x == 1: + buf += l_word + else: + if buf: + if len(buf) == 1: + yield buf + buf = '' + else: + if not self.FREQ.get(buf): + recognized = finalseg.cut(buf) + for t in recognized: + yield t + else: + for elem in buf: + yield elem + buf = '' + yield l_word + x = y -@require_initialized -def suggest_freq(segment, tune=False): - """ - Suggest word frequency to force the characters in a word to be - joined or splitted. + if buf: + if len(buf) == 1: + yield buf + elif not self.FREQ.get(buf): + recognized = finalseg.cut(buf) + for t in recognized: + yield t + else: + for elem in buf: + yield elem - Parameter: - - segment : The segments that the word is expected to be cut into, - If the word should be treated as a whole, use a str. - - tune : If True, tune the word frequency. + def cut(self, sentence, cut_all=False, HMM=True): + ''' + The main function that segments an entire sentence that contains + Chinese characters into seperated words. - Note that HMM may affect the final result. If the result doesn't change, - set HMM=False. - """ - ftotal = float(total) - freq = 1 - if isinstance(segment, string_types): - word = segment - for seg in cut(word, HMM=False): - freq *= FREQ.get(seg, 1) / ftotal - freq = max(int(freq*total) + 1, FREQ.get(word, 1)) - else: - segment = tuple(map(strdecode, segment)) - word = ''.join(segment) - for seg in segment: - freq *= FREQ.get(seg, 1) / ftotal - freq = min(int(freq*total), FREQ.get(word, 0)) - if tune: - add_word(word, freq) - return freq + Parameter: + - sentence: The str(unicode) to be segmented. + - cut_all: Model type. True for full pattern, False for accurate pattern. + - HMM: Whether to use the Hidden Markov Model. 
+ ''' + sentence = strdecode(sentence) - -__ref_cut = cut -__ref_cut_for_search = cut_for_search - - -def __lcut(sentence): - return list(__ref_cut(sentence, False)) - - -def __lcut_no_hmm(sentence): - return list(__ref_cut(sentence, False, False)) - - -def __lcut_all(sentence): - return list(__ref_cut(sentence, True)) - - -def __lcut_for_search(sentence): - return list(__ref_cut_for_search(sentence)) - - -@require_initialized -def enable_parallel(processnum=None): - global pool, cut, cut_for_search - if os.name == 'nt': - raise Exception("jieba: parallel mode only supports posix system") - from multiprocessing import Pool, cpu_count - if processnum is None: - processnum = cpu_count() - pool = Pool(processnum) - - def pcut(sentence, cut_all=False, HMM=True): - parts = strdecode(sentence).splitlines(True) if cut_all: - result = pool.map(__lcut_all, parts) - elif HMM: - result = pool.map(__lcut, parts) + re_han = re_han_cut_all + re_skip = re_skip_cut_all else: - result = pool.map(__lcut_no_hmm, parts) - for r in result: - for w in r: - yield w + re_han = re_han_default + re_skip = re_skip_default + if cut_all: + cut_block = self.__cut_all + elif HMM: + cut_block = self.__cut_DAG + else: + cut_block = self.__cut_DAG_NO_HMM + blocks = re_han.split(sentence) + for blk in blocks: + if not blk: + continue + if re_han.match(blk): + for word in cut_block(blk): + yield word + else: + tmp = re_skip.split(blk) + for x in tmp: + if re_skip.match(x): + yield x + elif not cut_all: + for xx in x: + yield xx + else: + yield x - def pcut_for_search(sentence): - parts = strdecode(sentence).splitlines(True) - result = pool.map(__lcut_for_search, parts) - for r in result: - for w in r: - yield w - - cut = pcut - cut_for_search = pcut_for_search - - -def disable_parallel(): - global pool, cut, cut_for_search - if pool: - pool.close() - pool = None - cut = __ref_cut - cut_for_search = __ref_cut_for_search - - -def set_dictionary(dictionary_path): - global initialized, DICTIONARY - with DICT_LOCK: - abs_path = os.path.normpath(os.path.join(os.getcwd(), dictionary_path)) - if not os.path.isfile(abs_path): - raise Exception("jieba: file does not exist: " + abs_path) - DICTIONARY = abs_path - initialized = False - - -def get_abs_path_dict(): - return os.path.join(_curpath, DICTIONARY) - - -def tokenize(unicode_sentence, mode="default", HMM=True): - """ - Tokenize a sentence and yields tuples of (word, start, end) - - Parameter: - - sentence: the str(unicode) to be segmented. - - mode: "default" or "search", "search" is for finer segmentation. - - HMM: whether to use the Hidden Markov Model. - """ - if not isinstance(unicode_sentence, text_type): - raise Exception("jieba: the input parameter should be unicode.") - start = 0 - if mode == 'default': - for w in cut(unicode_sentence, HMM=HMM): - width = len(w) - yield (w, start, start + width) - start += width - else: - for w in cut(unicode_sentence, HMM=HMM): - width = len(w) + def cut_for_search(self, sentence, HMM=True): + """ + Finer segmentation for search engines. 
+ """ + words = self.cut(sentence, HMM=HMM) + for w in words: if len(w) > 2: for i in xrange(len(w) - 1): gram2 = w[i:i + 2] if FREQ.get(gram2): - yield (gram2, start + i, start + i + 2) + yield gram2 if len(w) > 3: for i in xrange(len(w) - 2): gram3 = w[i:i + 3] if FREQ.get(gram3): - yield (gram3, start + i, start + i + 3) - yield (w, start, start + width) - start += width + yield gram3 + yield w + + def lcut(self, *args, **kwargs): + return list(self.cut(*args, **kwargs)) + + def lcut_for_search(self, *args, **kwargs): + return list(self.cut_for_search(*args, **kwargs)) + + _lcut = lcut + _lcut_for_search = lcut_for_search + + def _lcut_no_hmm(self, sentence): + return self.lcut(sentence, False, False) + + def _lcut_all(self, sentence): + return self.lcut(sentence, True) + + def _lcut_for_search_no_hmm(self, sentence): + return self.lcut_for_search(sentence, False) + + def get_abs_path_dict(self): + return _get_abs_path(self.dictionary) + + def load_userdict(self, f): + ''' + Load personalized dict to improve detect rate. + + Parameter: + - f : A plain text file contains words and their ocurrences. + + Structure of dict file: + word1 freq1 word_type1 + word2 freq2 word_type2 + ... + Word type may be ignored + ''' + self.check_initialized() + if isinstance(f, string_types): + f = open(f, 'rb') + for lineno, ln in enumerate(f, 1): + try: + line = ln.strip().decode('utf-8').lstrip('\ufeff') + if not line: + continue + tup = line.split(" ") + self.add_word(*tup) + except Exception: + raise ValueError( + 'invalid dictionary entry in %s at Line %s: %s' % ( + f.name, lineno, line)) + + def add_word(self, word, freq=None, tag=None): + """ + Add a word to dictionary. + + freq and tag can be omitted, freq defaults to be a calculated value + that ensures the word can be cut out. + """ + self.check_initialized() + word = strdecode(word) + if freq is None: + freq = self.suggest_freq(word, False) + else: + freq = int(freq) + self.FREQ[word] = freq + self.total += freq + if tag is not None: + self.user_word_tag_tab[word] = tag + for ch in xrange(len(word)): + wfrag = word[:ch + 1] + if wfrag not in self.FREQ: + self.FREQ[wfrag] = 0 + + def del_word(self, word): + """ + Convenient function for deleting a word. + """ + self.add_word(word, 0) + + def suggest_freq(self, segment, tune=False): + """ + Suggest word frequency to force the characters in a word to be + joined or splitted. + + Parameter: + - segment : The segments that the word is expected to be cut into, + If the word should be treated as a whole, use a str. + - tune : If True, tune the word frequency. + + Note that HMM may affect the final result. If the result doesn't change, + set HMM=False. + """ + self.check_initialized() + ftotal = float(self.total) + freq = 1 + if isinstance(segment, string_types): + word = segment + for seg in self.cut(word, HMM=False): + freq *= self.FREQ.get(seg, 1) / ftotal + freq = max(int(freq * self.total) + 1, self.FREQ.get(word, 1)) + else: + segment = tuple(map(strdecode, segment)) + word = ''.join(segment) + for seg in segment: + freq *= self.FREQ.get(seg, 1) / ftotal + freq = min(int(freq * self.total), self.FREQ.get(word, 0)) + if tune: + add_word(word, freq) + return freq + + def tokenize(self, unicode_sentence, mode="default", HMM=True): + """ + Tokenize a sentence and yields tuples of (word, start, end) + + Parameter: + - sentence: the str(unicode) to be segmented. + - mode: "default" or "search", "search" is for finer segmentation. + - HMM: whether to use the Hidden Markov Model. 
+ """ + if not isinstance(unicode_sentence, text_type): + raise ValueError("jieba: the input parameter should be unicode.") + start = 0 + if mode == 'default': + for w in self.cut(unicode_sentence, HMM=HMM): + width = len(w) + yield (w, start, start + width) + start += width + else: + for w in self.cut(unicode_sentence, HMM=HMM): + width = len(w) + if len(w) > 2: + for i in xrange(len(w) - 1): + gram2 = w[i:i + 2] + if self.FREQ.get(gram2): + yield (gram2, start + i, start + i + 2) + if len(w) > 3: + for i in xrange(len(w) - 2): + gram3 = w[i:i + 3] + if self.FREQ.get(gram3): + yield (gram3, start + i, start + i + 3) + yield (w, start, start + width) + start += width + + def set_dictionary(self, dictionary_path): + with self.lock: + abs_path = _get_abs_path(dictionary_path) + if not os.path.isfile(abs_path): + raise Exception("jieba: file does not exist: " + abs_path) + self.dictionary = abs_path + self.initialized = False + + +# default Tokenizer instance + +dt = Tokenizer() + +# global functions + +FREQ = dt.FREQ +add_word = dt.add_word +calc = dt.calc +cut = dt.cut +lcut = dt.lcut +cut_for_search = dt.cut_for_search +lcut_for_search = dt.lcut_for_search +del_word = dt.del_word +get_DAG = dt.get_DAG +get_abs_path_dict = dt.get_abs_path_dict +initialize = dt.initialize +load_userdict = dt.load_userdict +set_dictionary = dt.set_dictionary +suggest_freq = dt.suggest_freq +tokenize = dt.tokenize +user_word_tag_tab = dt.user_word_tag_tab + + +def _lcut_all(s): + return dt._lcut_all(s) + + +def _lcut(s): + return dt._lcut(s) + + +def _lcut_all(s): + return dt._lcut_all(s) + + +def _lcut_for_search(s): + return dt._lcut_for_search(s) + + +def _lcut_for_search_no_hmm(s): + return dt._lcut_for_search_no_hmm(s) + + +def _pcut(sentence, cut_all=False, HMM=True): + parts = strdecode(sentence).splitlines(True) + if cut_all: + result = pool.map(_lcut_all, parts) + elif HMM: + result = pool.map(_lcut, parts) + else: + result = pool.map(_lcut_no_hmm, parts) + for r in result: + for w in r: + yield w + + +def _pcut_for_search(sentence, HMM=True): + parts = strdecode(sentence).splitlines(True) + if HMM: + result = pool.map(_lcut_for_search, parts) + else: + result = pool.map(_lcut_for_search_no_hmm, parts) + for r in result: + for w in r: + yield w + + +def enable_parallel(processnum=None): + """ + Change the module's `cut` and `cut_for_search` functions to the + parallel version. + + Note that this only works using dt, custom Tokenizer + instances are not supported. 
+ """ + global pool, dt, cut, cut_for_search + from multiprocessing import cpu_count + if os.name == 'nt': + raise NotImplementedError( + "jieba: parallel mode only supports posix system") + else: + from multiprocessing import Pool + dt.check_initialized() + if processnum is None: + processnum = cpu_count() + pool = Pool(processnum) + cut = _pcut + cut_for_search = _pcut_for_search + + +def disable_parallel(): + global pool, dt, cut, cut_for_search + if pool: + pool.close() + pool = None + cut = dt.cut + cut_for_search = dt.cut_for_search diff --git a/jieba/analyse/__init__.py b/jieba/analyse/__init__.py index da2514c..f956ef5 100755 --- a/jieba/analyse/__init__.py +++ b/jieba/analyse/__init__.py @@ -1,103 +1,18 @@ -#encoding=utf-8 from __future__ import absolute_import -import jieba -import jieba.posseg -import os -from operator import itemgetter -from .textrank import textrank +from .tfidf import TFIDF +from .textrank import TextRank try: from .analyzer import ChineseAnalyzer except ImportError: pass -_curpath = os.path.normpath(os.path.join(os.getcwd(), os.path.dirname(__file__))) -abs_path = os.path.join(_curpath, "idf.txt") +default_tfidf = TFIDF() +default_textrank = TextRank() -STOP_WORDS = set(( - "the","of","is","and","to","in","that","we","for","an","are", - "by","be","as","on","with","can","if","from","which","you","it", - "this","then","at","have","all","not","one","has","or","that" -)) - -class IDFLoader: - def __init__(self): - self.path = "" - self.idf_freq = {} - self.median_idf = 0.0 - - def set_new_path(self, new_idf_path): - if self.path != new_idf_path: - content = open(new_idf_path, 'rb').read().decode('utf-8') - idf_freq = {} - lines = content.rstrip('\n').split('\n') - for line in lines: - word, freq = line.split(' ') - idf_freq[word] = float(freq) - median_idf = sorted(idf_freq.values())[len(idf_freq)//2] - self.idf_freq = idf_freq - self.median_idf = median_idf - self.path = new_idf_path - - def get_idf(self): - return self.idf_freq, self.median_idf - -idf_loader = IDFLoader() -idf_loader.set_new_path(abs_path) - -def set_idf_path(idf_path): - new_abs_path = os.path.normpath(os.path.join(os.getcwd(), idf_path)) - if not os.path.exists(new_abs_path): - raise Exception("jieba: path does not exist: " + new_abs_path) - idf_loader.set_new_path(new_abs_path) +extract_tags = tfidf = default_tfidf.extract_tags +set_idf_path = default_tfidf.set_idf_path +textrank = default_textrank.extract_tags def set_stop_words(stop_words_path): - global STOP_WORDS - abs_path = os.path.normpath(os.path.join(os.getcwd(), stop_words_path)) - if not os.path.exists(abs_path): - raise Exception("jieba: path does not exist: " + abs_path) - content = open(abs_path,'rb').read().decode('utf-8') - lines = content.replace("\r", "").split('\n') - for line in lines: - STOP_WORDS.add(line) - -def extract_tags(sentence, topK=20, withWeight=False, allowPOS=[]): - """ - Extract keywords from sentence using TF-IDF algorithm. - Parameter: - - topK: return how many top keywords. `None` for all possible words. - - withWeight: if True, return a list of (word, weight); - if False, return a list of words. - - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr']. - if the POS of w is not in this list,it will be filtered. 
- """ - global STOP_WORDS, idf_loader - - idf_freq, median_idf = idf_loader.get_idf() - - if allowPOS: - allowPOS = frozenset(allowPOS) - words = jieba.posseg.cut(sentence) - else: - words = jieba.cut(sentence) - freq = {} - for w in words: - if allowPOS: - if w.flag not in allowPOS: - continue - else: - w = w.word - if len(w.strip()) < 2 or w.lower() in STOP_WORDS: - continue - freq[w] = freq.get(w, 0.0) + 1.0 - total = sum(freq.values()) - for k in freq: - freq[k] *= idf_freq.get(k, median_idf) / total - - if withWeight: - tags = sorted(freq.items(), key=itemgetter(1), reverse=True) - else: - tags = sorted(freq, key=freq.__getitem__, reverse=True) - if topK: - return tags[:topK] - else: - return tags + default_tfidf.set_stop_words(stop_words_path) + default_textrank.set_stop_words(stop_words_path) diff --git a/jieba/analyse/analyzer.py b/jieba/analyse/analyzer.py index 46de250..7f5d8f1 100644 --- a/jieba/analyse/analyzer.py +++ b/jieba/analyse/analyzer.py @@ -1,7 +1,7 @@ -#encoding=utf-8 +# encoding=utf-8 from __future__ import unicode_literals -from whoosh.analysis import RegexAnalyzer,LowercaseFilter,StopFilter,StemFilter -from whoosh.analysis import Tokenizer,Token +from whoosh.analysis import RegexAnalyzer, LowercaseFilter, StopFilter, StemFilter +from whoosh.analysis import Tokenizer, Token from whoosh.lang.porter import stem import jieba @@ -15,12 +15,14 @@ STOP_WORDS = frozenset(('a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'can', accepted_chars = re.compile(r"[\u4E00-\u9FA5]+") + class ChineseTokenizer(Tokenizer): + def __call__(self, text, **kargs): words = jieba.tokenize(text, mode="search") token = Token() - for (w,start_pos,stop_pos) in words: - if not accepted_chars.match(w) and len(w)<=1: + for (w, start_pos, stop_pos) in words: + if not accepted_chars.match(w) and len(w) <= 1: continue token.original = token.text = w token.pos = start_pos @@ -28,7 +30,8 @@ class ChineseTokenizer(Tokenizer): token.endchar = stop_pos yield token + def ChineseAnalyzer(stoplist=STOP_WORDS, minsize=1, stemfn=stem, cachesize=50000): return (ChineseTokenizer() | LowercaseFilter() | - StopFilter(stoplist=stoplist,minsize=minsize) | - StemFilter(stemfn=stemfn, ignore=None,cachesize=cachesize)) + StopFilter(stoplist=stoplist, minsize=minsize) | + StemFilter(stemfn=stemfn, ignore=None, cachesize=cachesize)) diff --git a/jieba/analyse/textrank.py b/jieba/analyse/textrank.py index 94d7f1b..019a1cb 100644 --- a/jieba/analyse/textrank.py +++ b/jieba/analyse/textrank.py @@ -3,9 +3,10 @@ from __future__ import absolute_import, unicode_literals import sys -import collections from operator import itemgetter -import jieba.posseg as pseg +from collections import defaultdict +import jieba.posseg +from .tfidf import KeywordExtractor from .._compat import * @@ -13,7 +14,7 @@ class UndirectWeightedGraph: d = 0.85 def __init__(self): - self.graph = collections.defaultdict(list) + self.graph = defaultdict(list) def addEdge(self, start, end, weight): # use a tuple (start, end, weight) instead of a Edge object @@ -21,8 +22,8 @@ class UndirectWeightedGraph: self.graph[end].append((end, start, weight)) def rank(self): - ws = collections.defaultdict(float) - outSum = collections.defaultdict(float) + ws = defaultdict(float) + outSum = defaultdict(float) wsdef = 1.0 / (len(self.graph) or 1.0) for n, out in self.graph.items(): @@ -53,43 +54,51 @@ class UndirectWeightedGraph: return ws -def textrank(sentence, topK=10, withWeight=False, allowPOS=['ns', 'n', 'vn', 'v']): - """ - Extract keywords from sentence using 
TextRank algorithm. - Parameter: - - topK: return how many top keywords. `None` for all possible words. - - withWeight: if True, return a list of (word, weight); - if False, return a list of words. - - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v']. - if the POS of w is not in this list,it will be filtered. - """ - pos_filt = frozenset(allowPOS) - g = UndirectWeightedGraph() - cm = collections.defaultdict(int) - span = 5 - words = list(pseg.cut(sentence)) - for i in xrange(len(words)): - if words[i].flag in pos_filt: - for j in xrange(i + 1, i + span): - if j >= len(words): - break - if words[j].flag not in pos_filt: - continue - cm[(words[i].word, words[j].word)] += 1 +class TextRank(KeywordExtractor): - for terms, w in cm.items(): - g.addEdge(terms[0], terms[1], w) - nodes_rank = g.rank() - if withWeight: - tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True) - else: - tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True) - if topK: - return tags[:topK] - else: - return tags + def __init__(self): + self.tokenizer = self.postokenizer = jieba.posseg.dt + self.stop_words = self.STOP_WORDS.copy() + self.pos_filt = frozenset(('ns', 'n', 'vn', 'v')) + self.span = 5 -if __name__ == '__main__': - s = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元,增资后,吉林欧亚置业注册资本由7000万元增加到5亿元。吉林欧亚置业主要经营范围为房地产开发及百货零售等业务。目前在建吉林欧亚城市商业综合体项目。2013年,实现营业收入0万元,实现净利润-139.13万元。" - for x, w in textrank(s, withWeight=True): - print('%s %s' % (x, w)) + def pairfilter(self, wp): + return (wp.flag in self.pos_filt and len(wp.word.strip()) >= 2 + and wp.word.lower() not in self.stop_words) + + def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')): + """ + Extract keywords from sentence using TextRank algorithm. + Parameter: + - topK: return how many top keywords. `None` for all possible words. + - withWeight: if True, return a list of (word, weight); + if False, return a list of words. + - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v']. + if the POS of w is not in this list, it will be filtered. 
+ """ + self.pos_filt = frozenset(allowPOS) + g = UndirectWeightedGraph() + cm = defaultdict(int) + words = tuple(self.tokenizer.cut(sentence)) + for i, wp in enumerate(words): + if self.pairfilter(wp): + for j in xrange(i + 1, i + self.span): + if j >= len(words): + break + if not self.pairfilter(words[j]): + continue + cm[(wp.word, words[j].word)] += 1 + + for terms, w in cm.items(): + g.addEdge(terms[0], terms[1], w) + nodes_rank = g.rank() + if withWeight: + tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True) + else: + tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True) + if topK: + return tags[:topK] + else: + return tags + + extract_tags = textrank diff --git a/jieba/analyse/tfidf.py b/jieba/analyse/tfidf.py new file mode 100755 index 0000000..14abfb0 --- /dev/null +++ b/jieba/analyse/tfidf.py @@ -0,0 +1,111 @@ +# encoding=utf-8 +from __future__ import absolute_import +import os +import jieba +import jieba.posseg +from operator import itemgetter + +_get_module_path = lambda path: os.path.normpath(os.path.join(os.getcwd(), + os.path.dirname(__file__), path)) +_get_abs_path = jieba._get_abs_path + +DEFAULT_IDF = _get_module_path("idf.txt") + + +class KeywordExtractor(object): + + STOP_WORDS = set(( + "the", "of", "is", "and", "to", "in", "that", "we", "for", "an", "are", + "by", "be", "as", "on", "with", "can", "if", "from", "which", "you", "it", + "this", "then", "at", "have", "all", "not", "one", "has", "or", "that" + )) + + def set_stop_words(self, stop_words_path): + abs_path = _get_abs_path(stop_words_path) + if not os.path.isfile(abs_path): + raise Exception("jieba: file does not exist: " + abs_path) + content = open(abs_path, 'rb').read().decode('utf-8') + for line in content.splitlines(): + self.stop_words.add(line) + + def extract_tags(self, *args, **kwargs): + raise NotImplementedError + + +class IDFLoader(object): + + def __init__(self, idf_path=None): + self.path = "" + self.idf_freq = {} + self.median_idf = 0.0 + if idf_path: + self.set_new_path(idf_path) + + def set_new_path(self, new_idf_path): + if self.path != new_idf_path: + self.path = new_idf_path + content = open(new_idf_path, 'rb').read().decode('utf-8') + self.idf_freq = {} + for line in content.splitlines(): + word, freq = line.strip().split(' ') + self.idf_freq[word] = float(freq) + self.median_idf = sorted( + self.idf_freq.values())[len(self.idf_freq) // 2] + + def get_idf(self): + return self.idf_freq, self.median_idf + + +class TFIDF(KeywordExtractor): + + def __init__(self, idf_path=None): + self.tokenizer = jieba.dt + self.postokenizer = jieba.posseg.dt + self.stop_words = self.STOP_WORDS.copy() + self.idf_loader = IDFLoader(idf_path or DEFAULT_IDF) + self.idf_freq, self.median_idf = self.idf_loader.get_idf() + + def set_idf_path(self, idf_path): + new_abs_path = _get_abs_path(idf_path) + if not os.path.isfile(new_abs_path): + raise Exception("jieba: file does not exist: " + new_abs_path) + self.idf_loader.set_new_path(new_abs_path) + self.idf_freq, self.median_idf = self.idf_loader.get_idf() + + def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=()): + """ + Extract keywords from sentence using TF-IDF algorithm. + Parameter: + - topK: return how many top keywords. `None` for all possible words. + - withWeight: if True, return a list of (word, weight); + if False, return a list of words. + - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr']. + if the POS of w is not in this list,it will be filtered. 
+ """ + if allowPOS: + allowPOS = frozenset(allowPOS) + words = self.postokenizer.cut(sentence) + else: + words = self.tokenizer.cut(sentence) + freq = {} + for w in words: + if allowPOS: + if w.flag not in allowPOS: + continue + else: + w = w.word + if len(w.strip()) < 2 or w.lower() in self.stop_words: + continue + freq[w] = freq.get(w, 0.0) + 1.0 + total = sum(freq.values()) + for k in freq: + freq[k] *= self.idf_freq.get(k, self.median_idf) / total + + if withWeight: + tags = sorted(freq.items(), key=itemgetter(1), reverse=True) + else: + tags = sorted(freq, key=freq.__getitem__, reverse=True) + if topK: + return tags[:topK] + else: + return tags diff --git a/jieba/posseg/__init__.py b/jieba/posseg/__init__.py index 680050c..3133233 100644 --- a/jieba/posseg/__init__.py +++ b/jieba/posseg/__init__.py @@ -1,10 +1,9 @@ from __future__ import absolute_import, unicode_literals -import re import os -import jieba +import re import sys +import jieba import marshal -from functools import wraps from .._compat import * from .viterbi import viterbi @@ -24,23 +23,10 @@ re_num = re.compile("[\.0-9]+") re_eng1 = re.compile('^[a-zA-Z0-9]$', re.U) -def load_model(f_name, isJython=True): +def load_model(f_name): _curpath = os.path.normpath( os.path.join(os.getcwd(), os.path.dirname(__file__))) - - result = {} - with open(f_name, "rb") as f: - for line in f: - line = line.strip() - if not line: - continue - line = line.decode("utf-8") - word, _, tag = line.split(" ") - result[word] = tag - - if not isJython: - return result - + # For Jython start_p = {} abs_path = os.path.join(_curpath, PROB_START_P) with open(abs_path, 'rb') as f: @@ -64,29 +50,15 @@ def load_model(f_name, isJython=True): return state, start_p, trans_p, emit_p, result + if sys.platform.startswith("java"): - char_state_tab_P, start_P, trans_P, emit_P, word_tag_tab = load_model( - jieba.get_abs_path_dict()) + char_state_tab_P, start_P, trans_P, emit_P, word_tag_tab = load_model() else: from .char_state_tab import P as char_state_tab_P from .prob_start import P as start_P from .prob_trans import P as trans_P from .prob_emit import P as emit_P - word_tag_tab = load_model(jieba.get_abs_path_dict(), isJython=False) - - -def makesure_userdict_loaded(fn): - - @wraps(fn) - def wrapped(*args, **kwargs): - if jieba.user_word_tag_tab: - word_tag_tab.update(jieba.user_word_tag_tab) - jieba.user_word_tag_tab = {} - return fn(*args, **kwargs) - - return wrapped - class pair(object): @@ -110,154 +82,220 @@ class pair(object): return self.__unicode__().encode(arg) -def __cut(sentence): - prob, pos_list = viterbi( - sentence, char_state_tab_P, start_P, trans_P, emit_P) - begin, nexti = 0, 0 +class POSTokenizer(object): - for i, char in enumerate(sentence): - pos = pos_list[i][0] - if pos == 'B': - begin = i - elif pos == 'E': - yield pair(sentence[begin:i + 1], pos_list[i][1]) - nexti = i + 1 - elif pos == 'S': - yield pair(char, pos_list[i][1]) - nexti = i + 1 - if nexti < len(sentence): - yield pair(sentence[nexti:], pos_list[nexti][1]) + def __init__(self, tokenizer=None): + self.tokenizer = tokenizer or jieba.Tokenizer() + self.load_word_tag(self.tokenizer.get_abs_path_dict()) + def __repr__(self): + return '' % self.tokenizer -def __cut_detail(sentence): - blocks = re_han_detail.split(sentence) - for blk in blocks: - if re_han_detail.match(blk): - for word in __cut(blk): - yield word - else: - tmp = re_skip_detail.split(blk) - for x in tmp: - if x: - if re_num.match(x): - yield pair(x, 'm') - elif re_eng.match(x): - yield pair(x, 'eng') - else: - 
yield pair(x, 'x') + def __getattr__(self, name): + if name in ('cut_for_search', 'lcut_for_search', 'tokenize'): + # may be possible? + raise NotImplementedError + return getattr(self.tokenizer, name) + def initialize(self, dictionary=None): + self.tokenizer.initialize(dictionary) + self.load_word_tag(self.tokenizer.get_abs_path_dict()) -def __cut_DAG_NO_HMM(sentence): - DAG = jieba.get_DAG(sentence) - route = {} - jieba.calc(sentence, DAG, route) - x = 0 - N = len(sentence) - buf = '' - while x < N: - y = route[x][1] + 1 - l_word = sentence[x:y] - if re_eng1.match(l_word): - buf += l_word - x = y - else: - if buf: - yield pair(buf, 'eng') - buf = '' - yield pair(l_word, word_tag_tab.get(l_word, 'x')) - x = y - if buf: - yield pair(buf, 'eng') - buf = '' + def load_word_tag(self, f_name): + self.word_tag_tab = {} + with open(f_name, "rb") as f: + for lineno, line in enumerate(f, 1): + try: + line = line.strip().decode("utf-8") + if not line: + continue + word, _, tag = line.split(" ") + self.word_tag_tab[word] = tag + except Exception: + raise ValueError( + 'invalid POS dictionary entry in %s at Line %s: %s' % (f_name, lineno, line)) + def makesure_userdict_loaded(self): + if self.tokenizer.user_word_tag_tab: + self.word_tag_tab.update(self.tokenizer.user_word_tag_tab) + self.tokenizer.user_word_tag_tab = {} -def __cut_DAG(sentence): - DAG = jieba.get_DAG(sentence) - route = {} + def __cut(self, sentence): + prob, pos_list = viterbi( + sentence, char_state_tab_P, start_P, trans_P, emit_P) + begin, nexti = 0, 0 - jieba.calc(sentence, DAG, route) + for i, char in enumerate(sentence): + pos = pos_list[i][0] + if pos == 'B': + begin = i + elif pos == 'E': + yield pair(sentence[begin:i + 1], pos_list[i][1]) + nexti = i + 1 + elif pos == 'S': + yield pair(char, pos_list[i][1]) + nexti = i + 1 + if nexti < len(sentence): + yield pair(sentence[nexti:], pos_list[nexti][1]) - x = 0 - buf = '' - N = len(sentence) - while x < N: - y = route[x][1] + 1 - l_word = sentence[x:y] - if y - x == 1: - buf += l_word - else: - if buf: - if len(buf) == 1: - yield pair(buf, word_tag_tab.get(buf, 'x')) - elif not jieba.FREQ.get(buf): - recognized = __cut_detail(buf) - for t in recognized: - yield t - else: - for elem in buf: - yield pair(elem, word_tag_tab.get(elem, 'x')) - buf = '' - yield pair(l_word, word_tag_tab.get(l_word, 'x')) - x = y - - if buf: - if len(buf) == 1: - yield pair(buf, word_tag_tab.get(buf, 'x')) - elif not jieba.FREQ.get(buf): - recognized = __cut_detail(buf) - for t in recognized: - yield t - else: - for elem in buf: - yield pair(elem, word_tag_tab.get(elem, 'x')) - - -def __cut_internal(sentence, HMM=True): - sentence = strdecode(sentence) - blocks = re_han_internal.split(sentence) - if HMM: - __cut_blk = __cut_DAG - else: - __cut_blk = __cut_DAG_NO_HMM - - for blk in blocks: - if re_han_internal.match(blk): - for word in __cut_blk(blk): - yield word - else: - tmp = re_skip_internal.split(blk) - for x in tmp: - if re_skip_internal.match(x): - yield pair(x, 'x') - else: - for xx in x: - if re_num.match(xx): - yield pair(xx, 'm') + def __cut_detail(self, sentence): + blocks = re_han_detail.split(sentence) + for blk in blocks: + if re_han_detail.match(blk): + for word in self.__cut(blk): + yield word + else: + tmp = re_skip_detail.split(blk) + for x in tmp: + if x: + if re_num.match(x): + yield pair(x, 'm') elif re_eng.match(x): - yield pair(xx, 'eng') + yield pair(x, 'eng') else: - yield pair(xx, 'x') + yield pair(x, 'x') + + def __cut_DAG_NO_HMM(self, sentence): + DAG = 
self.tokenizer.get_DAG(sentence) + route = {} + self.tokenizer.calc(sentence, DAG, route) + x = 0 + N = len(sentence) + buf = '' + while x < N: + y = route[x][1] + 1 + l_word = sentence[x:y] + if re_eng1.match(l_word): + buf += l_word + x = y + else: + if buf: + yield pair(buf, 'eng') + buf = '' + yield pair(l_word, self.word_tag_tab.get(l_word, 'x')) + x = y + if buf: + yield pair(buf, 'eng') + buf = '' + + def __cut_DAG(self, sentence): + DAG = self.tokenizer.get_DAG(sentence) + route = {} + + self.tokenizer.calc(sentence, DAG, route) + + x = 0 + buf = '' + N = len(sentence) + while x < N: + y = route[x][1] + 1 + l_word = sentence[x:y] + if y - x == 1: + buf += l_word + else: + if buf: + if len(buf) == 1: + yield pair(buf, self.word_tag_tab.get(buf, 'x')) + elif not self.tokenizer.FREQ.get(buf): + recognized = self.__cut_detail(buf) + for t in recognized: + yield t + else: + for elem in buf: + yield pair(elem, self.word_tag_tab.get(elem, 'x')) + buf = '' + yield pair(l_word, self.word_tag_tab.get(l_word, 'x')) + x = y + + if buf: + if len(buf) == 1: + yield pair(buf, self.word_tag_tab.get(buf, 'x')) + elif not self.tokenizer.FREQ.get(buf): + recognized = self.__cut_detail(buf) + for t in recognized: + yield t + else: + for elem in buf: + yield pair(elem, self.word_tag_tab.get(elem, 'x')) + + def __cut_internal(self, sentence, HMM=True): + self.makesure_userdict_loaded() + sentence = strdecode(sentence) + blocks = re_han_internal.split(sentence) + if HMM: + cut_blk = self.__cut_DAG + else: + cut_blk = self.__cut_DAG_NO_HMM + + for blk in blocks: + if re_han_internal.match(blk): + for word in cut_blk(blk): + yield word + else: + tmp = re_skip_internal.split(blk) + for x in tmp: + if re_skip_internal.match(x): + yield pair(x, 'x') + else: + for xx in x: + if re_num.match(xx): + yield pair(xx, 'm') + elif re_eng.match(x): + yield pair(xx, 'eng') + else: + yield pair(xx, 'x') + + def _lcut_internal(self, sentence): + return list(self.__cut_internal(sentence)) + + def _lcut_internal_no_hmm(self, sentence): + return list(self.__cut_internal(sentence, False)) + + def cut(self, sentence, HMM=True): + for w in self.__cut_internal(sentence, HMM=HMM): + yield w + + def lcut(self, *args, **kwargs): + return list(self.cut(*args, **kwargs)) + +# default Tokenizer instance + +dt = POSTokenizer(jieba.dt) + +# global functions + +initialize = dt.initialize -def __lcut_internal(sentence): - return list(__cut_internal(sentence)) +def _lcut_internal(s): + return dt._lcut_internal(s) -def __lcut_internal_no_hmm(sentence): - return list(__cut_internal(sentence, False)) +def _lcut_internal_no_hmm(s): + return dt._lcut_internal_no_hmm(s) -@makesure_userdict_loaded def cut(sentence, HMM=True): + """ + Global `cut` function that supports parallel processing. + + Note that this only works using dt, custom POSTokenizer + instances are not supported. 
+ """ + global dt if jieba.pool is None: - for w in __cut_internal(sentence, HMM=HMM): + for w in dt.cut(sentence, HMM=HMM): yield w else: parts = strdecode(sentence).splitlines(True) if HMM: - result = jieba.pool.map(__lcut_internal, parts) + result = jieba.pool.map(_lcut_internal, parts) else: - result = jieba.pool.map(__lcut_internal_no_hmm, parts) + result = jieba.pool.map(_lcut_internal_no_hmm, parts) for r in result: for w in r: yield w + + +def lcut(sentence, HMM=True): + return list(cut(sentence, HMM)) diff --git a/test/demo.py b/test/demo.py index 84377ae..6ebb159 100644 --- a/test/demo.py +++ b/test/demo.py @@ -4,6 +4,12 @@ import sys sys.path.append("../") import jieba +import jieba.posseg +import jieba.analyse + +print('='*40) +print('1. 分词') +print('-'*40) seg_list = jieba.cut("我来到北京清华大学", cut_all=True) print("Full Mode: " + "/ ".join(seg_list)) # 全模式 @@ -16,3 +22,63 @@ print(", ".join(seg_list)) seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") # 搜索引擎模式 print(", ".join(seg_list)) + +print('='*40) +print('2. 添加自定义词典/调整词典') +print('-'*40) + +print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False))) +#如果/放到/post/中将/出错/。 +print(jieba.suggest_freq(('中', '将'), True)) +#494 +print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False))) +#如果/放到/post/中/将/出错/。 +print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False))) +#「/台/中/」/正确/应该/不会/被/切开 +print(jieba.suggest_freq('台中', True)) +#69 +print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False))) +#「/台中/」/正确/应该/不会/被/切开 + +print('='*40) +print('3. 关键词提取') +print('-'*40) +print(' TF-IDF') +print('-'*40) + +s = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元,增资后,吉林欧亚置业注册资本由7000万元增加到5亿元。吉林欧亚置业主要经营范围为房地产开发及百货零售等业务。目前在建吉林欧亚城市商业综合体项目。2013年,实现营业收入0万元,实现净利润-139.13万元。" +for x, w in jieba.analyse.extract_tags(s, withWeight=True): + print('%s %s' % (x, w)) + +print('-'*40) +print(' TextRank') +print('-'*40) + +for x, w in jieba.analyse.textrank(s, withWeight=True): + print('%s %s' % (x, w)) + +print('='*40) +print('4. 词性标注') +print('-'*40) + +words = jieba.posseg.cut("我爱北京天安门") +for w in words: + print('%s %s' % (w.word, w.flag)) + +print('='*40) +print('6. 
Tokenize: 返回词语在原文的起止位置') +print('-'*40) +print(' 默认模式') +print('-'*40) + +result = jieba.tokenize('永和服装饰品有限公司') +for tk in result: + print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2])) + +print('-'*40) +print(' 搜索模式') +print('-'*40) + +result = jieba.tokenize('永和服装饰品有限公司', mode='search') +for tk in result: + print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2])) diff --git a/test/test_lock.py b/test/test_lock.py new file mode 100644 index 0000000..b7fcc97 --- /dev/null +++ b/test/test_lock.py @@ -0,0 +1,42 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import jieba +import threading + +def inittokenizer(tokenizer, group): + print('===> Thread %s:%s started' % (group, threading.current_thread().ident)) + tokenizer.initialize() + print('<=== Thread %s:%s finished' % (group, threading.current_thread().ident)) + +tokrs1 = [jieba.Tokenizer() for n in range(5)] +tokrs2 = [jieba.Tokenizer('../extra_dict/dict.txt.small') for n in range(5)] + +thr1 = [threading.Thread(target=inittokenizer, args=(tokr, 1)) for tokr in tokrs1] +thr2 = [threading.Thread(target=inittokenizer, args=(tokr, 2)) for tokr in tokrs2] +for thr in thr1: + thr.start() +for thr in thr2: + thr.start() +for thr in thr1: + thr.join() +for thr in thr2: + thr.join() + +del tokrs1, tokrs2 + +print('='*40) + +tokr1 = jieba.Tokenizer() +tokr2 = jieba.Tokenizer('../extra_dict/dict.txt.small') + +thr1 = [threading.Thread(target=inittokenizer, args=(tokr1, 1)) for n in range(5)] +thr2 = [threading.Thread(target=inittokenizer, args=(tokr2, 2)) for n in range(5)] +for thr in thr1: + thr.start() +for thr in thr2: + thr.start() +for thr in thr1: + thr.join() +for thr in thr2: + thr.join()
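
The patch above moves jieba's module-level state into `Tokenizer` instances, with `jieba.dt` as the default and the old global functions mapped onto it. Below is a minimal sketch of the resulting multi-dictionary API; the dictionary path, cache directory, and sample sentences are illustrative assumptions, not part of the patch (the small dictionary ships under `extra_dict/` in the repository).

```python
# encoding=utf-8
import jieba

# Module-level functions are bound to the default tokenizer, jieba.dt.
print(jieba.lcut("我来到北京清华大学"))
print(jieba.lcut_for_search("小明硕士毕业于中国科学院计算所"))

# A second, independent tokenizer with its own dictionary and cache location;
# the path is assumed to point at the bundled small dictionary.
small = jieba.Tokenizer(dictionary='extra_dict/dict.txt.small')
small.tmp_dir = '/tmp'                     # cache dir for restricted file systems
small.add_word('云计算', freq=5, tag='n')   # per-instance vocabulary tweak
print(small.lcut("他来到了网易杭研大厦"))

# The default instance is unaffected by changes made on `small`.
print(jieba.lcut("他来到了网易杭研大厦"))
```

Note that, per the parallel-processing changes in this patch, `enable_parallel` only rebinds the global `cut`/`cut_for_search`, so custom instances such as `small` always run single-process.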
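
The keyword-extraction refactor keeps the old module-level calls but routes them through default `TFIDF` and `TextRank` instances, so several extractors with different IDF corpora or stop-word lists can coexist. A sketch under the assumption that the sample sentence and `topK` values are arbitrary:

```python
# encoding=utf-8
import jieba.analyse

s = ("此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元,增资后,"
     "吉林欧亚置业注册资本由7000万元增加到5亿元。")

# TF-IDF with the new allowPOS filter; withWeight=True yields (word, weight) pairs.
for word, weight in jieba.analyse.extract_tags(
        s, topK=5, withWeight=True, allowPOS=('ns', 'n', 'vn', 'v')):
    print('%s %.4f' % (word, weight))

# TextRank shares the call shape but filters POS by default.
print(jieba.analyse.textrank(s, topK=5))

# A standalone extractor instance; idf_path=None falls back to the bundled
# idf.txt, and a custom IDF corpus path could be passed instead.
tfidf = jieba.analyse.TFIDF(idf_path=None)
print(tfidf.extract_tags(s, topK=5))
```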
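
POS tagging follows the same pattern: `jieba.posseg.dt` is the default `POSTokenizer`, and a `POSTokenizer` can be bound to any `jieba.Tokenizer` instance. A sketch assuming the repository's `test/userdict.txt` is reachable from the working directory:

```python
# encoding=utf-8
import jieba
import jieba.posseg as pseg

# Module-level posseg functions use the default jieba.posseg.dt.
for w in pseg.cut("我爱北京天安门"):
    print('%s %s' % (w.word, w.flag))

# A POSTokenizer tied to its own Tokenizer (and therefore its own dictionary
# and user words).
tok = jieba.Tokenizer()
tok.load_userdict('test/userdict.txt')   # assumed relative path to the sample dict
postok = pseg.POSTokenizer(tokenizer=tok)
words = postok.lcut("李小福是创新办主任也是云计算方面的专家")
print(' '.join('%s/%s' % (w.word, w.flag) for w in words))
```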