Mirror of https://github.com/fxsjy/jieba.git (synced 2025-07-10 00:01:33 +08:00)
Commit 59aa8b69b1

README.md
@@ -16,6 +16,7 @@ jieba

 * Supports Traditional Chinese segmentation
 * Supports custom dictionaries
+* MIT license

 Online demo
 =========
@@ -93,9 +94,13 @@ print(", ".join(seg_list))
 2) : Add a custom dictionary
 ----------------
 
+### Load dictionary
 
 * Developers can specify their own custom dictionary so that it covers words not in the jieba dictionary. Although jieba can recognize new words, adding the words yourself ensures a higher accuracy.
 * Usage: jieba.load_userdict(file_name) # file_name is the path of the custom dictionary
-* The dictionary format is the same as that of `dict.txt`: one word per line; each line has three parts separated by spaces: the word, the word frequency, and the part-of-speech tag (optional)
+* The dictionary format is the same as that of `dict.txt`: one word per line; each line has three parts separated by spaces: the word, the word frequency (optional), and the part-of-speech tag (optional)
+* If the frequency is omitted, an automatically computed value that guarantees the word can be segmented out is used
 
 * Example:
 
 	* Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
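The format rules above are easiest to see with a tiny end-to-end sketch (not part of the original README; the file name `my_dict.txt` and its entries are invented for illustration):

```python
# encoding=utf-8
# Minimal sketch: write a two-entry user dictionary, then compare the
# segmentation of the same sentence before and after loading it.
import jieba

entries = "云计算 5\n创新办 3 i\n"            # word [frequency] [POS tag], space separated
with open("my_dict.txt", "w", encoding="utf-8") as f:   # hypothetical file name
    f.write(entries)

sent = "李小福是创新办主任也是云计算方面的专家"
print("/".join(jieba.cut(sent)))              # before: 创新办 / 云计算 are likely split apart

jieba.load_userdict("my_dict.txt")
print("/".join(jieba.cut(sent)))              # after: they should come out as single words
```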
@@ -107,6 +112,29 @@ print(", ".join(seg_list))

 * After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
 
+### Adjust the dictionary
+
+* Use `add_word(word, freq=None, tag=None)` and `del_word(word)` to modify the dictionary dynamically in your program.
+* Use `suggest_freq(segment, tune=True)` to tune the frequency of a single word so that it can (or cannot) be segmented out.
+
+* Note: the automatically computed frequency may not take effect when HMM-based new-word discovery is used.
+
+Code example:
+
+```pycon
+>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
+如果/放到/post/中将/出错/。
+>>> jieba.suggest_freq(('中', '将'), True)
+494
+>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
+如果/放到/post/中/将/出错/。
+>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
+「/台/中/」/正确/应该/不会/被/切开
+>>> jieba.suggest_freq('台中', True)
+69
+>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
+「/台中/」/正确/应该/不会/被/切开
+```
+
 * "Enhancing ambiguity-correction ability with a user-defined dictionary" --- https://github.com/fxsjy/jieba/issues/14
 
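The bullets above also mention `add_word` / `del_word`, which the pycon session does not exercise; a small sketch (the example word is chosen arbitrarily, and the expected effect is described in comments rather than captured output):

```python
# encoding=utf-8
import jieba

sent = "石墨烯是一种新材料"

print("/".join(jieba.cut(sent, HMM=False)))   # 石墨烯 is likely split, e.g. 石墨/烯/...

jieba.add_word("石墨烯")                       # freq and tag omitted: a frequency is computed
print("/".join(jieba.cut(sent, HMM=False)))   # 石墨烯 should now stay in one piece

jieba.del_word("石墨烯")                       # deleting it restores the previous behaviour
print("/".join(jieba.cut(sent, HMM=False)))
```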
@@ -362,10 +390,35 @@ https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big

 FAQ
 =========
-1. How is the model data generated? https://github.com/fxsjy/jieba/issues/7
-2. What license does this library use? https://github.com/fxsjy/jieba/issues/2
 
-* For more questions, see: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed
+## 1. How is the model data generated?
+
+See: https://github.com/fxsjy/jieba/issues/7
+
+## 2. "台中" always gets segmented into "台 中"? (and similar cases)
+
+P(台中) < P(台) × P(中): the frequency of "台中" is too low, so its probability of forming a word on its own is low.
+
+Solution: force the frequency higher
+
+`jieba.add_word('台中')` or `jieba.suggest_freq('台中', True)`
+
+## 3. "今天天气 不错" should be segmented as "今天 天气 不错"? (and similar cases)
+
+Solution: force the frequency lower
+
+`jieba.suggest_freq(('今天', '天气'), True)`
+
+or delete the word directly: `jieba.del_word('今天天气')`
+
+## 4. Segmentation produces words that are not in the dictionary and the result is unsatisfactory?
+
+Solution: turn off new-word discovery
+
+`jieba.cut('丰田太省了', HMM=False)`
+`jieba.cut('我们中出了一个叛徒', HMM=False)`
+
+**For more questions, see**: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed
 
 Revision history
 ==========
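The fix in question 4 is just a keyword argument; a minimal sketch contrasting the default behaviour with HMM disabled (the sentence is the one quoted in the FAQ; outputs are described, not reproduced):

```python
# encoding=utf-8
import jieba

sent = "我们中出了一个叛徒"

# Default: HMM-based new-word discovery may merge unseen character runs into "words".
print("/".join(jieba.cut(sent)))

# HMM=False: only dictionary words are used, which avoids such unwanted merges.
print("/".join(jieba.cut(sent, HMM=False)))
```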
@@ -380,9 +433,15 @@ jieba
 Features
 ========
 * Support three types of segmentation mode:
-* 1) Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
-* 2) Full Mode gets all the possible words from the sentence. Fast but not accurate.
-* 3) Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
+1. Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
+2. Full Mode gets all the possible words from the sentence. Fast but not accurate.
+3. Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
+
+* Supports Traditional Chinese
+* Supports customized dictionaries
+* MIT License
 
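A short sketch of the three modes listed above (the sample sentences are only illustrative; exact output depends on the dictionary in use):

```python
# encoding=utf-8
import jieba

sent = "我来到北京清华大学"

# Full Mode: every word the dictionary can find, fast but with overlaps.
print("Full Mode: " + "/ ".join(jieba.cut(sent, cut_all=True)))

# Accurate Mode (the default): the single most likely segmentation.
print("Accurate Mode: " + "/ ".join(jieba.cut(sent, cut_all=False)))

# Search Engine Mode: accurate mode, with long words cut again into shorter ones.
print("Search Mode: " + ", ".join(jieba.cut_for_search("小明硕士毕业于中国科学院计算所")))
```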
 Online demo
 =========
@@ -446,6 +505,8 @@ Output:
 2) : Add a custom dictionary
 ----------------------------
 
+### Load dictionary
 
 * Developers can specify their own custom dictionary to be included in the jieba default dictionary. Jieba is able to identify new words, but adding your own new words can ensure a higher accuracy.
 * Usage: `jieba.load_userdict(file_name) # file_name is the path of the custom dictionary`
 * The dictionary format is the same as that of `analyse/idf.txt`: one word per line; each line is divided into two parts, the first is the word itself, the other is the word frequency, separated by a space
@@ -459,6 +520,31 @@ Output:

 [After]: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
 
 
+### Modify dictionary
+
+* Use `add_word(word, freq=None, tag=None)` and `del_word(word)` to modify the dictionary dynamically in programs.
+* Use `suggest_freq(segment, tune=True)` to adjust the frequency of a single word so that it can (or cannot) be segmented.
+
+* Note that HMM may affect the final result.
+
+Example:
+
+```pycon
+>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
+如果/放到/post/中将/出错/。
+>>> jieba.suggest_freq(('中', '将'), True)
+494
+>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
+如果/放到/post/中/将/出错/。
+>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
+「/台/中/」/正确/应该/不会/被/切开
+>>> jieba.suggest_freq('台中', True)
+69
+>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
+「/台中/」/正确/应该/不会/被/切开
+```
+
 3) : Keyword Extraction
 -----------------------
 * `jieba.analyse.extract_tags(sentence,topK,withWeight) # needs to first import jieba.analyse`
jieba/__init__.py

@@ -240,8 +240,10 @@ re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)
 
 
 def cut(sentence, cut_all=False, HMM=True):
-    '''The main function that segments an entire sentence that contains
+    '''
+    The main function that segments an entire sentence that contains
     Chinese characters into seperated words.
 
     Parameter:
         - sentence: The str(unicode) to be segmented.
         - cut_all: Model type. True for full pattern, False for accurate pattern.
@@ -284,6 +286,9 @@ def cut(sentence, cut_all=False, HMM=True):
 
 
 def cut_for_search(sentence, HMM=True):
+    """
+    Finer segmentation for search engines.
+    """
     words = cut(sentence, HMM=HMM)
     for w in words:
         if len(w) > 2:
|
|||||||
|
|
||||||
@require_initialized
|
@require_initialized
|
||||||
def load_userdict(f):
|
def load_userdict(f):
|
||||||
''' Load personalized dict to improve detect rate.
|
'''
|
||||||
|
Load personalized dict to improve detect rate.
|
||||||
|
|
||||||
Parameter:
|
Parameter:
|
||||||
- f : A plain text file contains words and their ocurrences.
|
- f : A plain text file contains words and their ocurrences.
|
||||||
|
|
||||||
Structure of dict file:
|
Structure of dict file:
|
||||||
word1 freq1 word_type1
|
word1 freq1 word_type1
|
||||||
word2 freq2 word_type2
|
word2 freq2 word_type2
|
||||||
@@ -315,17 +323,31 @@ def load_userdict(f):
     content = f.read().decode('utf-8').lstrip('\ufeff')
     line_no = 0
     for line in content.splitlines():
-        line_no += 1
-        if not line.rstrip():
-            continue
-        tup = line.strip().split(" ")
-        if tup[1].isdigit():
+        try:
+            line_no += 1
+            line = line.strip()
+            if not line:
+                continue
+            tup = line.split(" ")
             add_word(*tup)
+        except Exception as e:
+            logger.debug('%s at line %s %s' % (f_name, lineno, line))
+            raise e
 
 
 @require_initialized
-def add_word(word, freq, tag=None):
+def add_word(word, freq=None, tag=None):
+    """
+    Add a word to dictionary.
+
+    freq and tag can be omitted, freq defaults to be a calculated value
+    that ensures the word can be cut out.
+    """
     global FREQ, total, user_word_tag_tab
+    word = strdecode(word)
+    if freq is None:
+        freq = suggest_freq(word, False)
+    else:
-    freq = int(freq)
+        freq = int(freq)
     FREQ[word] = freq
     total += freq
@@ -336,6 +358,46 @@ def add_word(word, freq=None, tag=None):
         if wfrag not in FREQ:
             FREQ[wfrag] = 0
 
 
+def del_word(word):
+    """
+    Convenient function for deleting a word.
+    """
+    add_word(word, 0)
+
+
+@require_initialized
+def suggest_freq(segment, tune=False):
+    """
+    Suggest word frequency to force the characters in a word to be
+    joined or splitted.
+
+    Parameter:
+        - segment : The segments that the word is expected to be cut into,
+                    If the word should be treated as a whole, use a str.
+        - tune : If True, tune the word frequency.
+
+    Note that HMM may affect the final result. If the result doesn't change,
+    set HMM=False.
+    """
+    ftotal = float(total)
+    freq = 1
+    if isinstance(segment, string_types):
+        word = segment
+        for seg in cut(word, HMM=False):
+            freq *= FREQ.get(seg, 1) / ftotal
+        freq = max(int(freq*total) + 1, FREQ.get(word, 1))
+    else:
+        segment = tuple(map(strdecode, segment))
+        word = ''.join(segment)
+        for seg in segment:
+            freq *= FREQ.get(seg, 1) / ftotal
+        freq = min(int(freq*total), FREQ.get(word, 0))
+    if tune:
+        add_word(word, freq)
+    return freq
+
+
 __ref_cut = cut
 __ref_cut_for_search = cut_for_search
 
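For intuition, the arithmetic in `suggest_freq` can be replayed with invented numbers: joining picks a frequency just above the product of the parts' probabilities rescaled to a count, splitting picks one at or below it. A toy sketch, using entirely made-up frequencies rather than the real dictionary:

```python
# Toy replay of suggest_freq's arithmetic with invented numbers.
total = 60000000                                   # pretend sum of all word frequencies
FREQ = {"台": 20000, "中": 90000, "台中": 3,        # invented dictionary entries
        "将": 50000, "中将": 1000}

# Join case, like suggest_freq('台中', True): the word currently loses to 台 + 中.
p = 1.0
for seg in ("台", "中"):                            # what cut('台中', HMM=False) would yield here
    p *= FREQ.get(seg, 1) / total
join_freq = max(int(p * total) + 1, FREQ.get("台中", 1))
print(join_freq)                                    # just enough for 台中 to win as a whole word

# Split case, like suggest_freq(('中', '将'), True): the word currently wins as a whole.
p = 1.0
for seg in ("中", "将"):
    p *= FREQ.get(seg, 1) / total
split_freq = min(int(p * total), FREQ.get("中将", 0))
print(split_freq)                                   # low enough that 中/将 can be preferred
```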
@@ -402,8 +464,8 @@ def set_dictionary(dictionary_path):
     global initialized, DICTIONARY
     with DICT_LOCK:
         abs_path = os.path.normpath(os.path.join(os.getcwd(), dictionary_path))
-        if not os.path.exists(abs_path):
-            raise Exception("jieba: path does not exist: " + abs_path)
+        if not os.path.isfile(abs_path):
+            raise Exception("jieba: file does not exist: " + abs_path)
         DICTIONARY = abs_path
         initialized = False
 
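For reference, `set_dictionary` is the public hook this check guards; a brief usage sketch (the path is only an example and assumes the big dictionary from `extra_dict` has been downloaded separately):

```python
# encoding=utf-8
import jieba

# Must point at an existing file, or the stricter isfile check above raises.
jieba.set_dictionary("extra_dict/dict.txt.big")     # example path, not bundled by default

# The new dictionary is loaded lazily, on the next call that needs it.
print("/".join(jieba.cut("今天天气不错")))
```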
@@ -413,7 +475,9 @@ def get_abs_path_dict():
 
 
 def tokenize(unicode_sentence, mode="default", HMM=True):
-    """Tokenize a sentence and yields tuples of (word, start, end)
+    """
+    Tokenize a sentence and yields tuples of (word, start, end)
 
     Parameter:
         - sentence: the str(unicode) to be segmented.
         - mode: "default" or "search", "search" is for finer segmentation.
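A quick sketch of calling `tokenize` (the sample sentence is chosen arbitrarily; the offsets are character positions in the input string):

```python
# encoding=utf-8
import jieba

# Default mode: each yielded item is a (word, start, end) tuple.
for word, start, end in jieba.tokenize("永和服装饰品有限公司"):
    print("%s\t start: %d \t end: %d" % (word, start, end))

# Search mode: long words are additionally reported as their shorter pieces.
for word, start, end in jieba.tokenize("永和服装饰品有限公司", mode="search"):
    print("%s\t start: %d \t end: %d" % (word, start, end))
```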
test/test_userdict.py

@@ -1,28 +1,48 @@
 #encoding=utf-8
-from __future__ import print_function
+from __future__ import print_function, unicode_literals
 import sys
 sys.path.append("../")
 import jieba
 jieba.load_userdict("userdict.txt")
 import jieba.posseg as pseg
 
-test_sent = "李小福是创新办主任也是云计算方面的专家; 什么是八一双鹿"
-test_sent += "例如我输入一个带“韩玉赏鉴”的标题,在自定义词库中也增加了此词为N类型"
+jieba.add_word('石墨烯')
+jieba.add_word('凱特琳')
+jieba.del_word('自定义词')
+
+test_sent = (
+"李小福是创新办主任也是云计算方面的专家; 什么是八一双鹿\n"
+"例如我输入一个带“韩玉赏鉴”的标题,在自定义词库中也增加了此词为N类\n"
+"「台中」正確應該不會被切開。mac上可分出「石墨烯」;此時又可以分出來凱特琳了。"
+)
 words = jieba.cut(test_sent)
-for w in words:
-    print(w)
+print('/'.join(words))
+
+print("="*40)
 
 result = pseg.cut(test_sent)
 
 for w in result:
     print(w.word, "/", w.flag, ", ", end=' ')
 
-print("\n========")
+print("\n" + "="*40)
 
 terms = jieba.cut('easy_install is great')
-for t in terms:
-    print(t)
-print('-------------------------')
+print('/'.join(terms))
 terms = jieba.cut('python 的正则表达式是好用的')
-for t in terms:
-    print(t)
+print('/'.join(terms))
+
+print("="*40)
+# test frequency tune
+testlist = [
+    ('今天天气不错', ('今天', '天气')),
+    ('如果放到post中将出错。', ('中', '将')),
+    ('我们中出了一个叛徒', ('中', '出')),
+]
+
+for sent, seg in testlist:
+    print('/'.join(jieba.cut(sent, HMM=False)))
+    word = ''.join(seg)
+    print('%s Before: %s, After: %s' % (word, jieba.FREQ[word], jieba.suggest_freq(seg, True)))
+    print('/'.join(jieba.cut(sent, HMM=False)))
+    print("-"*40)
test/userdict.txt

@@ -5,3 +5,4 @@ easy_install 3 eng
 好用 300
 韩玉赏鉴 3 nz
 八一双鹿 3 nz
+台中