details in textrank; update README

commit f29430f49e (parent a4fb439070)

README.md (30 lines changed)
@@ -1,7 +1,9 @@
 jieba
 ========
-"结巴"中文分词:做最好的 Python 中文分词组件
+“结巴”中文分词:做最好的 Python 中文分词组件
 
 "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
 
 - _Scroll down for English documentation._
+
+
@@ -29,22 +31,20 @@ http://jiebademo.ap01.aws.af.cm/
 Python 2.x
 -----------
 * 全自动安装:`easy_install jieba` 或者 `pip install jieba`
-* 半自动安装:先下载 http://pypi.python.org/pypi/jieba/ ,解压后运行 python setup.py install
+* 半自动安装:先下载 http://pypi.python.org/pypi/jieba/ ,解压后运行 `python setup.py install`
 * 手动安装:将 jieba 目录放置于当前目录或者 site-packages 目录
 * 通过 `import jieba` 来引用
 
 Python 3.x
 -----------
-* 目前 master 分支是只支持 Python2.x 的
-* Python 3.x 版本的分支也已经基本可用: https://github.com/fxsjy/jieba/tree/jieba3k
+* 目前 master 分支对 Python 2/3 兼容
 
 ```shell
 git clone https://github.com/fxsjy/jieba.git
-git checkout jieba3k
-python setup.py install
+python3 setup.py install
 ```
 
-* 或使用pip3安装: pip3 install jieba3k
+* 或使用pip3安装旧版本: pip3 install jieba3k
 
 算法
 ========
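A quick way to confirm either install worked (the sample sentence mirrors the README's own examples; the printed segmentation is illustrative):

```python
# -*- coding: utf-8 -*-
# Smoke test: segment a sample sentence in default (accurate) mode.
import jieba

words = jieba.cut("我来到北京清华大学")   # returns a lazy generator
print("/".join(words))                    # e.g. 我/来到/北京/清华大学
```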
@@ -58,7 +58,7 @@ python setup.py install
 --------
 * `jieba.cut` 方法接受三个输入参数: 需要分词的字符串;cut_all 参数用来控制是否采用全模式;HMM 参数用来控制是否使用 HMM 模型
 * `jieba.cut_for_search` 方法接受两个参数:需要分词的字符串;是否使用 HMM 模型。该方法适合用于搜索引擎构建倒排索引的分词,粒度比较细
-* 注意:待分词的字符串可以是 GBK 字符串、UTF-8 字符串或者 unicode
+* 待分词的字符串可以是 unicode 或 UTF-8 字符串、GBK 字符串。注意:不建议直接输入 GBK 字符串,可能无法预料地错误解码成 UTF-8
 * `jieba.cut` 以及 `jieba.cut_for_search` 返回的结构都是一个可迭代的 generator,可以使用 for 循环来获得分词后得到的每一个词语(unicode),也可以用 list(jieba.cut(...)) 转化为 list
 
 代码示例( 分词 )
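To make the parameters above concrete, a minimal sketch (the sentences are the README's own samples; exact output depends on the dictionary version):

```python
# -*- coding: utf-8 -*-
import jieba

s = "我来到北京清华大学"
# cut_all=True: full mode, emit every dictionary word found in the string.
print("/".join(jieba.cut(s, cut_all=True)))
# cut_all=False (default): accurate mode; HMM=True enables discovery of
# words that are not in the dictionary.
print("/".join(jieba.cut(s, cut_all=False, HMM=True)))
# Both functions return generators; wrap in list() to materialize.
print(list(jieba.cut_for_search("小明硕士毕业于中国科学院计算所")))
```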
@@ -384,6 +384,12 @@ Features
 * 2) Full Mode gets all the possible words from the sentence. Fast but not accurate.
 * 3) Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
 
+Online demo
+=========
+http://jiebademo.ap01.aws.af.cm/
+
+(Powered by Appfog)
+
 Usage
 ========
 * Fully automatic installation: `easy_install jieba` or `pip install jieba`
@@ -403,8 +409,9 @@ Main Functions
 1) : Cut
 --------
 * The `jieba.cut` function accepts three input parameters: the first parameter is the string to be cut; the second parameter is `cut_all`, controlling the cut mode; the third parameter is to control whether to use the Hidden Markov Model.
-* `jieba.cut` returns an generator, from which you can use a `for` loop to get the segmentation result (in unicode), or `list(jieba.cut( ... ))` to create a list.
 * `jieba.cut_for_search` accepts two parameters: the string to be cut; whether to use the Hidden Markov Model. This will cut the sentence into short words suitable for search engines.
+* The input string can be a unicode/str object, or a str/bytes object which is encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
+* `jieba.cut` and `jieba.cut_for_search` return a generator, from which you can use a `for` loop to get the segmentation result (in unicode), or `list(jieba.cut( ... ))` to create a list.
 
 **Code example: segmentation**
 
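A short sketch of the encoding guidance in the added bullets (behaviour per the README text above; the GBK line is left commented out on purpose):

```python
# -*- coding: utf-8 -*-
import jieba

text = u"我来到北京清华大学"
print(list(jieba.cut(text)))                  # unicode input: the safe path
print(list(jieba.cut(text.encode("utf-8")))) # UTF-8 bytes: also accepted
# GBK bytes are discouraged: they may be mistaken for UTF-8 and mis-decoded.
# print(list(jieba.cut(text.encode("gbk"))))
```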
@@ -610,8 +617,3 @@ Segmentation speed
 * 400 KB / Second in Default Mode
 * Test Env: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz;《围城》.txt
 
-Online demo
-=========
-http://jiebademo.ap01.aws.af.cm/
-
-(Powered by Appfog)
jieba/analyse/__init__.py

@@ -4,11 +4,11 @@ import jieba
 import jieba.posseg
 import os
 from operator import itemgetter
+from .textrank import textrank
 try:
     from .analyzer import ChineseAnalyzer
 except ImportError:
     pass
-from .textrank import textrank
 
 _curpath = os.path.normpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
 abs_path = os.path.join(_curpath, "idf.txt")
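Since this hunk re-exports `textrank` from `jieba.analyse`, keyword extraction becomes a one-liner. A usage sketch based on the signature visible in the textrank.py hunk further down (the sample text is illustrative):

```python
# -*- coding: utf-8 -*-
import jieba.analyse

text = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元"
# Signature per the diff below:
# textrank(sentence, topK=10, withWeight=False, allowPOS=['ns', 'n', 'vn', 'v'])
for kw in jieba.analyse.textrank(text, topK=5):
    print(kw)
```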
jieba/analyse/textrank.py

@@ -1,11 +1,12 @@
 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
 
-from __future__ import unicode_literals
+from __future__ import absolute_import, unicode_literals
 import sys
 import collections
 from operator import itemgetter
 import jieba.posseg as pseg
+from .._compat import *
 
 
 class UndirectWeightedGraph:
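The new `from .._compat import *` pulls in Python 2/3 shims such as the `xrange` and `itervalues` names used in the hunks below. `_compat` itself is not part of this diff, so the following is only a guess at the kind of module it is (a hypothetical sketch, not jieba's actual file):

```python
# Hypothetical _compat-style module: define Py3 equivalents; on Py2 the
# builtins already behave as required.
import sys

if sys.version_info[0] > 2:
    xrange = range                    # Py3: range is already lazy

    def itervalues(d):
        return iter(d.values())       # Py3: values() is a view
else:
    def itervalues(d):
        return d.itervalues()         # Py2: lazy value iterator
```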
@@ -28,8 +29,9 @@ class UndirectWeightedGraph:
             ws[n] = wsdef
             outSum[n] = sum((e[2] for e in out), 0.0)
 
-        sorted_keys = sorted(self.graph.keys())  # this line for build stable iteration
-        for x in range(10):  # 10 iters
+        # this line for build stable iteration
+        sorted_keys = sorted(self.graph.keys())
+        for x in xrange(10):  # 10 iters
             for n in sorted_keys:
                 s = 0
                 for e in self.graph[n]:
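Sorting the keys before the ten update passes pins down the dict iteration order, which Python does not guarantee, so the ranks become reproducible across runs. A toy rerun of the loop's logic on a three-node graph, assuming the usual damped TextRank update `ws[n] = (1 - d) + d * s` that this method applies:

```python
# Toy version of rank(): edges are (start, end, weight) triples per node.
d = 0.85
graph = {"a": [("a", "b", 1.0), ("a", "c", 1.0)],
         "b": [("b", "a", 1.0)],
         "c": [("c", "a", 1.0)]}
ws = dict.fromkeys(graph, 1.0 / len(graph))          # uniform start
outSum = {n: sum(e[2] for e in out) for n, out in graph.items()}

for _ in range(10):                   # fixed 10 iterations, as above
    for n in sorted(graph):           # sorted keys => stable visit order
        s = sum(e[2] / outSum[e[1]] * ws[e[1]] for e in graph[n])
        ws[n] = (1 - d) + d * s
print(ws)                             # "a" ends with the highest score
```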
@@ -38,7 +40,7 @@ class UndirectWeightedGraph:
 
         (min_rank, max_rank) = (sys.float_info[0], sys.float_info[3])
 
-        for _, w in ws.items():
+        for w in itervalues(ws):
             if w < min_rank:
                 min_rank = w
             elif w > max_rank:
@@ -66,9 +68,9 @@ def textrank(sentence, topK=10, withWeight=False, allowPOS=['ns', 'n', 'vn', 'v'
     cm = collections.defaultdict(int)
     span = 5
     words = list(pseg.cut(sentence))
-    for i in range(len(words)):
+    for i in xrange(len(words)):
         if words[i].flag in pos_filt:
-            for j in range(i + 1, i + span):
+            for j in xrange(i + 1, i + span):
                 if j >= len(words):
                     break
                 if words[j].flag not in pos_filt:
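For context, the loop above builds the co-occurrence graph that feeds `UndirectWeightedGraph`: every POS-filtered word is paired with its neighbours inside a window of `span = 5`. The same counting on plain tokens, with the POS filter omitted for brevity (illustrative only):

```python
# -*- coding: utf-8 -*-
import collections

cm = collections.defaultdict(int)
words = ["自然", "语言", "处理", "真", "有趣"]
span = 5
for i in range(len(words)):
    # Pair word i with up to span - 1 following words.
    for j in range(i + 1, min(i + span, len(words))):
        cm[(words[i], words[j])] += 1  # edge weight = co-occurrence count
print(dict(cm))
```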