mirror of https://github.com/fxsjy/jieba.git (synced 2025-07-10 00:01:33 +08:00)
Fix __init__ "-" symbol issue
Fix the issue where the "-" symbol is split off as a separate token.
For example, for keywords such as chap-EX喬沛詩 and SK-II, the current version returns "chap", "-", "EX喬沛詩" and "SK", "-", "II".
After this change, the new version returns "chap-EX", "喬沛詩" and "SK-II".
Note: I used jieba.load_userdict() and added "chap-EX", "喬沛詩", and "SK-II" to userdict.txt.
parent 7653db2e33
commit 36a27302ce
@@ -40,7 +40,10 @@ re_eng = re.compile('[a-zA-Z0-9]', re.U)
 
 # \u4E00-\u9FD5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
 # \r\n|\s : whitespace characters. Will not be handled.
-re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
+# re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
+# Adding "-" symbol in re_han_default
+re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)
+
 re_skip_default = re.compile("(\r\n|\s)", re.U)
 re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)
 re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)
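To illustrate what the pattern change does on its own, here is a minimal sketch that uses only the standard `re` module. It does not import jieba. The two patterns are copied from the diff. The names `RE_HAN_OLD`, `RE_HAN_NEW`, and `blocks` are hypothetical and exist only for this example. It shows only the first, block-level split. The final segmentation into "chap-EX" and "喬沛詩" additionally depends on the user dictionary entries mentioned in the commit message.

```python
import re

# Patterns copied from the diff; backslashes are doubled so Python does not
# warn about invalid string escapes. Names are hypothetical, for illustration.
RE_HAN_OLD = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\\._%]+)", re.U)
RE_HAN_NEW = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\\._%\\-]+)", re.U)


def blocks(pattern, text):
    """Split text on the capturing pattern and drop the empty pieces."""
    return [piece for piece in pattern.split(text) if piece]


for sample in ("chap-EX喬沛詩", "SK-II"):
    print(sample)
    print("  old pattern:", blocks(RE_HAN_OLD, sample))
    print("  new pattern:", blocks(RE_HAN_NEW, sample))
```

With the old pattern, "-" falls outside the character class, so it comes back as a separate piece between two blocks. With the new pattern, each sample stays in a single block, which gives the dictionary lookup a chance to match hyphenated entries such as "SK-II".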