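Problem description: after setting "enable_number_quantifier_recognize": false to turn off number and quantifier recognition, my custom analyzer still splits numbers out into separate tokens.
Environment: Elasticsearch and hanlp are both version 7.10.2.
Test steps: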
# Create a custom analyzer
PUT hanlp_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "52dzhp_hanlp_analyzer": {
          "type": "hanlp",
          "enable_custom_config": true,
          "enable_stop_dictionary": true,
          "enable_number_quantifier_recognize": false,
          "enable_custom_dictionary": true,
          "enable_place_recognize": false
        }
      }
    }
  }
}
# Tested by POSTing text that exactly matches the keyword: no problem
# Test case 1
POST hanlp_index/_analyze
{
  "text": "声环境质量标准GB30962008",
  "analyzer": "52dzhp_hanlp_analyzer"
}
# Test result 1
{
  "tokens": [
    { "token": "声环境质量标准GB30962008", "start_offset": 0, "end_offset": 17, "type": "eswi", "position": 0 }
  ]
}
--------- Test passed
# Appended the digit 1 to the end of the text as a second test
# Test case 2
POST hanlp_index/_analyze
{
  "text": "声环境质量标准GB309620081",
  "analyzer": "52dzhp_hanlp_analyzer"
}
# In this result the number is split out as its own token, even though "enable_number_quantifier_recognize" is set to false
# Test result 2
{
  "tokens": [
    { "token": "声环境质量标准", "start_offset": 0, "end_offset": 7, "type": "esw", "position": 0 },
    { "token": "GB", "start_offset": 7, "end_offset": 9, "type": "nx", "position": 1 },
    { "token": "309620081", "start_offset": 9, "end_offset": 18, "type": "m", "position": 2 }
  ]
}
--------- Test failed
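For anyone who wants to compare the two results programmatically, here is a minimal Python sketch. It only inspects the two `_analyze` responses copied from this issue and does not call Elasticsearch. The helper names `numeral_tokens` and `is_single_token` are hypothetical, and treating the token type "m" as the numeral tag is an assumption based on the HanLP part-of-speech tag set, not something confirmed in this thread.

```python
import json

TEXT_1 = "声环境质量标准GB30962008"
TEXT_2 = "声环境质量标准GB309620081"

# Responses copied from test result 1 and test result 2 in this issue.
RESULT_1 = json.loads("""
{"tokens": [
  {"token": "声环境质量标准GB30962008", "start_offset": 0, "end_offset": 17,
   "type": "eswi", "position": 0}
]}
""")
RESULT_2 = json.loads("""
{"tokens": [
  {"token": "声环境质量标准", "start_offset": 0, "end_offset": 7,
   "type": "esw", "position": 0},
  {"token": "GB", "start_offset": 7, "end_offset": 9,
   "type": "nx", "position": 1},
  {"token": "309620081", "start_offset": 9, "end_offset": 18,
   "type": "m", "position": 2}
]}
""")


def numeral_tokens(response, numeral_type="m"):
    """Return the text of every token tagged with the assumed numeral type."""
    return [t["token"] for t in response["tokens"] if t["type"] == numeral_type]


def is_single_token(response, text):
    """True if the analyzer returned exactly one token spanning the whole text."""
    tokens = response["tokens"]
    return (
        len(tokens) == 1
        and tokens[0]["start_offset"] == 0
        and tokens[0]["end_offset"] == len(text)
    )


if __name__ == "__main__":
    for label, text, response in (("test 1", TEXT_1, RESULT_1),
                                  ("test 2", TEXT_2, RESULT_2)):
        print(label,
              "single token:", is_single_token(response, text),
              "numeral tokens:", numeral_tokens(response))
```

In test 1 the single token covers all 17 characters, while in test 2 the trailing digits come back as a separate token of type "m", which is the behaviour this issue is asking about.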
@KennFalcon I would appreciate your advice on this. Thank you.
Did you add 声环境质量标准GB30962008 as a custom dictionary entry? Did you manage to solve the problem in the end?