Custom analyzer setting does not take effect #135

sam6513 · 2022-05-09T10:53:01Z

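Problem description: after setting number and quantifier recognition to "enable_number_quantifier_recognize": false, the custom analyzer still splits numbers out as separate tokens.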

Environment: elasticsearch and hanlp are both version 7.10.2

Steps to reproduce:

# Create an index with a custom analyzer
PUT hanlp_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "52dzhp_hanlp_analyzer": {
          "type": "hanlp",
          "enable_custom_config": true,
          "enable_stop_dictionary": true,
          "enable_number_quantifier_recognize": false,
          "enable_custom_dictionary": true,
          "enable_place_recognize": false
        }
      }
    }
  }
}
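For anyone who wants to reproduce this outside the Kibana console, the same requests could be scripted roughly as follows. This is only a sketch: `ES_URL` (a local, unauthenticated node on port 9200) and the helper names `index_settings`, `analyze_body`, and `send` are assumptions made for illustration, not part of the original report. The index name, analyzer name, and settings are copied from the request above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local test node without authentication
INDEX = "hanlp_index"
ANALYZER = "52dzhp_hanlp_analyzer"


def index_settings():
    """Return the index settings body exactly as posted in the issue."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    ANALYZER: {
                        "type": "hanlp",
                        "enable_custom_config": True,
                        "enable_stop_dictionary": True,
                        "enable_number_quantifier_recognize": False,
                        "enable_custom_dictionary": True,
                        "enable_place_recognize": False,
                    }
                }
            }
        }
    }


def analyze_body(text):
    """Return the _analyze request body for one test text."""
    return {"text": text, "analyzer": ANALYZER}


def send(method, path, body):
    """Send one JSON request to the cluster and return the decoded response."""
    request = urllib.request.Request(
        f"{ES_URL}/{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def create_index():
    """Create the index with the custom analyzer (needs a running cluster)."""
    return send("PUT", INDEX, index_settings())


def analyze(text):
    """Run the custom analyzer on one text (needs a running cluster)."""
    return send("POST", f"{INDEX}/_analyze", analyze_body(text))
```

Nothing here contacts the cluster until `create_index()` or `analyze(...)` is called, so the request bodies can be inspected first.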

# Tested by posting text that exactly matches the keyword: no problem

# Test input 1
POST hanlp_index/_analyze
{
  "text": "声环境质量标准GB30962008",
  "analyzer": "52dzhp_hanlp_analyzer"
}

# Test result 1
{
  "tokens" : [
    { "token" : "声环境质量标准GB30962008", "start_offset" : 0, "end_offset" : 17, "type" : "eswi", "position" : 0 }
  ]
}
--------- test passed

# Appended the digit 1 to the end of the text as a second test

# Test input 2
POST hanlp_index/_analyze
{
  "text": "声环境质量标准GB309620081",
  "analyzer": "52dzhp_hanlp_analyzer"
}

# In this result the number is split out as its own token, even though number and quantifier recognition is set to "enable_number_quantifier_recognize": false
# Test result 2
{
  "tokens" : [
    { "token" : "声环境质量标准", "start_offset" : 0, "end_offset" : 7, "type" : "esw", "position" : 0 },
    { "token" : "GB", "start_offset" : 7, "end_offset" : 9, "type" : "nx", "position" : 1 },
    { "token" : "309620081", "start_offset" : 9, "end_offset" : 18, "type" : "m", "position" : 2 }
  ]
}
--------- test failed
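The difference between the two results can also be checked mechanically. The sketch below copies the two responses reported above and lists the tokens whose type is `m`. Treating `m` as the numeral tag is an assumption based on the reported output, and `numeral_tokens` is a hypothetical helper name.

```python
# Responses copied verbatim from test result 1 and test result 2 in this issue.
result_1 = {
    "tokens": [
        {"token": "声环境质量标准GB30962008", "start_offset": 0,
         "end_offset": 17, "type": "eswi", "position": 0},
    ]
}
result_2 = {
    "tokens": [
        {"token": "声环境质量标准", "start_offset": 0,
         "end_offset": 7, "type": "esw", "position": 0},
        {"token": "GB", "start_offset": 7,
         "end_offset": 9, "type": "nx", "position": 1},
        {"token": "309620081", "start_offset": 9,
         "end_offset": 18, "type": "m", "position": 2},
    ]
}


def numeral_tokens(response, numeral_type="m"):
    """Return the token strings whose type equals numeral_type.

    Assumption: "m" marks a numeral, as the reported output suggests.
    """
    return [t["token"] for t in response["tokens"] if t["type"] == numeral_type]


print(numeral_tokens(result_1))  # []
print(numeral_tokens(result_2))  # ['309620081']
```

An empty list for the first text and a single standalone number for the second matches the behaviour described: the text that is matched as a whole stays in one token, while the text with the extra digit is split and its number is emitted separately.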

@KennFalcon I would appreciate your advice on this. Thank you.

chunpat · 2023-05-18T05:59:14Z

Did you add 声环境质量标准GB30962008 as a custom entry? Did you manage to solve the problem in the end?
