
Elasticsearch 7.15.2 IK Chinese Analyzer: Customizing the Analyzer with an Extension Dictionary

2021-11-21 12:58:39

Tags: category, analyzer, 凯悦, seller, ik, score, dictionary, id, name



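Background: The two analyzers that IK provides (ik_smart and ik_max_word) do not recognize some newer words and sometimes fall short of real business needs. In those cases we can define a custom dictionary to close the gap.
Goal: Customize the Chinese analyzer so that it supports an extended vocabulary.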


I. Current Search Behavior
1. Search query
# Search for hotels related to 凯悦 (Hyatt)
GET /shop/_search
{
  "query":{
    "match": {"name":"凯悦"}
  }
}
2. Results
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 3.3362136,
    "hits" : [
      {
        "_index" : "shop",
        "_type" : "_doc",
        "_id" : "9",
        "_score" : 3.3362136,
        "_source" : {
          "price_per_man" : 176,
          "remark_score" : 2.2,
          "category_name" : "酒店",
          "@version" : "1",
          "seller_disabled_flag" : 0,
          "@timestamp" : "2021-11-21T04:10:03.916Z",
          "tags" : "落地大窗",
          "location" : "31.306172,121.525843",
          "seller_remark_score" : 3.0,
          "id" : 9,
          "name" : "凯悦酒店",
          "seller_id" : 17,
          "category_id" : 2
        }
      },
      {
        "_index" : "shop",
        "_type" : "_doc",
        "_id" : "10",
        "_score" : 2.836244,
        "_source" : {
          "price_per_man" : 182,
          "remark_score" : 0.5,
          "category_name" : "酒店",
          "@version" : "1",
          "seller_disabled_flag" : 0,
          "@timestamp" : "2021-11-21T04:10:03.918Z",
          "tags" : "自助餐",
          "location" : "31.196742,121.322846",
          "seller_remark_score" : 3.0,
          "id" : 10,
          "name" : "凯悦嘉轩酒店",
          "seller_id" : 17,
          "category_id" : 2
        }
      },
      {
        "_index" : "shop",
        "_type" : "_doc",
        "_id" : "11",
        "_score" : 2.836244,
        "_source" : {
          "price_per_man" : 74,
          "remark_score" : 1.0,
          "category_name" : "酒店",
          "@version" : "1",
          "seller_disabled_flag" : 0,
          "@timestamp" : "2021-11-21T04:10:03.920Z",
          "tags" : "自助餐",
          "location" : "31.156899,121.238362",
          "seller_remark_score" : 3.0,
          "id" : 11,
          "name" : "新虹桥凯悦酒店",
          "seller_id" : 17,
          "category_id" : 2
        }
      },
      {
        "_index" : "shop",
        "_type" : "_doc",
        "_id" : "12",
        "_score" : 2.638537,
        "_source" : {
          "price_per_man" : 71,
          "remark_score" : 2.0,
          "category_name" : "美食2",
          "@version" : "1",
          "seller_disabled_flag" : 0,
          "@timestamp" : "2021-11-21T04:10:03.923Z",
          "tags" : "有包厢",
          "location" : "30.679819,121.651921",
          "seller_remark_score" : 3.0,
          "id" : 12,
          "name" : "凯悦咖啡(新建西路店)",
          "seller_id" : 17,
          "category_id" : 1
        }
      },
      {
        "_index" : "shop",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.3119392,
        "_source" : {
          "price_per_man" : 152,
          "remark_score" : 2.0,
          "category_name" : "美食2",
          "@version" : "1",
          "seller_disabled_flag" : 0,
          "@timestamp" : "2021-11-21T04:10:03.907Z",
          "tags" : "落地大窗 有WIFI",
          "location" : "31.306419,121.524878",
          "seller_remark_score" : 2.0,
          "id" : 4,
          "name" : "花悦庭果木烤鸭",
          "seller_id" : 2,
          "category_id" : 1
        }
      }
    ]
  }
}
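3. Analysis

The results above contain one unexpected hit. Its name does not contain the keyword "凯悦", yet the search still returns it and it is shown on the page, which is not the expected search result.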

 {
        "_index" : "shop",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.3119392,
        "_source" : {
          "price_per_man" : 152,
          "remark_score" : 2.0,
          "category_name" : "美食2",
          "@version" : "1",
          "seller_disabled_flag" : 0,
          "@timestamp" : "2021-11-21T04:10:03.907Z",
          "tags" : "落地大窗 有WIFI",
          "location" : "31.306419,121.524878",
          "seller_remark_score" : 2.0,
          "id" : 4,
          "name" : "花悦庭果木烤鸭",
          "seller_id" : 2,
          "category_id" : 1
        }
      }
4. ES IK analysis
# Inspect how 凯悦 is tokenized
GET /shop/_analyze
{
  "analyzer": "ik_smart",
  "text": "凯悦"
}
5. IK analysis result and explanation
{
  "tokens" : [
    {
      "token" : "凯",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "悦",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    }
  ]
}

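As the output shows, the ik_smart analyzer splits the text into two single-character tokens, "凯" and "悦", instead of treating "凯悦" as one term. The main reason is that the dictionary of the IK Chinese analyzer installed in ES does not contain the word "凯悦". Because a match query combines the analyzed terms with OR by default, a document that contains only "悦", such as 花悦庭果木烤鸭, still matches.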
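The effect can be reproduced without a cluster. The sketch below is only an illustration: single_char_tokens is a toy stand-in that splits every character into its own token (it is not the real IK algorithm), and match_or imitates the OR semantics of a match query. Both function names are made up for this example; the shop names come from the results above.

```python
def single_char_tokens(text):
    """Toy stand-in for an analyzer that does not know the word:
    every non-space character becomes its own token (NOT the real IK algorithm)."""
    return [ch for ch in text if not ch.isspace()]


def match_or(query, names, analyze):
    """Return the names that share at least one token with the query (OR semantics)."""
    query_terms = set(analyze(query))
    return [name for name in names if query_terms & set(analyze(name))]


names = [
    "凯悦酒店",
    "凯悦嘉轩酒店",
    "新虹桥凯悦酒店",
    "凯悦咖啡(新建西路店)",
    "花悦庭果木烤鸭",
]

# The query is split into "凯" and "悦"; any name containing either one matches.
hits = match_or("凯悦", names, single_char_tokens)
print(hits)
```

With single-character tokens, 花悦庭果木烤鸭 is among the hits purely because it shares the character "悦" with the query.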

II. Customizing the Analyzer
2.1. Add a new dictionary file
cd /app/elasticsearch-7.15.2/config/analysis-ik/
vim new_word.dic

Add the custom word:

凯悦
2.2. Dictionary configuration

Have IK load our custom dictionary file:

vim IKAnalyzer.cfg.xml

Contents:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
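        <comment>IK Analyzer extension configuration</comment>
        <!-- Users can configure their own extension dictionaries here -->
        <entry key="ext_dict">new_word.dic</entry>
        <!-- Users can configure their own extension stopword dictionaries here -->
        <entry key="ext_stopwords"></entry>
        <!-- Users can configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- Users can configure a remote extension stopword dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->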
</properties>
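Before restarting, it can be worth checking that the configuration really points at the dictionary and that the dictionary is readable. The following is a rough sketch under some assumptions: the helper names ext_dict_files and check_dictionary are invented for this example, the dictionary path is taken to be relative to the directory of IKAnalyzer.cfg.xml, the file is expected to be UTF-8 with one word per line, and several files are assumed to be separated by ";". The demo works on temporary copies rather than on the real config directory.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

SAMPLE_CFG = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <entry key="ext_dict">new_word.dic</entry>
        <entry key="ext_stopwords"></entry>
</properties>
"""


def ext_dict_files(cfg_path):
    """Return the paths listed under ext_dict, resolved against the config directory."""
    cfg_path = Path(cfg_path)
    root = ET.parse(cfg_path).getroot()
    files = []
    for entry in root.findall("entry"):
        if entry.get("key") == "ext_dict" and entry.text:
            # Assumption: multiple dictionary files are separated by ';'
            for name in entry.text.split(";"):
                if name.strip():
                    files.append(cfg_path.parent / name.strip())
    return files


def check_dictionary(path, word):
    """True if the file decodes as UTF-8 and contains `word` on a line of its own."""
    try:
        lines = Path(path).read_text(encoding="utf-8").splitlines()
    except (OSError, UnicodeDecodeError):
        return False
    return word in (line.strip() for line in lines)


# Demo on temporary copies of the two files from steps 2.1 and 2.2.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "IKAnalyzer.cfg.xml"
    cfg.write_text(SAMPLE_CFG, encoding="utf-8")
    (Path(tmp) / "new_word.dic").write_text("凯悦\n", encoding="utf-8")
    results = {p.name: check_dictionary(p, "凯悦") for p in ext_dict_files(cfg)}

print(results)
```

To check a real installation, the same two functions could be pointed at the IKAnalyzer.cfg.xml under config/analysis-ik instead of the temporary file.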


2.3. Restart ES 7
ps -ef | grep elasticsearch
# Replace <es_pid> with the Elasticsearch process ID from the output above
kill -9 <es_pid>
cd /app/elasticsearch-7.15.2/
bin/elasticsearch -d


2.4. Check the analysis result again
# Inspect how 凯悦 is tokenized
GET /shop/_analyze
{
  "analyzer": "ik_smart",
  "text": "凯悦"
}


GET /shop/_analyze
{
  "analyzer": "ik_max_word",
  "text": "凯悦"
}
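When comparing the output of these requests before and after the dictionary change, a small helper can make the check explicit. This is only a sketch: tokens_of and is_single_token are names invented here, and the sample response is the ik_smart output recorded in section I, taken before the dictionary was extended.

```python
import json


def tokens_of(analyze_response):
    """Extract the token strings from an _analyze response body."""
    return [t["token"] for t in analyze_response["tokens"]]


def is_single_token(analyze_response, word):
    """True if the analyzer returned `word` as exactly one token."""
    return tokens_of(analyze_response) == [word]


# Response recorded before the extension dictionary was configured.
before = json.loads("""
{
  "tokens" : [
    {"token" : "凯", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0},
    {"token" : "悦", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1}
  ]
}
""")

print(tokens_of(before))
print(is_single_token(before, "凯悦"))
```

Once the dictionary has been loaded, the same check applied to the new response should report that "凯悦" comes back as one token.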


2.5. Search again
GET /shop/_search
{
  "query":{
    "match": {"name":"凯悦"}
  }
}


GET /shop/_search


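The match query for 凯悦 now returns no hits at all, yet the plain GET /shop/_search shows that the documents are all still there.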

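2.6. Rebuild the analyzed index

The index was built at a time when the IK analyzer did not yet know the word "凯悦". For a record such as 凯悦酒店, the inverted index therefore already holds the single-character terms "凯" and "悦", because the extension dictionary had not been loaded when the index was created.

The problem now is that the index stores "凯" and "悦" as separate terms, while a search for 凯悦 is analyzed by the current search-time analyzer, which emits "凯悦" as one two-character term.

The query term has two characters, but the inverted index was built from single characters, so the search returns nothing.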
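The mismatch between a stale index and a new search-time analyzer can be sketched with a toy inverted index. Everything here is an illustration under assumptions: make_analyzer is a simple forward longest-match tokenizer that knows only the words it is given (it is not IK's actual algorithm), and build_index and search are minimal stand-ins for indexing and an OR-style match query.

```python
from collections import defaultdict


def make_analyzer(dictionary):
    """Toy tokenizer: longest dictionary word at each position, else a single character."""
    words = sorted(dictionary, key=len, reverse=True)

    def analyze(text):
        tokens, i = [], 0
        while i < len(text):
            token = next((w for w in words if text.startswith(w, i)), text[i])
            tokens.append(token)
            i += len(token)
        return tokens

    return analyze


def build_index(docs, analyze):
    """Map each token to the set of document ids whose name contains it."""
    index = defaultdict(set)
    for doc_id, name in docs.items():
        for token in analyze(name):
            index[token].add(doc_id)
    return index


def search(index, query, analyze):
    """OR-style lookup: union of the postings of every query token."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)


docs = {9: "凯悦酒店", 4: "花悦庭果木烤鸭"}
old_analyzer = make_analyzer(set())      # dictionary without 凯悦
new_analyzer = make_analyzer({"凯悦"})   # dictionary extended with 凯悦

stale_index = build_index(docs, old_analyzer)
before_change = search(stale_index, "凯悦", old_analyzer)   # single chars on both sides
after_restart = search(stale_index, "凯悦", new_analyzer)   # new query term, old index
fresh_index = build_index(docs, new_analyzer)
after_reindex = search(fresh_index, "凯悦", new_analyzer)   # both sides agree again

print(before_change, after_restart, after_reindex)
```

The three results mirror the three stages in this article: a false positive before the change, no hits right after the restart, and only the genuine match once the documents have been re-indexed.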

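Solutions:

Option 1 (for a first-time setup): delete the whole index, then run a full sync to rebuild it.
Option 2 (recommended): re-index only the documents whose indexed terms contain both "凯" and "悦", and leave all other documents untouched.

# Re-index the documents affected by the new word 凯悦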
POST /shop/_update_by_query
{
  "query": {
    "bool": {
      "must": [
        {"term":{"name":"凯"}},
        {"term":{"name":"悦"}}
      ]
    }
  }
}
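When more words are added to the dictionary later, the same request body can be generated instead of written by hand. The helper below is a hypothetical sketch, not part of any Elasticsearch client: reindex_query_for is a name invented here, and it assumes that the old analyzer indexed every character of the new word as its own single-character term, which the _analyze output confirms for 凯悦 but which may not hold for other words.

```python
import json


def reindex_query_for(field, word):
    """Build an _update_by_query body that selects documents whose `field`
    contains every character of `word` as a single-character term."""
    return {
        "query": {
            "bool": {
                # One required term clause per character of the new word
                "must": [{"term": {field: ch}} for ch in word]
            }
        }
    }


body = reindex_query_for("name", "凯悦")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON has the same structure as the request body above and can be sent to POST /shop/_update_by_query.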
2.7. Query once more
GET /shop/_search
{
  "query":{
    "match": {"name":"凯悦"}
  }
}
2.8. Analysis of the results
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 2.0709352,
    "hits" : [
      {
        "_index" : "shop",
        "_type" : "_doc",
        "_id" : "9",
        "_score" : 2.0709352,
        "_source" : {
          "price_per_man" : 176,
          "remark_score" : 2.2,
          "category_name" : "酒店",
          "@version" : "1",
          "seller_disabled_flag" : 0,
          "@timestamp" : "2021-11-21T04:10:03.916Z",
          "tags" : "落地大窗",
          "location" : "31.306172,121.525843",
          "seller_remark_score" : 3.0,
          "id" : 9,
          "name" : "凯悦酒店",
          "seller_id" : 17,
          "category_id" : 2
        }
      },
      {
        "_index" : "shop",
        "_type" : "_doc",
        "_id" : "10",
        "_score" : 1.7177677,
        "_source" : {
          "price_per_man" : 182,
          "remark_score" : 0.5,
          "category_name" : "酒店",
          "@version" : "1",
          "seller_disabled_flag" : 0,
          "@timestamp" : "2021-11-21T04:10:03.918Z",
          "tags" : "自助餐",
          "location" : "31.196742,121.322846",
          "seller_remark_score" : 3.0,
          "id" : 10,
          "name" : "凯悦嘉轩酒店",
          "seller_id" : 17,
          "category_id" : 2
        }
      },
      {
        "_index" : "shop",
        "_type" : "_doc",
        "_id" : "11",
        "_score" : 1.7177677,
        "_source" : {
          "price_per_man" : 74,
          "remark_score" : 1.0,
          "category_name" : "酒店",
          "@version" : "1",
          "seller_disabled_flag" : 0,
          "@timestamp" : "2021-11-21T04:10:03.920Z",
          "tags" : "自助餐",
          "location" : "31.156899,121.238362",
          "seller_remark_score" : 3.0,
          "id" : 11,
          "name" : "新虹桥凯悦酒店",
          "seller_id" : 17,
          "category_id" : 2
        }
      },
      {
        "_index" : "shop",
        "_type" : "_doc",
        "_id" : "12",
        "_score" : 1.5828056,
        "_source" : {
          "price_per_man" : 71,
          "remark_score" : 2.0,
          "category_name" : "美食2",
          "@version" : "1",
          "seller_disabled_flag" : 0,
          "@timestamp" : "2021-11-21T04:10:03.923Z",
          "tags" : "有包厢",
          "location" : "30.679819,121.651921",
          "seller_remark_score" : 3.0,
          "id" : 12,
          "name" : "凯悦咖啡(新建西路店)",
          "seller_id" : 17,
          "category_id" : 1
        }
      }
    ]
  }
}

The results now match expectations: all four hits contain "凯悦", and the unrelated 花悦庭果木烤鸭 no longer appears.

Source: https://blog.csdn.net/weixin_40816738/article/details/121450893
