Thoughts on using the Elasticsearch synonym filter

2021-03-19 22:01:13

Tags: synonym, description, title, value, filter, elasticsearch, details, analyzer


ES synonym filter

One effective way to broaden recall is to add synonyms. Synonyms widen the search range, but they also introduce two problems:

  1. In a term query, the original word should score higher than its synonyms.

    # In the results, the original word and its synonyms turn out to have the same weight
    GET learning_test_03/_search
    {
      "_source": "post_title", 
      "explain": true, 
      "query": {
        "term": {
          "post_title.jieba_dic_all_synonym": {
            "value": "视图"
          }
        }
      }
    }
    
  2. match_phrase has the same problem: synonyms should score lower than the original word, but they do not.

GET learning_test_03/_search
{
  "_source": "post_title",
  "explain": true, 
  "query": {
      "match_phrase": {
          "post_title.jieba_dic_all_synonym": "插入流程图"
      }
  }
}

# result:
#  "description" : """weight(post_title.jieba_dic_all_synonym:"插入 (流程图 visio 略图 视图)" in 20533) [PerFieldSimilarity], result of:"""
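# As the description shows, 流程图 and its synonyms (visio, 略图, 视图) are treated identically. In a query, however, the original word should usually rank above the expanded synonyms.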


How synonyms interfere with scoring

Using an analyzer that includes a synonym filter:

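The official documentation provides the synonym filter and gives an example of applying it at index time. After some investigation, however, I concluded that an analyzer with a synonym filter is better suited to search than to indexing.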

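  1. Index-time synonyms increase the number of terms in the field (which makes the avgdl scoring parameter larger). More importantly, with a match query the matched termFreq grows to the number of synonyms, which distorts the score.
  2. If the synonyms change, every document that involves those synonyms has to be updated as well.
  3. When a query matches both an original word and its synonyms, the original word should usually score higher, but ES treats them all the same. One workaround is to define two fields, one using plain segmentation only and one using synonyms, and to give the plain-segmentation field a higher weight at search time. That, however, increases the storage needed for terms.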
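
The two-field workaround in point 3 can be sketched as a query body. This is only an illustration, not the author's configuration: the sub-field name `title.synonym` and the boost of 2.0 are assumptions, and the helper `build_weighted_query` is hypothetical. It just builds the JSON that would be sent to `_search`.

```python
import json


def build_weighted_query(text, exact_field="title",
                         synonym_field="title.synonym", exact_boost=2.0):
    """Build a bool query that scores plain-field matches above synonym-field matches.

    Both field names and the boost value are assumptions for illustration.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Field analyzed without synonyms: boosted so original words rank first.
                    {"match": {exact_field: {"query": text, "boost": exact_boost}}},
                    # Field analyzed with synonyms: widens recall at the default weight.
                    {"match": {synonym_field: {"query": text}}},
                ]
            }
        }
    }


body = build_weighted_query("学生")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

A document that contains the original word matches both clauses and collects both scores, while a document that matches only through a synonym collects just the unboosted one. The cost is the one the list already names: the same text is indexed twice.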

A demo to illustrate:

Contents of the synonym file:
工作,简历,招聘,入职
学校,老师,学生,操场
医院,护士,医生
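
To make point 1 above concrete, here is a small simulation of what this file does when it is applied both at index time and at search time. It is a rough model, not the Elasticsearch implementation: the tokenization of the title is assumed (the real jieba output may differ), and `load_groups` and `expand` are hypothetical helpers.

```python
from collections import Counter

SYNONYM_FILE = """\
工作,简历,招聘,入职
学校,老师,学生,操场
医院,护士,医生
"""


def load_groups(text):
    """Map every word to the full set of words on its line (equivalent synonyms)."""
    table = {}
    for line in text.splitlines():
        words = [w.strip() for w in line.split(",") if w.strip()]
        for word in words:
            table[word] = set(words)
    return table


def expand(tokens, table):
    """Replace each token with itself plus all of its synonyms."""
    expanded = []
    for token in tokens:
        expanded.extend(sorted(table.get(token, {token})))
    return expanded


table = load_groups(SYNONYM_FILE)

# Assumed tokenization of the title "怎么自定义功能区的学校".
doc_tokens = ["怎么", "自定义", "功能区", "的", "学校"]
indexed = Counter(expand(doc_tokens, table))

# The query word 学生 expands to the same group of four terms.
query_terms = expand(["学生"], table)

# Model of a synonym query: add up the frequency of every expanded term.
summed_freq = sum(indexed[t] for t in query_terms)
print(query_terms, summed_freq)
```

The title contains 学校 once, but after expansion on both sides the summed frequency equals the size of the synonym group.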

PUT /test_synonym_1
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "jieba_synonym": {
            "tokenizer": "jieba_search",
            "filter": [
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "synonyms/synonyms.txt"
          }
        }
      }
    }
  }
}

PUT test_synonym_1/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer" :"jieba_synonym"
    },
    "content" :{
      "type" :"text",
      "analyzer" :"jieba_search" 
    }
  }
}

POST test_synonym_1/_doc/1
{
  "title" :"插入流程图时怎么编辑工作",
  "content":"插入流程图时怎么编辑工作"
}

POST test_synonym_1/_doc/2
{
  "title" :"怎么自定义功能区的学校",
  "content":"怎么自定义功能区的学校"
}
POST test_synonym_1/_doc/3
{
  "title" :"如何在表格中加医院",
  "content":"如何在表格中加医院"
}
POST test_synonym_1/_doc/4
{
  "title" :"首页怎么关闭?",
  "content":"首页怎么关闭?"
}
POST test_synonym_1/_doc/5
{
  "title" :"修改的关系图怎么做成一整个图",
  "content":"修改的关系图怎么做成一整个图"
}
POST test_synonym_1/_doc/6
{
  "title" :"在哪里给文档命名",
  "content":"在哪里给文档命名"
}

GET test_synonym_1/_search
{
  "explain": true,
  "query": {
    "match": {
      "title": {
        "query": "表格学生",
        "analyzer": "jieba_synonym"
      }
    }
  }
}

# result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.7425265,
    "hits" : [
      {
        "_shard" : "[test_synonym_1][0]",
        "_node" : "EyNKn90XS1Otize_1yE7-w",
        "_index" : "test_synonym_1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 2.7425265,
        "_source" : {
          "title" : "怎么自定义功能区的学校",
          "content" : "怎么自定义功能区的学校"
        },
        "_explanation" : {
          "value" : 2.7425265,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 2.7425265,
              "description" : "weight(Synonym(title:学校 title:学生 title:操场 title:老师) in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 2.7425265,
                  "description" : "score(freq=4.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.5404451,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 6,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.80924857,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 4.0,
                          "description" : "termFreq=4.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 7.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[test_synonym_1][0]",
        "_node" : "EyNKn90XS1Otize_1yE7-w",
        "_index" : "test_synonym_1",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.7443275,
        "_source" : {
          "title" : "如何在表格中加医院",
          "content" : "如何在表格中加医院"
        },
        "_explanation" : {
          "value" : 1.7443275,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 1.7443275,
              "description" : "weight(title:表格 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 1.7443275,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.5404451,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 6,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.5147059,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 7.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}


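# The first hit scores higher, and its termFreq=4.0 is exactly the number of synonyms in the group. This is because the synonym filter was also applied at search time and expanded the original query.
# A term query does not have this problem, because a term query does not run the search text through an analyzer. But there is still no guarantee that a document matching the original word exactly scores higher than one matching only a synonym. For example, for the query 学校 the results may come back as (一个学生, 一个拥有大量面积的学校) instead of with the exact match first.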
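
The two scores can be checked by hand from the formulas printed in the explain output. The sketch below plugs in the values shown there (boost 2.2, n=1, N=6, dl=5, avgdl=7, k1=1.2, b=0.75); the function names are my own.

```python
import math


def bm25_idf(n, N):
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), as printed in the explain output."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_score(freq, n, N, dl, avgdl, boost=2.2):
    """score = boost * idf * tf; the boost is taken as-is from the explain output."""
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)


# Document 2: matched through the synonym group, freq=4.0
synonym_hit = bm25_score(freq=4.0, n=1, N=6, dl=5.0, avgdl=7.0)
# Document 3: matched on the literal term 表格, freq=1.0
plain_hit = bm25_score(freq=1.0, n=1, N=6, dl=5.0, avgdl=7.0)

print(round(synonym_hit, 4), round(plain_hit, 4))
```

Every input except freq is identical for the two documents, so the whole gap between 2.7425265 and 1.7443275 comes from the tf factor: 0.809 for freq=4 against 0.515 for freq=1.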

Thoughts on solving the problems above:

Do not use an analyzer with a synonym filter for indexing; use it for querying instead.

# analyzer is applied when data is indexed
# search_analyzer is applied at query time; if it is not set, it defaults to analyzer
PUT test_synonym_1/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "jieba_search",
      "search_analyzer": "jieba_synonym"
    },
    "content" :{
      "type" :"text",
      "analyzer" :"jieba_search" 
    }
  }
}

# A more flexible approach is to specify the analyzer in the match query itself
GET test_synonym_1/_search
{
  "explain": true,
  "query": {
    "match": {
      "title": {
        "query": "表格学生",
        "analyzer": "jieba_synonym"
      }
    }
  }
}
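
The mapping-level and query-level settings above interact through a fixed lookup order. The function below is a toy model of the order described in the Elasticsearch documentation (query `analyzer`, then the field's `search_analyzer`, then the index's `default_search` analyzer, then the field's `analyzer`, then `standard`). It is not Elasticsearch code, and the function name is hypothetical.

```python
def resolve_search_analyzer(query_analyzer=None, search_analyzer=None,
                            default_search=None, index_analyzer=None):
    """Return the analyzer applied to the query text, following the documented order."""
    for candidate in (query_analyzer, search_analyzer, default_search, index_analyzer):
        if candidate is not None:
            return candidate
    # Nothing configured anywhere: Elasticsearch falls back to the standard analyzer.
    return "standard"


# Mapping from above: index with jieba_search, search with jieba_synonym.
mapping_only = resolve_search_analyzer(search_analyzer="jieba_synonym",
                                       index_analyzer="jieba_search")

# A per-query analyzer overrides the mapping, e.g. to switch synonyms off for one request.
overridden = resolve_search_analyzer(query_analyzer="jieba_search",
                                     search_analyzer="jieba_synonym",
                                     index_analyzer="jieba_search")

# No search_analyzer in the mapping: the index-time analyzer is reused.
fallback = resolve_search_analyzer(index_analyzer="jieba_search")

print(mapping_only, overridden, fallback)
```

This is why the query-level `analyzer` is the more flexible of the two: it can turn synonym expansion on or off per request without touching the mapping.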

Finally, if the synonym filter itself supports a remote dictionary, then updates to the remote dictionary take effect automatically at search time.

# Search with an analyzer built on a synonym filter that reads a remote dictionary
PUT /test_synonym_1
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "jieba_synonym": {
            "tokenizer": "jieba_search",
            "filter": [
              "remote_synonym"
            ]
          }
        },
        "filter": {
          "remote_synonym": {
            "type": "dynamic_synonym",
            "synonyms_path": "http://localhost:8080/synonym.txt",
            "interval": "60"
          }
        }
      }
    }
  }
}

PUT test_synonym_1/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "jieba_search",
      "search_analyzer": "jieba_synonym"
    },
    "content" :{
      "type" :"text",
      "analyzer" :"jieba_search" 
    }
  }
}

Source: https://blog.csdn.net/goodluck_mh/article/details/115016615
