[Feature][Elasticsearch] Why do documents with identical content get different scores from a match query?

2021-04-14 19:05:21


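Conclusion

Because the score is computed per shard, not across the whole index.

The tf/idf-style scoring used here (the explain output in step 5 shows the BM25 form, with parameters k1 and b) needs docFreq and docCount.
Both statistics are collected per shard, so documents on different shards rarely see the same values, and their final scores differ as a result.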
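To make the per-shard statistics concrete, here is a minimal Python sketch, not Elasticsearch code. It assumes the layout observed later in the test (one copy of the document on one shard, two copies on the other) and uses a hypothetical analyze() helper that emits one token per character, which is only a stand-in for the real analyzer on this short title.

```python
def analyze(text):
    # Hypothetical stand-in for the index analyzer: one token per character.
    # This is an assumption that only suits the short CJK title used here.
    return list(text)


def term_stats(docs, term):
    """Return (docFreq, docCount) for `term` over a list of field values."""
    doc_count = len(docs)  # documents that have the field
    doc_freq = sum(1 for text in docs if term in analyze(text))  # documents containing the term
    return doc_freq, doc_count


# Assumed layout: shard 0 holds one document, shard 1 holds two.
shards = {
    0: ["大学语文"],
    1: ["大学语文", "大学语文"],
}

# Statistics as each shard sees them on its own.
per_shard = {sid: term_stats(docs, "语") for sid, docs in shards.items()}

# Statistics if they were gathered over the whole index instead.
all_docs = [text for docs in shards.values() for text in docs]
whole_index = term_stats(all_docs, "语")

print(per_shard)
print(whole_index)
```

The three documents are identical, yet each shard feeds different (docFreq, docCount) pairs into the idf formula, and neither pair equals the index-wide one.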

Test steps

1. Create the index with a custom mapping

PUT luzhe
{
  "settings": {
    "index": {
      "number_of_shards": 2
    },
    "analysis": {
      "filter": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": "1",
          "max_gram": "20"
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "filter": [
            "lowercase",
            "my_ngram"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "analyzer": "ngram_analyzer",
          "term_vector": "with_positions_offsets",
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 50,
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

Notes:

  • index.number_of_shards = 2 is there so that the shards end up holding different numbers of documents as quickly as possible.

2. Insert documents (three are basically enough)

PUT luzhe/_doc/1
{
  "title": "大学语文"
}

PUT luzhe/_doc/2
{
  "title": "大学语文"
}

PUT luzhe/_doc/3
{
  "title": "大学语文"
}

3. Check the document count on each shard

GET luzhe/_doc/_search?preference=_shards:0
GET luzhe/_doc/_search?preference=_shards:1

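Remark: in my test environment, shard 0 happened to hold one document and shard 1 held two.
Any layout works as long as both shards hold at least one document and the two counts differ.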
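Why three documents are usually enough can be checked by counting the possible layouts. The sketch below is a simplification: it treats every placement of a document on a shard as equally possible, whereas the real shard is determined by hashing the document id, so a given set of ids always lands the same way.

```python
from itertools import product

NUM_SHARDS = 2
NUM_DOCS = 3

usable = 0  # layouts where both shards are non-empty and the counts differ
total = 0   # all possible layouts

# Each assignment is a tuple such as (0, 1, 1): the shard chosen for each document.
for assignment in product(range(NUM_SHARDS), repeat=NUM_DOCS):
    counts = [assignment.count(shard) for shard in range(NUM_SHARDS)]
    total += 1
    if all(c > 0 for c in counts) and len(set(counts)) > 1:
        usable += 1

print(usable, "of", total, "layouts are usable")
```

With an odd number of documents on two shards the counts can never be equal, so the only unusable layouts are the two where every document lands on the same shard.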

4. Run the query

GET luzhe/_search
{
  "query": {
    "match": {
      "title": "语文"
    }
  }
}

Output:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "luzhe",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.5753642,
        "_source" : {
          "title" : "大学语文"
        }
      },
      {
        "_index" : "luzhe",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.36464313,
        "_source" : {
          "title" : "大学语文"
        }
      },
      {
        "_index" : "luzhe",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.36464313,
        "_source" : {
          "title" : "大学语文"
        }
      }
    ]
  }
}

5. Explain the score

GET luzhe/_doc/3/_explain
{
  "query": {
    "match": {
      "title": "语文"
    }
  }
}

Output:

{
  "_index" : "luzhe",
  "_type" : "_doc",
  "_id" : "3",
  "matched" : true,
  "explanation" : {
    "value" : 0.5753642,
    "description" : "sum of:",
    "details" : [
      {
        "value" : 0.2876821,
        "description" : "weight(title:语 in 0) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 0.2876821,
            "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
            "details" : [
              {
                "value" : 0.2876821,
                "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "docFreq",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.0,
                    "description" : "docCount",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 1.0,
                "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "termFreq=1.0",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "parameter k1",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "parameter b",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.0,
                    "description" : "avgFieldLength",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.0,
                    "description" : "fieldLength",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value" : 0.2876821,
        "description" : "weight(title:文 in 0) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 0.2876821,
            "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
            "details" : [
              {
                "value" : 0.2876821,
                "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "docFreq",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.0,
                    "description" : "docCount",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 1.0,
                "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "termFreq=1.0",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "parameter k1",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "parameter b",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.0,
                    "description" : "avgFieldLength",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.0,
                    "description" : "fieldLength",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

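Running the same explain for the other ids shows different docFreq and docCount values, because those documents live on the other shard.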
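The two scores from step 4 can be reproduced by hand from the formulas printed in the explain output. The sketch below is plain Python arithmetic, not an Elasticsearch call. The values docFreq = 1 and docCount = 1 for id 3 come from the explain output above; docFreq = 2 and docCount = 2 for ids 1 and 2 are an assumption inferred from the remark in step 3 that the other shard holds two documents, since their explain output is not shown.

```python
import math


def idf(doc_freq, doc_count):
    # idf as printed in the explain output:
    # log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    # tfNorm as printed in the explain output.
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


def match_score(num_terms, doc_freq, doc_count):
    # The query "语文" is analyzed into two terms; each contributes idf * tfNorm.
    # freq = 1 and fieldLength = avgFieldLength = 4 are taken from the explain
    # output, and are assumed to hold on both shards because the documents
    # are identical.
    per_term = idf(doc_freq, doc_count) * tf_norm(1.0, 4.0, 4.0)
    return num_terms * per_term


score_doc3 = match_score(2, doc_freq=1, doc_count=1)    # shard with one document
score_doc1_2 = match_score(2, doc_freq=2, doc_count=2)  # assumed: shard with two documents

print(score_doc3)    # compare with 0.5753642 for id 3
print(score_doc1_2)  # compare with 0.36464313 for ids 1 and 2
```

Both results agree with the search output to within single-precision rounding, and tfNorm is exactly 1.0 in both cases, so the whole gap between the two scores comes from the per-shard idf.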

Source: https://www.cnblogs.com/luzhe/p/14659381.html