Implementing search suggestions in Elasticsearch with ngram tokenization

Qiyu's Notes

2018-10-31

Reposted from Jianshu. Original article: Implementing search suggestions in Elasticsearch with ngram tokenization

1. What is an ngram

Take the English word quick as an example. Its ngrams at each of the five possible lengths are:

ngram length=1: q u i c k
ngram length=2: qu ui ic ck
ngram length=3: qui uic ick
ngram length=4: quic uick
ngram length=5: quick
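
The listing above can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name ngrams is made up for this example and is not part of Elasticsearch.

```python
def ngrams(word, n):
    """Return every contiguous substring of `word` that is exactly `n` characters long."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Print the ngrams of "quick" for lengths 1 through 5.
for n in range(1, 6):
    print("ngram length=%d:" % n, " ".join(ngrams("quick", n)))
```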

2. What is an edge ngram

For the word quick, anchor on the first letter and generate only the ngrams that start there:

q
qu
qui
quic
quick

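With edge ngrams, every word is split further into its prefixes at index time, and those prefixes are what make prefix-based search suggestions possible.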
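
As a rough sketch, an edge ngram is just a prefix of the word, so generating them takes one line of Python. The helper name edge_ngrams is hypothetical and only illustrates the idea.

```python
def edge_ngrams(word):
    """Return all prefixes of `word`, shortest first (ngrams anchored at the first letter)."""
    return [word[:n] for n in range(1, len(word) + 1)]


print(edge_ngrams("quick"))
```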

doc1: hello world
doc2: hello we

h       doc1,doc2
he      doc1,doc2
hel     doc1,doc2
hell    doc1,doc2
hello   doc1,doc2

w       doc1,doc2
wo      doc1
wor     doc1
worl    doc1
world   doc1
we      doc2

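For example, search for hello w.

Both doc1 and doc2 contain the terms hello and w, and the positions match as well, so doc1 and doc2 are returned.

At search time there is no longer any need to take a prefix and scan the whole inverted index for terms that start with it. The prefix is simply looked up in the inverted index as an ordinary term, and if it is found, the work is done.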
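
The lookup described here can be sketched with a toy inverted index in Python. This is a simplified model, not how Elasticsearch stores its index: it splits on whitespace, ignores term positions, and the names index and lookup are made up for illustration.

```python
from collections import defaultdict

docs = {"doc1": "hello world", "doc2": "hello we"}

# Index time: store every prefix (edge ngram) of every word as its own term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        for n in range(1, len(word) + 1):
            index[word[:n]].add(doc_id)


def lookup(query):
    """Return the documents that contain every query term as an indexed term."""
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


# Search time: each query term is a plain dictionary lookup, no prefix scan needed.
print(sorted(lookup("hello w")))
print(sorted(lookup("hello wo")))
```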

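3. Minimum and maximum parameters

min ngram = 1
max ngram = 3

These set the shortest and longest ngram to generate (here, at least 1 character and at most 3).

Take the word helloworld as an example.

Its edge ngrams are then: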

h
he
hel

Generation stops once the maximum length of three characters is reached.
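
A minimal Python sketch of the same bounds follows. The parameter names min_gram and max_gram mirror the Elasticsearch filter settings, but the function itself is only an illustration.

```python
def edge_ngrams(word, min_gram, max_gram):
    """Return the prefixes of `word` whose length lies between min_gram and max_gram."""
    longest = min(max_gram, len(word))  # a word shorter than max_gram simply yields fewer ngrams
    return [word[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("helloworld", 1, 3))
```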

4. Trying out ngram

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter" : {
          "type" : "edge_ngram",
          "min_gram" : 1,
          "max_gram" : 20
        }
      },
      "analyzer": {
        "autocomplete" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}

PUT /my_index/_mapping/my_type
{
  "properties": {
      "title": {
          "type":     "string",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
      }
  }
}

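Note: why is search_analyzer set to standard rather than autocomplete?

Because at search time there is no need to split every word into its prefixes again. For a search such as hello w, splitting the query into hello and w is enough. There is no need to turn it into the following: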

h
he
hel
hell
hello   

w

Doing that would only make the search less efficient.
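
To make the difference concrete, the two analyzers can be approximated in Python. This is a rough stand-in under stated assumptions: the standard analyzer is modelled as lowercasing plus whitespace splitting, which is much simpler than the real tokenizer, and both function names are invented for this sketch.

```python
def standard(text):
    # Assumption: lowercase + whitespace split approximates the standard analyzer.
    return text.lower().split()


def autocomplete(text, min_gram=1, max_gram=20):
    # Same tokens as above, each expanded into its edge ngrams (as at index time).
    tokens = []
    for word in standard(text):
        longest = min(max_gram, len(word))
        tokens.extend(word[:n] for n in range(min_gram, longest + 1))
    return tokens


query = "hello w"
print(standard(query))      # two terms to look up
print(autocomplete(query))  # six terms to look up for the same query
```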

Insert four documents

PUT /my_index/my_type/1
{
  "title" : "hello world"
}

PUT /my_index/my_type/2
{
  "title" : "hello we"
}

PUT /my_index/my_type/3
{
  "title" : "hello win"
}

PUT /my_index/my_type/4
{
  "title" : "hello dog"
}

Run the search

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}

Result

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.1983768,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1.1983768,
        "_source": {
          "title": "hello we"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.8271048,
        "_source": {
          "title": "hello world"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.797104,
        "_source": {
          "title": "hello win"
        }
      }
    ]
  }
}

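A match_phrase query matches a phrase: every query term must appear in the field, in the same order and at consecutive positions. None of the titles contains the word w, so why does hello w match three documents?

Because the mapping assigns the autocomplete analyzer to title, each word was split into edge ngrams at index time, so w is indexed as a term at the second position of documents 1, 2 and 3. At search time the standard analyzer splits the query into just the terms hello and w, which are matched directly against those indexed terms, and that is what makes the lookup so efficient.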
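
The whole experiment can be mimicked with a small positional index in Python. This is a simplified model for illustration only, not how Elasticsearch or Lucene implement match_phrase: tokens come from a whitespace split, every edge ngram of a word shares that word's position, and the names index and match_phrase are local to this sketch.

```python
from collections import defaultdict

docs = {1: "hello world", 2: "hello we", 3: "hello win", 4: "hello dog"}

# term -> {doc_id -> set of word positions}; all prefixes of a word share its position.
index = defaultdict(lambda: defaultdict(set))
for doc_id, title in docs.items():
    for position, word in enumerate(title.lower().split()):
        for n in range(1, min(20, len(word)) + 1):
            index[word[:n]][doc_id].add(position)


def match_phrase(query):
    """Return ids of documents where the query terms occur at consecutive positions."""
    terms = query.lower().split()  # search side: no ngram expansion
    if not terms or any(term not in index for term in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(start + offset in index[term].get(doc_id, ())
                   for offset, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


print(match_phrase("hello w"))
```

Document 4 is left out because none of its words produces the term w, which mirrors the three hits in the result above.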