Implementing LIKE queries in Elasticsearch

Qiyu's Notes

2019-05-09

Problem

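An Elasticsearch query needs to behave like a MySQL LIKE query (substring matching). For example, a record whose value is hello中国233 should be found both by searching for 中国 and by searching for llo.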

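However, Elasticsearch full-text queries match against the tokens produced by analysis. By default, hello中国233 is tokenized into hello, 中, 国, and 233. A search for hello matches the record, but a search for llo does not, because llo is not one of the indexed tokens.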

Solution

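The tokens produced from the record are too coarse, so a token-based query cannot match part of a word. The solution is to tokenize the content character by character, so that hello中国233 becomes h, e, l, l, o, 中, 国, 2, 3, 3.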

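None of Elasticsearch's built-in analyzers produces this output out of the box, but a custom analyzer can: a character filter inserts a space after every character of the string, and the whitespace tokenizer then splits the string into single characters.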
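
The two analysis steps can be approximated outside Elasticsearch to see what they produce. The following is a minimal Python sketch, not Elasticsearch code: the function name char_analyze is made up for illustration, Python's re.sub stands in for the pattern_replace character filter, and str.split stands in for the whitespace tokenizer.

```python
import re


def char_analyze(text):
    """Approximate the custom analyzer: space out every character, then split."""
    # Stand-in for the pattern_replace char filter:
    # the lazy group (.+?) matches exactly one character at a time,
    # and each match is replaced by itself followed by a space.
    spaced = re.sub(r"(.+?)", r"\1 ", text)
    # Stand-in for the whitespace tokenizer: split on runs of whitespace.
    return spaced.split()


tokens = char_analyze("hello中国233")
print(tokens)
```

Because the split happens on whitespace, any whitespace already present in the input yields no token of its own in this sketch.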

Results

Default analysis

PUT /like_search
{
  "mappings": {
    "like_search_type": {
      "properties": {
        "name": {
          "type": "text"
        }
      }
    }
  }
}

PUT /like_search/like_search_type/1
{
  "name": "hello中国233"
}

Analysis result

GET /like_search/_analyze
{
  "text": [
    "hello中国233"
  ]
}

{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "中",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "233",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<NUM>",
      "position": 3
    }
  ]
}

Elasticsearch uses the standard analyzer by default, so the following query for llo does not find the hello中国233 record.

GET /like_search/_search
{
  "query": {
    "match_phrase": {
      "name": "llo"
    }
  }
}
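
The reason for the miss can be illustrated with a toy inverted index. This is a simplified Python sketch under stated assumptions: the token list is copied from the _analyze output above, build_inverted_index is a hypothetical helper, and a real Lucene index stores far more than a term-to-positions map.

```python
# Tokens copied from the _analyze output of the standard analyzer.
standard_tokens = ["hello", "中", "国", "233"]


def build_inverted_index(tokens):
    """Map each term to the list of positions at which it occurs."""
    index = {}
    for position, term in enumerate(tokens):
        index.setdefault(term, []).append(position)
    return index


index = build_inverted_index(standard_tokens)

# The whole word is an indexed term, so it can be looked up.
print(index.get("hello"))
# A fragment of the word was never indexed as a term, so the lookup finds nothing.
print(index.get("llo"))
```

Lookups work on whole terms only, which is why hello is found while llo is not.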

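Custom analyzer

The index is created again here with new settings. If like_search still exists from the previous step, delete it first with DELETE /like_search, because an index that already exists cannot be created a second time.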

PUT /like_search
{
  "settings": {
    "analysis": {
      "analyzer": {
        "char_analyzer": {
          "char_filter": [
            "split_by_whitespace_filter"
          ],
          "tokenizer": "whitespace"
        }
      },
      "char_filter": {
        "split_by_whitespace_filter": {
          "type": "pattern_replace",
          "pattern": "(.+?)",
          "replacement": "$1 "
        }
      }
    }
  },
  "mappings": {
    "like_search_type": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "char_analyzer"
        }
      }
    }
  }
}

PUT /like_search/like_search_type/1
{
  "name": "hello中国233"
}

Analysis result

GET /like_search/_analyze
{
  "analyzer": "char_analyzer",
  "text": [
    "hello中国233"
  ]
}

{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "e",
      "start_offset": 1,
      "end_offset": 1,
      "type": "word",
      "position": 1
    },
    {
      "token": "l",
      "start_offset": 2,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "l",
      "start_offset": 3,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "o",
      "start_offset": 4,
      "end_offset": 4,
      "type": "word",
      "position": 4
    },
    {
      "token": "中",
      "start_offset": 5,
      "end_offset": 5,
      "type": "word",
      "position": 5
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 6,
      "type": "word",
      "position": 6
    },
    {
      "token": "2",
      "start_offset": 7,
      "end_offset": 7,
      "type": "word",
      "position": 7
    },
    {
      "token": "3",
      "start_offset": 8,
      "end_offset": 8,
      "type": "word",
      "position": 8
    },
    {
      "token": "3",
      "start_offset": 9,
      "end_offset": 9,
      "type": "word",
      "position": 9
    }
  ]
}

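With the custom analyzer, the following query for llo finds the hello中国233 record. The query string is analyzed by the same char_analyzer, so llo becomes l, l, o, and match_phrase finds those three tokens at consecutive positions in the indexed value.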

GET /like_search/_search
{
  "query": {
    "match_phrase": {
      "name": "llo"
    }
  }
}
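
It may help to see why match_phrase is used here rather than match. The Python sketch below is a simplified model, not Elasticsearch's implementation: phrase_positions and any_term_matches are hypothetical helpers, the first approximating match_phrase (all tokens, consecutive and in order) and the second approximating a match query with its default OR operator (any token present).

```python
def phrase_positions(doc_tokens, query_tokens):
    """Return the start positions where query_tokens occur consecutively."""
    n = len(query_tokens)
    return [
        i
        for i in range(len(doc_tokens) - n + 1)
        if doc_tokens[i:i + n] == query_tokens
    ]


def any_term_matches(doc_tokens, query_tokens):
    """Return True if at least one query token occurs in the document."""
    return any(term in doc_tokens for term in query_tokens)


# Per-character tokens, as produced by the custom analyzer.
doc = list("hello中国233")

print(phrase_positions(doc, list("llo")))   # consecutive l, l, o
print(phrase_positions(doc, list("中国")))   # consecutive 中, 国
print(phrase_positions(doc, list("olh")))   # same characters, wrong order
print(any_term_matches(doc, list("olh")))   # OR semantics ignores order
```

In this model, a phrase match behaves like a substring test, while matching on any single character would return almost every record, which is not what a LIKE query should do.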