Implementing LIKE queries in Elasticsearch

Problem

An Elasticsearch query needs to behave like MySQL's LIKE. For a record whose value is hello中国233, searching for 中国 should return the record, and so should searching for llo.

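However, Elasticsearch full-text queries match against analyzed tokens. By default, hello中国233 is tokenized into hello, 中, 国, and 233. Searching for hello matches the record, but searching for llo does not.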

Solution

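The tokens produced from the record are too coarse, so a query for only part of a token finds nothing. The fix is to tokenize the content one character at a time, so that hello中国233 becomes h, e, l, l, o, 中, 国, 2, 3, 3.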

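Elasticsearch has no built-in analyzer that does this out of the box, but a custom analyzer can: a character filter adds a space after every character, and the whitespace tokenizer then splits the string into individual characters.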
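The two-step idea can be tried outside Elasticsearch first. The following is a minimal Python sketch, not the Elasticsearch implementation: it assumes that Python's `re.sub` with the pattern `(.+?)` behaves like the `pattern_replace` character filter for this input, and that `str.split()` stands in for the whitespace tokenizer. The function name `char_analyze` is made up for illustration.

```python
import re
from typing import List


def char_analyze(text: str) -> List[str]:
    """Approximate the char filter + whitespace tokenizer pipeline."""
    # Character filter step: the lazy pattern (.+?) matches one character
    # at a time, and the replacement appends a space after each match.
    spaced = re.sub(r"(.+?)", r"\1 ", text)
    # Tokenizer step: split on runs of whitespace, dropping empty pieces.
    return spaced.split()


print(char_analyze("hello中国233"))
# ['h', 'e', 'l', 'l', 'o', '中', '国', '2', '3', '3']
```

One consequence visible in the sketch: whitespace in the original text is consumed by the split, so spaces themselves never become tokens.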

Results

Default analyzer

PUT /like_search
{
  "mappings": {
    "like_search_type": {
      "properties": {
        "name": {
          "type": "text"
        }
      }
    }
  }
}

PUT /like_search/like_search_type/1
{
  "name": "hello中国233"
}

Analysis result

GET /like_search/_analyze
{
  "text": [
    "hello中国233"
  ]
}

{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "中",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "233",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<NUM>",
      "position": 3
    }
  ]
}

Elasticsearch uses the standard analyzer by default, so the following query for llo does not find the hello中国233 record.

GET /like_search/_search
{
  "query": {
    "match_phrase": {
      "name": "llo"
    }
  }
}

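Custom analyzer

The index is re-created here with the custom analyzer. If the like_search index from the previous step still exists, delete it first (DELETE /like_search), because creating an index that already exists fails.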

PUT /like_search
{
  "settings": {
    "analysis": {
      "analyzer": {
        "char_analyzer": {
          "char_filter": [
            "split_by_whitespace_filter"
          ],
          "tokenizer": "whitespace"
        }
      },
      "char_filter": {
        "split_by_whitespace_filter": {
          "type": "pattern_replace",
          "pattern": "(.+?)",
          "replacement": "$1 "
        }
      }
    }
  },
  "mappings": {
    "like_search_type": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "char_analyzer"
        }
      }
    }
  }
}

PUT /like_search/like_search_type/1
{
  "name": "hello中国233"
}

PUT /like_search/like_search_type/1
{
"name": "hello中国233"
}

Analysis result

GET /like_search/_analyze
{
  "analyzer": "char_analyzer",
  "text": [
    "hello中国233"
  ]
}

{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "e",
      "start_offset": 1,
      "end_offset": 1,
      "type": "word",
      "position": 1
    },
    {
      "token": "l",
      "start_offset": 2,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "l",
      "start_offset": 3,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "o",
      "start_offset": 4,
      "end_offset": 4,
      "type": "word",
      "position": 4
    },
    {
      "token": "中",
      "start_offset": 5,
      "end_offset": 5,
      "type": "word",
      "position": 5
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 6,
      "type": "word",
      "position": 6
    },
    {
      "token": "2",
      "start_offset": 7,
      "end_offset": 7,
      "type": "word",
      "position": 7
    },
    {
      "token": "3",
      "start_offset": 8,
      "end_offset": 8,
      "type": "word",
      "position": 8
    },
    {
      "token": "3",
      "start_offset": 9,
      "end_offset": 9,
      "type": "word",
      "position": 9
    }
  ]
}

With the custom analyzer, the following query for llo finds the hello中国233 record.

GET /like_search/_search
{
  "query": {
    "match_phrase": {
      "name": "llo"
    }
  }
}
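Why match_phrase succeeds with character tokens but not with the standard tokens can be illustrated with a simplified model. The Python sketch below is an assumption-laden approximation, not how Elasticsearch evaluates the query: it treats a phrase match as "the query tokens appear as a consecutive run in the document tokens" (slop 0), and the helper names `char_tokens` and `phrase_match` are hypothetical.

```python
import re
from typing import List


def char_tokens(text: str) -> List[str]:
    """Approximate char_analyzer: one token per non-whitespace character."""
    return re.sub(r"(.+?)", r"\1 ", text).split()


def phrase_match(doc_tokens: List[str], query_tokens: List[str]) -> bool:
    """Simplified phrase match: query tokens must appear consecutively."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


# Tokens copied from the standard analyzer's _analyze output.
standard_tokens = ["hello", "中", "国", "233"]

# Standard analyzer: the query "llo" is the single token "llo",
# which is not equal to any indexed token.
print(phrase_match(standard_tokens, ["llo"]))                           # False

# Character tokens: l, l, o occur at consecutive positions.
print(phrase_match(char_tokens("hello中国233"), char_tokens("llo")))    # True
print(phrase_match(char_tokens("hello中国233"), char_tokens("中国")))   # True

# Same characters in a different order do not match.
print(phrase_match(char_tokens("hello中国233"), char_tokens("olleh")))  # False
```

The last case shows why the query type matters here: with one token per character, a plain match query (which by default matches if any token matches) would return every record sharing a single character with the query, whereas match_phrase also requires the characters to be adjacent and in order.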