I'm having trouble matching part of a string in Elasticsearch. Below is the index configuration:
PUT exemplo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "portuguese_br": {
          "type": "portuguese"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "content": {
        "type": "text",
        "analyzer": "portuguese_br"
      }
    }
  }
}
There is a document with the following content in the index "exemplo":
h2 style margin 0 0 8px font size 16px color 064a7a 1 Síntese Resumo Descrição do cliente h2 div id headertipodocumento1 style min height 40px position relative class editable mce content body contenteditable true spellcheck false p eu encaminho uma Carta ao ReI
I can't get the document with the following request:
GET exemplo/_search
{
  "from": 0,
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {"regexp": {"content": ".*caminho.*"}}
      ]
    }
  }
}
The content contains the word "encaminho". I'm searching for "caminho" and not getting any results.
Am I doing something wrong in the regexp?
CodePudding user response:
The token generated in the field content for the term "encaminho" is "encaminh". When you search with the pattern ".*caminho.*", there is no match, because the indexed token no longer ends in "o".
If you try {"regexp": {"content": ".*caminh.*"}} you get the document.
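As a full request, the stemmed pattern might look like the sketch below. It assumes the index and field names from the question. Keep in mind that a pattern starting with .* has to be checked against many indexed terms and can be slow on a large index.

```json
GET exemplo/_search
{
  "query": {
    "regexp": {
      "content": ".*caminh.*"
    }
  }
}
```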
Another option is fuzziness, as in this query:
GET exemplo/_search
{
  "query": {
    "match": {
      "content": {
        "query": "caminho",
        "fuzziness": "AUTO"
      }
    }
  }
}
That way you will also get results.
CodePudding user response:
To understand how Elasticsearch analyzes your text, you can use the _analyze API:
GET exemplo/_analyze
{
  "text": ["h2 style margin 0 0 8px font size 16px color 064a7a 1 Síntese Resumo Descrição do cliente h2 div id headertipodocumento1 style min height 40px position relative class editable mce content body contenteditable true spellcheck false p eu encaminho uma Carta ao ReI"],
  "analyzer": "portuguese_br"
}
The output for the "encaminho" part will look like the following:
{
  "token" : "encaminh",
  "start_offset" : 236,
  "end_offset" : 245,
  "type" : "<ALPHANUM>",
  "position" : 38
},
{
  "token" : "cart",
  "start_offset" : 250,
  "end_offset" : 255,
  "type" : "<ALPHANUM>",
  "position" : 40
},
{
  "token" : "rei",
  "start_offset" : 259,
  "end_offset" : 262,
  "type" : "<ALPHANUM>",
  "position" : 42
}
After analysis, the text "encaminho" is transformed to "encaminh", so a search for "caminho" does not match it. What can you do?
- You can search the same way you indexed (see the additional notes)
- You can add an ngram filter to your existing analysis setup
- You can use fuzzy queries at search time
Additional notes: Analysis is performed at index time. The pattern in a regexp (or wildcard) query, however, is not analyzed at search time; it is compared as-is against the indexed tokens. A match or multi_match query runs the search text through the same analyzer, so your queries will match. Match queries are also generally faster than wildcard or regexp queries.
GET exemplo/_search
{
  "from": 0,
  "query": {
    "match": {
      "content": "encaminho"
    }
  }
}
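To see what a match query would actually look up, you can run the search term through the same analyzer and compare the resulting token with the indexed token "encaminh". This is a small sketch that reuses the index and analyzer names from the question.

```json
GET exemplo/_analyze
{
  "analyzer": "portuguese_br",
  "text": "caminho"
}
```

If the token returned here differs from the one stored for the document, a plain match query on that term will not find it, and fuzziness or an ngram field is needed.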