Problem matching part of a string in Elasticsearch


I'm having a problem matching part of a string in Elasticsearch. Below is the index configuration.

PUT exemplo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "portuguese_br": {
          "type": "portuguese"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "content": {
        "type": "text",
        "analyzer": "portuguese_br"
      }
    }
  }
}

There is a document with the following content in the index "exemplo":

h2 style margin 0 0 8px font size 16px color 064a7a 1 Síntese Resumo Descrição do cliente h2 div id headertipodocumento1 style min height 40px position relative class editable mce content body contenteditable true spellcheck false p eu encaminho uma Carta ao ReI

I can't get the document with the following request:

GET exemplo/_search
{
  "from": 0,
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {"regexp": {"content": ".*caminho.*"}}
      ]
    }
  }
}

The content contains the word "encaminho". I'm searching for "caminho" and not getting any results.

Am I doing something wrong in regexp?

CodePudding user response:

The token generated in the field "content" for the term "encaminho" is "encaminh". When you search with ".*caminho.*" there is no match.

If you try {"regexp": {"content": ".*caminh.*"}} you get the document.
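As a complete request, the corrected pattern would look like this (a sketch against the "exemplo" index from the question):

```json
GET exemplo/_search
{
  "query": {
    "regexp": {
      "content": ".*caminh.*"
    }
  }
}
```

Note that the regexp runs against the indexed tokens, so the pattern has to match the stemmed token "encaminh", not the original word.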

Another option is fuzziness, like this query:

{
  "match": {
    "content": {
      "query": "caminho",
      "fuzziness": "AUTO"
    }
  }
}

That way you will also get results.
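Putting the clause into a complete search request (a sketch, using the same index name as the question):

```json
GET exemplo/_search
{
  "query": {
    "match": {
      "content": {
        "query": "caminho",
        "fuzziness": "AUTO"
      }
    }
  }
}
```

This works because a match query analyzes the query text first: "caminho" is stemmed to "caminh", which is within edit distance 2 of the indexed token "encaminh", inside the range that "AUTO" allows for terms of this length.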

CodePudding user response:

To understand how Elasticsearch analyzes your text, you can use the following API.

GET exemplo/_analyze
{
  "text": ["h2 style margin 0 0 8px font size 16px color 064a7a 1 Síntese Resumo Descrição do cliente h2 div id headertipodocumento1 style min height 40px position relative class editable mce content body contenteditable true spellcheck false p eu encaminho uma Carta ao ReI"],
  "analyzer": "portuguese_br"
}

The output for the "encaminho" part will look like the following:

{
  "token" : "encaminh",
  "start_offset" : 236,
  "end_offset" : 245,
  "type" : "<ALPHANUM>",
  "position" : 38
},
{
  "token" : "cart",
  "start_offset" : 250,
  "end_offset" : 255,
  "type" : "<ALPHANUM>",
  "position" : 40
},
{
  "token" : "rei",
  "start_offset" : 259,
  "end_offset" : 262,
  "type" : "<ALPHANUM>",
  "position" : 42
}

After analysis, the text "encaminho" is transformed to "encaminh", so when you search for "caminho" it does not match "encaminh". What can you do?

  1. You can search with the same terms as they were indexed (see additional notes)
  2. You can add an ngram analyzer feature into your existing analyzer
  3. You can use fuzziness queries during the search
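For option 2, here is a minimal sketch of index settings with an ngram token filter, which indexes substrings so that partial words like "caminho" can match. The index, analyzer, and filter names here are illustrative, not from the original question:

```json
PUT exemplo_ngram
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_3_4": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "portuguese_ngram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "ngram_3_4"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "portuguese_ngram"
      }
    }
  }
}
```

Be aware that ngrams increase index size considerably, and `max_gram - min_gram` is limited by the `index.max_ngram_diff` setting (default 1).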

Additional notes: analysis is performed at index time, but your search term is not analyzed at query time because you are using a regexp (wildcard-style) query, which runs against the raw terms in the index. If you use a match or multi_match query instead, the query text is analyzed with the same analyzer and your queries will match. Also, match queries are faster than wildcard queries.

GET exemplo/_search
{
  "from": 0,
  "query": {
    "match": {
      "content": "encaminho"
    }
  }
}