Elasticsearch shows match with special character with only .raw


I started working with Elasticsearch a few days back, created some analyzers and mappings, and successfully inserted some data. The problem occurs when I try to query data that contains special characters. Initially I was using the standard analyzer, but after reading about the other options I settled on whitespace, because it keeps special characters as tokens. However, I still cannot query the data. If I instead query field.raw (where field is the actual property of the object), I get the results I need. But .raw bypasses the analyzers, and I wonder whether that defeats the purpose of it all. Since whitespace didn't work for me, I reverted to the standard one.
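
For reference, here's the difference I mean, using a sample value (this snippet is just an illustration, not part of my actual setup): the whitespace tokenizer keeps the - as a token of its own, while the standard tokenizer drops it entirely.

POST _analyze
{
  "tokenizer": "whitespace",
  "text": ["is - application"]
}

This returns the tokens is, -, and application; swapping in "tokenizer": "standard" drops the -.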

Here's the analyzer I built:

PUT demoindex
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "splcharfilter": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([?/-])"
          ]
        }
      },
      "analyzer": {
        "my_field_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ngram",
            "splcharfilter"
          ]
        }
      }
    }
  }
}

Here's the mapping I built:

PUT demoindex/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "my_field_analyzer",
      "search_analyzer": "simple",
      "fields": {
        "raw": {
          "type": "keyword"
        }
      }
    },
    "area": {
      "type": "text",
      "analyzer": "my_field_analyzer",
      "search_analyzer": "simple",
      "fields": {
        "raw": {
          "type": "keyword"
        }
      }
    }
  }
}

Query that doesn't work:

GET /demoindex/_search?pretty
{
  "from": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "area": {
              "value": "is - application"
            }
          }
        },
        {
          "term": {
            "name": {
              "value": "hem"
            }
          }
        }
      ]
    }
  },
  "size": 15
}

Query that WORKS:

GET /demoindex/_search?pretty
{
  "from": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "area.raw": {
              "value": "is - application"
            }
          }
        },
        {
          "term": {
            "name": {
              "value": "hem"
            }
          }
        }
      ]
    }
  },
  "size": 15
}

As you can see, I had to use area.raw for it to match the content and return the document. Since name shouldn't contain any special characters, it's fine without .raw, but the other fields will contain special characters, which may not be limited to -.

So, could someone please point out what I've done wrong or what I'm misinterpreting? Or is there a better way to achieve this?

P.S.: Version info

Elasticsearch: 7.10.1

Lucene: 8.7.0

Answer:

  1. keyword fields are NOT analyzed.
  2. text fields are analyzed.

To check how a field is analyzed and which tokens are generated, you can use Elasticsearch's Analyze API.

In your case:

POST demoindex/_analyze
{
  "text": ["is - application"],
  "field": "area"
}

and it will output (token offsets, positions, and types omitted for brevity):

{
  "tokens" : [
    {
      "token" : "i"
    },
    {
      "token" : "is"
    },
    {
      "token" : "a"
    },
    {
      "token" : "ap"
    },
    {
      "token" : "app"
    },
    {
      "token" : "appl"
    },
    {
      "token" : "appli"
    },
    {
      "token" : "applic"
    },
    {
      "token" : "applica"
    },
    {
      "token" : "applicat"
    },
    {
      "token" : "applicati"
    },
    {
      "token" : "applicatio"
    },
    {
      "token" : "application"
    }
  ]
}

So when you search area.raw:"is - application": since area.raw is of type keyword, the value was indexed as-is, and hence the term query below works.
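
You can verify this with the same Analyze API call against the raw sub-field; since keyword fields are not tokenized, it returns the whole value as a single token:

POST demoindex/_analyze
{
  "text": ["is - application"],
  "field": "area.raw"
}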

Term queries are meant for exact matching and should be used with fields that are not analyzed, like area.raw, which is a keyword in your case:

GET /demoindex/_search?pretty
{
  "from": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "area.raw": {
              "value": "is - application"
            }
          }
        }
      ]
    }
  },
  "size": 15
}

But when you apply the same term query to the text field, it doesn't work: the term query tries to match the provided value exactly, while, as we saw above, the area value has been tokenized.

So, as the Elasticsearch documentation suggests, it's better to use a match query for text (analyzed) fields. The query below will produce the same result:

GET /demoindex/_search?pretty
{
  "from": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "area": {
              "query": "is - application"
            }
          }
        }
      ]
    }
  },
  "size": 15
}
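
This works because your mapping sets search_analyzer to simple on these fields, so the match query runs the search string through the simple analyzer (which lowercases and splits on non-letter characters) before matching. As a quick check:

POST demoindex/_analyze
{
  "analyzer": "simple",
  "text": ["is - application"]
}

This yields the tokens is and application, both of which are present in the indexed token list shown above, so the document matches.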