Elasticsearch Stop Token Filter Not Working

Time:12-16

I've created an index in Elasticsearch 7.10 that looks something like this:

{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase",
            "stemmer",
            "stop"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}

As you can see, I've configured a custom analyzer called my_analyzer with the stop token filter applied to it. Based on the documentation, I would expect this filter to remove English-language stopwords from all text-type fields of the document at index time.
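For reference, I'm relying on the filter's default behavior here; per the docs, stop with no stopwords parameter uses the built-in _english_ list. As I understand it, an equivalent explicit configuration (the filter name my_stop and the word list are just for illustration) would look something like this:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["if", "a", "not", "that"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stemmer", "my_stop"]
        }
      }
    }
  }
}
```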

Indeed, if I send a POST request to http://localhost:30200/my_index/_analyze with this request body:

{
  "analyzer": "my_analyzer",
  "text": "If you are a horse, I do not want that cake"
}

I get a response that demonstrates that the tokens if, a, not, and that were removed from the supplied text:

{
    "tokens": [
        {
            "token": "you",
            "start_offset": 3,
            "end_offset": 6,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "ar",
            "start_offset": 7,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "hors",
            "start_offset": 13,
            "end_offset": 18,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "i",
            "start_offset": 20,
            "end_offset": 21,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "do",
            "start_offset": 22,
            "end_offset": 24,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "want",
            "start_offset": 29,
            "end_offset": 33,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "cake",
            "start_offset": 39,
            "end_offset": 43,
            "type": "<ALPHANUM>",
            "position": 10
        }
    ]
}

However, if I index a document whose description attribute contains the string "If you are a horse, I do not want that cake", and then query the index by making a GET request to http://localhost:30200/my_index/_search with this request body:

{
  "query": {
    "multi_match" : {
      "query": "that", 
      "fields": ["description"]
    }
  }
}

The document is returned, even though the word "that" was supposed to have been removed by the analyzer:

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "27ibobulhqhc7s96jbz6653ud",
                "_score": 0.2876821,
                "_source": {
                    "id": "27ibobulhqhc7s96jbz6653ud",
                    "title": "muscular yak",
                    "description": "If you are a horse, I do not want that cake"
                }
            }
        ]
    }
}

So what gives? If the stop filter is stripping English-language stopwords from indexed text fields, I would expect a query for one of those stopwords to return zero results. Do I have to explicitly tell Elasticsearch to use my_analyzer when indexing documents or when processing queries?

For what it's worth, the other filters that I have configured (lowercase and stemmer) appear to work as expected. It's just stop that is giving me trouble.

CodePudding user response:

You are almost there. You just need to map your description field with the custom analyzer you created, as shown below. This ensures that the content of the description field is analyzed with my_analyzer at index time as well as at search time (a text field's search analyzer defaults to its index analyzer). Note that defining an analyzer in the settings does nothing by itself; it has to be referenced by a field mapping. Also, the analyzer of an existing field can't be changed in place, so you'll need to recreate the index with this mapping (or reindex into a new index) and index your documents again.

{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text",
        "analyzer": "my_analyzer"          // note this
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase",
            "stemmer",
            "stop"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
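Once the index has been recreated with this mapping and the document reindexed, you can confirm that the field picks up the custom analyzer by sending a POST request to http://localhost:30200/my_index/_analyze with a body that names the field instead of the analyzer:

```json
{
  "field": "description",
  "text": "If you are a horse, I do not want that cake"
}
```

The returned tokens should match what you saw earlier, and your multi_match query for "that" should then return no hits. As a side note, if you want this analysis applied to every text field (title included) without mapping each one individually, naming the analyzer `default` in the analysis settings makes it the index-wide default for text fields.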