I've created an index in Elasticsearch 7.10 that looks something like this:
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase",
            "stemmer",
            "stop"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
As you can see, I've configured a custom analyzer called my_analyzer that has the stop token filter applied to it. Based on the documentation, I would expect this filter to remove English-language stopwords from all text-type attributes of the document at index time.
Indeed, if I send a POST request to http://localhost:30200/my_index/_analyze with this request body:
{
  "analyzer": "my_analyzer",
  "text": "If you are a horse, I do not want that cake"
}
I get a response that demonstrates that the tokens if, a, not, and that were removed from the supplied text:
{
  "tokens": [
    {
      "token": "you",
      "start_offset": 3,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "ar",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "hors",
      "start_offset": 13,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "i",
      "start_offset": 20,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "do",
      "start_offset": 22,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "want",
      "start_offset": 29,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "cake",
      "start_offset": 39,
      "end_offset": 43,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}
However, if I index a document whose description attribute contains the string "If you are a horse, I do not want that cake", and then query the index by making a GET request to http://localhost:30200/my_index/_search with this request body:
{
  "query": {
    "multi_match": {
      "query": "that",
      "fields": ["description"]
    }
  }
}
The document is returned, even though the word "that" was supposed to have been removed by the analyzer:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "27ibobulhqhc7s96jbz6653ud",
        "_score": 0.2876821,
        "_source": {
          "id": "27ibobulhqhc7s96jbz6653ud",
          "title": "muscular yak",
          "description": "If you are a horse, I do not want that cake"
        }
      }
    ]
  }
}
So what gives? If the stop filter is stripping English-language stopwords from indexed text attributes, I would expect querying one of those stopwords to return zero results. Do I have to explicitly tell Elasticsearch to use my_analyzer when indexing documents or when processing queries?
For what it's worth, the other filters that I have configured (lowercase and stemmer) appear to work as expected. It's just stop that is giving me trouble.
CodePudding user response:
You are almost there. You just need to map your description field with the custom analyzer you created, as shown below. This ensures that the content of the description field uses my_analyzer at both index and search time. (Simply defining the analyzer under settings makes it available to the index, but no field uses it until the mapping refers to it; otherwise text fields fall back to the default standard analyzer, which does not remove stopwords.)
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text",
        "analyzer": "my_analyzer"   // note this
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase",
            "stemmer",
            "stop"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
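One caveat worth noting: Elasticsearch does not allow changing the analyzer of an existing text field, so if my_index already exists you will need to delete and recreate it with the mapping above (or reindex into a new index) and then re-index your documents. After that, you can confirm which analyzer a field actually uses by sending a POST request to the same http://localhost:30200/my_index/_analyze endpoint from your question, passing a field name instead of an analyzer name:
{
  "field": "description",
  "text": "If you are a horse, I do not want that cake"
}
The response should contain the same lowercased, stemmed, stopword-free tokens as your earlier _analyze test, and the multi_match query for "that" should then return zero hits.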