Can someone help me see where my understanding of an Elasticsearch analyzer goes wrong?
I have an index containing various fields; one in particular is:
"categories": {
"type": "text",
"analyzer": "words_only_analyser",
"copy_to": "all",
"fields": {
"tokens": {
"type": "text",
"analyzer": "words_only_analyser",
"term_vector": "yes",
"fielddata" : True
}
}
}
The words_only_analyser looks like:
"words_only_analyser":{
"type":"custom",
"tokenizer":"words_only_tokenizer",
"char_filter" : ["html_strip"],
"filter":[ "lowercase", "asciifolding", "stop_filter", "kstem" ]
},
and the words_only_tokenizer looks like:
"tokenizer":{
"words_only_tokenizer":{
"type":"pattern",
"pattern":"[^\\w-] "
}
}
My understanding of the pattern [^\\w-] in the tokenizer is that it will split a sentence at any occurrence of \, w, or -. For example, given the pattern, a sentence of:
seasonal-christmas-halloween this is a description about halloween
I expect to see:
[seasonal, christmas, hallo, een this is a description about hallo, een]
I was able to confirm this behaviour on https://regex101.com/
However, when I run words_only_analyser on the sentence above:
curl -XGET localhost:9200/contextual/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer":"words_only_analyser","text":"seasonal-christmas-halloween this is a description about halloween"}'
I get,
{
  "tokens" : [
    {
      "token" : "seasonal-christmas-halloween",
      "start_offset" : 0,
      "end_offset" : 28,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "description",
      "start_offset" : 39,
      "end_offset" : 50,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "halloween",
      "start_offset" : 57,
      "end_offset" : 66,
      "type" : "word",
      "position" : 6
    }
  ]
}
This tells me the sentence gets tokenized to:
[seasonal-christmas-halloween, description, halloween]
It appears to me that the tokenizer pattern is not being applied the way I expect. Can someone explain where my understanding is incorrect?
CodePudding user response:
There are a few things that change the final tokens produced by your analyzer: first the tokenizer runs, and after that the token filters (for example, your stop_filter removes stop words like this, is, and a).
Also, [^\w-] is a negated character class: it matches any single character that is neither a word character (\w covers letters, digits, and underscore) nor a hyphen, for example a space. It does not match the literal characters \, w, or -. The pattern tokenizer splits the text wherever the pattern matches, so your sentence is split on whitespace and the hyphens are kept inside the tokens.
You can use the _analyze API to test your tokenizer on its own as well. I recreated your configuration to try it out.
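For reference, a minimal settings block that wires the analyzer and tokenizer together could look like the sketch below. The question does not show the stop_filter definition, so a standard stop filter with the default English stop word list is assumed here:
PUT /contextual
{
  "settings": {
    "analysis": {
      "analyzer": {
        "words_only_analyser": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "words_only_tokenizer",
          "filter": ["lowercase", "asciifolding", "stop_filter", "kstem"]
        }
      },
      "tokenizer": {
        "words_only_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w-]"
        }
      },
      "filter": {
        "stop_filter": {
          "type": "stop", // assumed definition, not shown in the question
          "stopwords": "_english_"
        }
      }
    }
  }
}
Running the tokenizer by itself then produces the tokens below: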
POST _analyze
{
  "tokenizer": "words_only_tokenizer", // Note `tokenizer` here
  "text": "seasonal-christmas-halloween this is a description about halloween"
}
Result
{
  "tokens": [
    {
      "token": "seasonal-christmas-halloween",
      "start_offset": 0,
      "end_offset": 28,
      "type": "word",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 29,
      "end_offset": 33,
      "type": "word",
      "position": 1
    },
    {
      "token": "is",
      "start_offset": 34,
      "end_offset": 36,
      "type": "word",
      "position": 2
    },
    {
      "token": "a",
      "start_offset": 37,
      "end_offset": 38,
      "type": "word",
      "position": 3
    },
    {
      "token": "description",
      "start_offset": 39,
      "end_offset": 50,
      "type": "word",
      "position": 4
    },
    {
      "token": "about",
      "start_offset": 51,
      "end_offset": 56,
      "type": "word",
      "position": 5
    },
    {
      "token": "halloween",
      "start_offset": 57,
      "end_offset": 66,
      "type": "word",
      "position": 6
    }
  ]
}
You can notice the stop words are still present, as the tokenizer is only breaking the text on whitespace (the characters matched by [^\w-]) and is not splitting on -, since the hyphen is excluded from the character class.
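If you want to see the effect of excluding the hyphen, you can pass an inline tokenizer definition to _analyze and drop the - from the pattern. This request is only an illustrative sketch:
POST _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "[^\\w]+" // hyphen no longer excluded, so "-" now splits tokens too
  },
  "text": "seasonal-christmas-halloween this is a description about halloween"
}
With that pattern, seasonal-christmas-halloween breaks apart into seasonal, christmas, and halloween as well.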
Now if you run the same text through the analyzer, which also applies the filters, the stop words are removed and you get the tokens below.
POST _analyze
{
  "analyzer": "words_only_analyser",
  "text": "seasonal-christmas-halloween this is a description about halloween"
}
Result
{
  "tokens": [
    {
      "token": "seasonal-christmas-halloween",
      "start_offset": 0,
      "end_offset": 28,
      "type": "word",
      "position": 0
    },
    {
      "token": "description",
      "start_offset": 39,
      "end_offset": 50,
      "type": "word",
      "position": 4
    },
    {
      "token": "about",
      "start_offset": 51,
      "end_offset": 56,
      "type": "word",
      "position": 5
    },
    {
      "token": "halloween",
      "start_offset": 57,
      "end_offset": 66,
      "type": "word",
      "position": 6
    }
  ]
}