I basically want the functionality of the inbuilt standard analyzer that additionally tokenizes on underscores.
Currently the standard analyzer will keep brown_fox_has as a singular token but I want [brown, fox, has] instead. The simple analyzer loses some functionality over the standard one, so I want to keep the standard as much as possible.
The docs only shows how you would add filters and other non-tokenizer changes, but I want to keep all of the standard tokenizer, while adding the additional underscore.
I could create a character filter to map _
to -
and the standard tokenizer will do the job for me, but is there a better way?
es.indices.create(index="mine", body={
"settings": {
"analysis": {
"analyzer": {
"default": {
"type": "custom",
# "tokenize_on_chars": ["_"], # i want this to work with the standard tokenizer without using char group
"tokenizer": "standard",
"filter": ["lowercase"]
}
}
},
}
})
res = es.indices.analyze(index="mine", body={
"field": "text",
"text": "the quick brown_fox_has to be split"
})
CodePudding user response:
Use normalizer and define it along with your preferred standard tokenizer
POST /_analyze
{
"char_filter": {
"type": "mapping",
"mappings": [
"_ =>\\u0020" // replace underscore with whitespace
]
},
"tokenizer": "standard",
"text": "the quick brown_fox_has to be split"
}