I'm still pretty new to Elasticsearch, but I thought the example below should work. Maybe you can point out what I'm doing wrong.
I'm trying to use a tokenizer with a regex on a string and then lowercase the resulting tokens. Since I can't run two tokenizers on the same field without using a multi-field, I thought of using a token filter.
Example below:
PUT test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(?<!^)(?=[A-Z])"
        }
      },
      "filter": ["lowercase"]
    }
  }
}
and to test it:
POST test-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "SomeSideCar.jpg"
}
Now I would expect to get [some, side, car.jpg], since the regex splits on uppercase letters and the token filter should then lowercase the tokens.
However, this is what I'm getting after running the above:
{
  "tokens": [
    {
      "token": "Some",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "Side",
      "start_offset": 4,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "Car.jpg",
      "start_offset": 8,
      "end_offset": 15,
      "type": "word",
      "position": 2
    }
  ]
}
CodePudding user response:
Great start, you're almost there!
You need to do it this way instead, otherwise your custom analyzer won't use the lowercase filter: the "filter" array must be declared inside the analyzer definition itself, not at the top level of the "analysis" section.
PUT test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(?<!^)(?=[A-Z])"
        }
      }
    }
  }
}
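If you delete and re-create the index with these settings and then run the same _analyze request, the tokens should come back lowercased. A sketch of the expected response (same offsets as in your original output, only the casing changes):
POST test-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "SomeSideCar.jpg"
}

should return something like:

{
  "tokens": [
    {
      "token": "some",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "side",
      "start_offset": 4,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "car.jpg",
      "start_offset": 8,
      "end_offset": 15,
      "type": "word",
      "position": 2
    }
  ]
}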