I want to use an Elasticsearch token filter that acts like word_delimiter_graph but splits tokens only on specific delimiters (if I am not wrong, the default word_delimiter_graph does not allow a custom delimiter list).
For example, I want to split tokens only on the - delimiter:
i-pod
-> [i-pod, i, pod]
i_pod
-> [i_pod]
(since I only want to split on - and not on any other character.)
How can I achieve that?
Thank you!
CodePudding user response:
I used the type_table parameter. From the word_delimiter_graph documentation:
(Optional, array of strings) Array of custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.
For example, the following array maps the plus (+) and hyphen (-) characters as alphanumeric, which means they won't be treated as delimiters.
Here, mapping "_ => ALPHA" tells the filter to treat _ as an alphanumeric character, so it is no longer a delimiter and - still triggers a split.
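For index-time use, the same filter can be registered in a custom analyzer via the index settings. A minimal sketch, assuming a local cluster; the index name my-index and the filter/analyzer names are placeholders:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "split_on_hyphen_only": {
          "type": "word_delimiter_graph",
          "preserve_original": true,
          "type_table": [ "_ => ALPHA" ]
        }
      },
      "analyzer": {
        "hyphen_splitter": {
          "tokenizer": "keyword",
          "filter": [ "split_on_hyphen_only" ]
        }
      }
    }
  }
}
```

You can then reference "hyphen_splitter" as the analyzer of a text field in your mapping.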
Tests:
i-pad
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "preserve_original": true,
      "type_table": [ "_ => ALPHA" ]
    }
  ],
  "text": "i-pad"
}
Tokens:
{
  "tokens": [
    {
      "token": "i-pad",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "pad",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 1
    }
  ]
}
i_pad
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "preserve_original": true,
      "type_table": [ "_ => ALPHA" ]
    }
  ],
  "text": "i_pad"
}
Tokens:
{
  "tokens": [
    {
      "token": "i_pad",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
}
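The same idea generalizes: list every character you do not want to split on in type_table, leaving only - as a delimiter. A sketch with _ and + as illustrative choices (the input text is made up for the example):

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "preserve_original": true,
      "type_table": [ "_ => ALPHA", "+ => ALPHA" ]
    }
  ],
  "text": "foo_bar-baz+qux"
}
```

Note that word_delimiter_graph also splits on case changes and letter-digit transitions by default; if you truly want splits on - only, also set "split_on_case_change": false and "split_on_numerics": false in the filter.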