Elasticsearch word_delimiter_graph split token on specific delimiter only

Time:01-05

I want an Elasticsearch token filter that acts like word_delimiter_graph but splits tokens only on specific delimiters (if I am not mistaken, word_delimiter_graph does not accept a custom list of delimiters by default).

For example, I want to split tokens on the - delimiter only:

i-pod -> [i-pod, i, pod]

i_pod -> [i_pod] (since I want to split on - only, not on any other character)

How can I achieve that?

Thank you!

CodePudding user response:

I used the type_table parameter, which the documentation describes as follows:

(Optional, array of strings) Array of custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.

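For example, mapping the plus (+) and hyphen (-) characters as alphanumeric means they won't be treated as delimiters. In the tests below, the underscore (_) is mapped to ALPHA instead, so the filter stops splitting on _ while still splitting on -.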

Tests:

i-pad

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "preserve_original": true,
      "type_table": [ "_ => ALPHA" ]
    }
  ],
  "text": "i-pad"
}

Tokens:

{
  "tokens": [
    {
      "token": "i-pad",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "pad",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 1
    }
  ]
}

i_pad

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "preserve_original": true,
      "type_table": [ "_ => ALPHA" ]
    }
  ],
  "text": "i_pad"
}

Tokens:

{
  "tokens": [
    {
      "token": "i_pad",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
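
To use this outside of _analyze, the same filter can be registered in the index settings and attached to a custom analyzer. The body below is a minimal sketch for a create-index request (for example PUT /my-index, where the index name is hypothetical). The names example_hyphen_split and example_hyphen_analyzer are made up for illustration. Setting split_on_case_change and split_on_numerics to false is an assumption about what "split on - only" should mean, since both default to true and would otherwise split tokens such as iPod or pod2. The extra type_table entries for + and . are likewise only examples: any other character that should not act as a delimiter would need its own mapping.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "example_hyphen_split": {
          "type": "word_delimiter_graph",
          "preserve_original": true,
          "split_on_case_change": false,
          "split_on_numerics": false,
          "type_table": [ "_ => ALPHA", "+ => ALPHA", ". => ALPHA" ]
        }
      },
      "analyzer": {
        "example_hyphen_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "example_hyphen_split" ]
        }
      }
    }
  }
}
```

The analyzer can then be checked with GET /my-index/_analyze, passing "analyzer": "example_hyphen_analyzer" and a "text" value in the request body. Because the keyword tokenizer emits the whole input as a single token, the whitespace tokenizer may be a better fit for multi-word text.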