edge_ngram doesn't work with custom token chars

I'm trying to enable partial-match searching using the edge_ngram tokenizer. I followed the example in the documentation and just added the custom_token_chars setting as follows:

PUT test-ngrams
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ],
          "custom_token_chars": [
            "!"
          ]
        }
      }
    }
  }
}

I then tested the analyzer against text containing the ! character using the _analyze API:

POST test-ngrams/_analyze
{
  "analyzer": "my_analyzer",
  "text": "!Quick Foxes."
}

But the result I'm getting ignores the !:

{
  "tokens" : [
    {
      "token" : "Qu",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Qui",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Quic",
      "start_offset" : 1,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "Quick",
      "start_offset" : 1,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "Fo",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "Fox",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "Foxe",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "Foxes",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "word",
      "position" : 7
    }
  ]
}

Answer:

Your tokenizer configuration is incomplete: custom must also appear in the token_chars list, otherwise the characters you define in custom_token_chars are ignored.

Character classes may be any of the following:

  • letter — for example a, b, ï or 京
  • digit — for example 3 or 7
  • whitespace — for example " " or "\n"
  • punctuation — for example ! or "
  • symbol — for example $ or √
  • custom — custom characters, which need to be set using the custom_token_chars setting.

Source: official documentation

PUT test-ngrams
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit",
            "custom"
          ],
          "custom_token_chars": [
            "!"
          ]
        }
      }
    }
  }
}
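With custom included in token_chars, re-running the same _analyze request should no longer strip the !. As a sketch of what to expect (abbreviated; the exact offsets and positions below follow from the edge_ngram semantics for the "!Quick Foxes." input, so verify against your own cluster):

POST test-ngrams/_analyze
{
  "analyzer": "my_analyzer",
  "text": "!Quick Foxes."
}

{
  "tokens" : [
    { "token" : "!Q",     "start_offset" : 0, "end_offset" : 2,  "type" : "word", "position" : 0 },
    { "token" : "!Qu",    "start_offset" : 0, "end_offset" : 3,  "type" : "word", "position" : 1 },
    { "token" : "!Qui",   "start_offset" : 0, "end_offset" : 4,  "type" : "word", "position" : 2 },
    { "token" : "!Quic",  "start_offset" : 0, "end_offset" : 5,  "type" : "word", "position" : 3 },
    { "token" : "!Quick", "start_offset" : 0, "end_offset" : 6,  "type" : "word", "position" : 4 },
    { "token" : "Fo",     "start_offset" : 7, "end_offset" : 9,  "type" : "word", "position" : 5 },
    ...
    { "token" : "Foxes",  "start_offset" : 7, "end_offset" : 12, "type" : "word", "position" : 8 }
  ]
}

Note that the first token now starts at offset 0 and keeps the leading !, because ! is now treated as a token character rather than a separator.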