ElasticSearch custom analyzer breaks words containing special characters

If a user searches for foo(bar), Elasticsearch breaks it into foo and bar.

What I'm trying to achieve: when a user types in, say, i want a foo(bar), I want to match exactly an item named foo(bar). The name is fixed and will be used by a filter, so the field is set to the keyword type.

The approximate steps I took (roughly sketched in the settings example after this list):

  1. define a custom analyzer
  2. define a dictionary containing foo(bar)
  3. define a synonym mapping containing abc => foo(bar)
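
A minimal sketch of how these pieces fit together, with placeholder names (my-index, my_analyzer, my_synonyms); it assumes the analysis-ik plugin from the edit below is installed, and the foo(bar) dictionary entry is loaded through that plugin's own config file, which I'm not showing here:

PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["abc => foo(bar)"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["my_synonyms"]
        }
      }
    }
  }
}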

Now, when I search for abc, Elasticsearch translates it to foo(bar), but then breaks it into foo and bar.
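
This is easy to see with the _analyze API (index and analyzer names are placeholders for my real setup):

POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "abc"
}

The output lists foo and bar as two separate tokens instead of a single foo(bar) token.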

The question, as you may have guessed, is: how do I preserve special characters in an Elasticsearch analyzer?

I tried using quotes (") in the dictionary file, like "foo(bar)", but it didn't work.

Or is there maybe another way to work around this problem?

By the way, I'm using foo(bar) here just for simplicity; the actual case is much more complicated.

Thanks in advance.

---- edit ----

Thanks to @star67, I now realize this is an issue with the tokenizer.

I am using the medcl/elasticsearch-analysis-ik plugin, which provides an ik_smart tokenizer designed for Chinese.

Although I now understand the real problem, I still don't know how to solve it. I have to use the ik_smart tokenizer, so how do I modify it to stop splitting on certain special characters?

I know that I can define a custom pattern tokenizer like this, as @star67 suggested:

{
  "tokenizer": {
    "my_tokenizer": {
      "type": "pattern",
      "pattern": "[^\\w\\(\\)] "
    }
  }
}

But I also want to use the ik_smart tokenizer, because Chinese words are not separated by spaces. For example, 弹性搜索很厉害 should be tokenized as ['弹性', '搜索', '很', '厉害']; words can only be split based on a dictionary, so the default tokenizers are not suitable. What I want is maybe something like this:

{
  "tokenizer": {
    "my_tokenizer": {
      "tokenizer": "ik_smart",
      "ignores": "[\\w\\(\\)] "
    }
  }
}

And I couldn't find an equivalent setting in Elasticsearch.

Do I have to build my own plugin to achieve this?
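
For reference, this is roughly how I check the current tokenizer behavior (it assumes the analysis-ik plugin is installed, so that ik_smart is available):

POST _analyze
{
  "tokenizer": "ik_smart",
  "text": "弹性搜索很厉害"
}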

CodePudding user response:

You might want to use another tokenizer in your custom analyzer for your index.

For example, the standard tokenizer (invoked below via the standard analyzer for brevity) splits on non-word characters (\W+):

POST _analyze
{
  "analyzer": "standard",
  "text": "foo(bar)"
}

==>

{
  "tokens" : [
    {
      "token" : "foo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "bar",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Compare this to a custom tokenizer that splits on all non-word characters except ( and ) (i.e. the pattern [^\w\(\)]+):

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w\\(\\)] "
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "foo(bar)"
}

===>

{
  "tokens" : [
    {
      "token" : "foo(bar)",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    }
  ]
}

I used a Pattern Tokenizer as an example, to exclude certain symbols (( and ) in your case) from being split on during tokenization.
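
If you also need the abc => foo(bar) synonym from your question, here is a rough sketch of how it could sit on top of this tokenizer (untested; my_synonyms and the index name are placeholders). Since synonym rules are parsed with the analyzer's own chain, the rule's right-hand side should also survive as a single foo(bar) token:

PUT my-index-000002
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["abc => foo(bar)"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w\\(\\)]+"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": ["my_synonyms"]
        }
      }
    }
  }
}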
