If a user searches for foo(bar), Elasticsearch breaks it into foo and bar.
What I'm trying to achieve is that when a user types in, say, i want a foo(bar), I match exactly an item named foo(bar). The name is fixed and will be used by a filter, so it is set to the keyword type.
The approximate steps I took (a rough sketch of these settings is shown below):
- define a custom analyzer
- define a dictionary containing foo(bar)
- define a synonym mapping containing abc => foo(bar)
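For context, this is roughly the kind of setup I mean. It is a simplified sketch: the index and analyzer names are placeholders, the synonym rule is inlined instead of loaded from a file, and the custom dictionary part is left out.
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [ "abc => foo(bar)" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonyms" ]
        }
      }
    }
  }
}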
Now, when I search for abc, Elasticsearch translates it to foo(bar), but then breaks it into foo and bar.
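To illustrate with the sketch above (the index and analyzer names are the placeholders from that sketch):
POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "abc"
}
This returns the two tokens foo and bar instead of a single foo(bar) token.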
The question, as you may have guessed, is: how do I preserve special characters in an Elasticsearch analyzer?
I tried using quotes (") in the dictionary file, like "foo(bar)", but it didn't work.
Or is there maybe another way to work around this problem?
By the way, I'm using foo(bar) here just for simplicity; the actual case is much more complicated.
Thanks in advance.
---- edit ----
Thanks to @star67, I now realize that this is an issue with the tokenizer. I am using the plugin medcl/elasticsearch-analysis-ik, which provides an ik_smart tokenizer designed for the Chinese language.
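For illustration, this is the behaviour I see (assuming the IK plugin is installed, so that the ik_smart analyzer is available):
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "foo(bar)"
}
The parentheses are dropped and I get the two tokens foo and bar, just like with the standard tokenizer.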
Although I now know what the real problem is, I still don't know how to solve it. I mean, I have to use the ik_smart tokenizer, but how do I stop it from splitting on certain special characters?
I know that I can define a custom pattern tokenizer, as @star67 suggested:
{
  "tokenizer": {
    "my_tokenizer": {
      "type": "pattern",
      "pattern": "[^\\w\\(\\)]+"
    }
  }
}
But I also want to use the ik_smart tokenizer, because in Chinese, words are not separated by spaces. For example, 弹性搜索很厉害 should be tokenized as ['弹性', '搜索', '很', '厉害']; words can only be split based on a dictionary, so the default behavior is not desirable. What I want is maybe something like this:
{
  "tokenizer": {
    "my_tokenizer": {
      "tokenizer": "ik_smart",
      "ignores": "[\\w\\(\\)]"
    }
  }
}
But I couldn't find an equivalent setting in Elasticsearch.
Do I have to build my own plugin to achieve this?
CodePudding user response:
You might want to use another tokenizer in your custom analyzer for your index.
For example, the standard tokenizer (used below via the standard analyzer for brevity) splits on all non-word characters (\W):
POST _analyze
{
  "analyzer": "standard",
  "text": "foo(bar)"
}
==>
{
  "tokens" : [
    {
      "token" : "foo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "bar",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
Compare this to a custom tokenizer that splits on all non-word characters except ( and ) (that is, [^\w\(\)]+):
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w\\(\\)]+"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "foo(bar)"
}
===>
{
  "tokens" : [
    {
      "token" : "foo(bar)",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    }
  ]
}
I used a Pattern Tokenizer as an example to exclude certain symbols (( and ) in your case) from being treated as token separators.
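If it helps, wiring the custom analyzer into a field mapping would look roughly like this (just a sketch; the field name name stands in for whatever field you search on):
PUT my-index-000001/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
With that, both the indexed values and the queries against the field go through my_analyzer by default, so foo(bar) stays a single token on both sides.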