I am working on a project that performs multilingual full-text search using Elasticsearch. A single field can contain a mix of words from different languages, or transliterations: for example, an English text may contain Armenian words, or an Armenian text may contain Russian words. I am now trying to configure text analysis with language analyzers. Is my analyzer correct, and will it work at all?
PUT /example
{
"settings": {
"analysis": {
"filter": {
"armenian_stop": {
"type": "stop",
"stopwords": "_armenian_"
},
"armenian_keywords": {
"type": "keyword_marker",
"keywords": ["օրինակ"]
},
"armenian_stemmer": {
"type": "stemmer",
"language": "armenian"
},
"russian_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"russian_keywords": {
"type": "keyword_marker",
"keywords": ["пример"]
},
"russian_stemmer": {
"type": "stemmer",
"language": "russian"
},
"graph_synonyms": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt"
}
},
"analyzer": {
"rebuilt_armenian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"armenian_stop",
"armenian_keywords",
"armenian_stemmer",
"russian_stop",
"russian_keywords",
"russian_stemmer",
"graph_synonyms"
]
}
}
}
},
"mappings": {
"properties": {
"age": { "type": "integer" },
"email": { "type": "keyword" },
"name": { "type": "text", "analyzer": "rebuilt_armenian" },
"location": {
"type": "geo_point"
}
}
}
}
CodePudding user response:
I handle multilingual text differently. It seems that in your case you don't know the language before indexing. In my current setup, I create one sub-field per language using "fields" (multi-fields), and each sub-field uses its language-specific analyzer.
{
"settings": {
"analysis": {
"filter": {
"armenian_stop": {
"type": "stop",
"stopwords": "_armenian_"
},
"armenian_keywords": {
"type": "keyword_marker",
"keywords": [
"օրինակ"
]
},
"armenian_stemmer": {
"type": "stemmer",
"language": "armenian"
},
"russian_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"russian_keywords": {
"type": "keyword_marker",
"keywords": [
"пример"
]
},
"russian_stemmer": {
"type": "stemmer",
"language": "russian"
},
"graph_synonyms": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt"
}
},
"analyzer": {
"rebuilt_armenian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"armenian_stop",
"armenian_keywords",
"armenian_stemmer"
]
},
"rebuilt_russian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer"
]
}
}
}
},
"mappings": {
"properties": {
"age": {
"type": "integer"
},
"email": {
"type": "keyword"
},
"name": {
"type": "text",
"fields": {
"ar": {
"type": "text",
"analyzer": "rebuilt_armenian"
},
"ru": {
"type": "text",
"analyzer": "rebuilt_russian"
}
}
},
"location": {
"type": "geo_point"
}
}
}
}
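With this mapping, a document still carries a single "name" value; Elasticsearch indexes it into the "name.ar" and "name.ru" sub-fields as well and analyzes each copy with its own analyzer. A minimal sketch of such a document, meant as the body of POST /example/_doc, is shown here. The index name "example" comes from the question, and all field values are made up for illustration.

```json
{
  "name": "օրինակ пример phone",
  "age": 30,
  "email": "user@example.com",
  "location": { "lat": 40.7942, "lon": 43.84528 }
}
```

No per-language values need to be supplied at index time, which is why this layout fits the case where the language is unknown in advance.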
CodePudding user response:
I don't know what language the text is in, either at index time or at search time. As far as I understand, the query has to target the specific sub-fields: if you search on "name" alone, for example, only the standard analyzer is applied.
{
"query": {
"bool": {
"must": [
{
"query_string": {
"fields": [ "name.ar", "name.ru"],
"query": "phone"
}
}
],
"filter": [
{
"geo_distance": {
"distance": "25km",
"location": {
"lat": 40.79420000 ,
"lon": 43.84528000
}
}
}
]
}
}
}
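If matches on the default "name" field should count as well, one possible variation is a multi_match query of type most_fields, which combines the scores of all matching fields. This is only a sketch, assuming the multi-field mapping from the previous answer is in place. It is meant as the body of GET /example/_search, and the query text is a placeholder.

```json
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "phone",
            "type": "most_fields",
            "fields": ["name", "name.ar", "name.ru"]
          }
        }
      ],
      "filter": [
        {
          "geo_distance": {
            "distance": "25km",
            "location": { "lat": 40.7942, "lon": 43.84528 }
          }
        }
      ]
    }
  }
}
```

Each field analyzes the query text with its own analyzer, so the same input is stemmed as Armenian for "name.ar", as Russian for "name.ru", and left to the standard analyzer for "name".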
CodePudding user response:
You can check your analyzer with the analyze API: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
Enter some mixed text and see if the result is what you want.
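As a rough sketch of that check, the following body could be sent to GET /example/_analyze, assuming the index "example" was created with the "rebuilt_armenian" analyzer from the question. The sample text is made up and simply mixes Armenian, Russian and English words.

```json
{
  "analyzer": "rebuilt_armenian",
  "text": "օրինակ пример example"
}
```

The response lists the tokens that would end up in the index. Replacing the analyzer name with "standard" gives a baseline to compare against.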
Sometimes it is also OK to just use the standard analyzer and forget about removing language-specific stopwords or stemming.