Elasticsearch multilingual search

I am working on a project that needs multilingual full-text search with Elasticsearch. A single field can contain words from different languages, or transliterated text: for example, an English text may contain Armenian words, or an Armenian text may contain Russian words. I am now trying to configure text analysis with the language analyzers. Is my analyzer correct, and will it work at all?

PUT /example
{
  "settings": {
    "analysis": {
      "filter": {
        "armenian_stop": {
          "type": "stop",
          "stopwords": "_armenian_"
        },
        "armenian_keywords": {
          "type": "keyword_marker",
          "keywords": ["օրինակ"]
        },
        "armenian_stemmer": {
          "type": "stemmer",
          "language": "armenian"
        },
        "russian_stop": {
          "type": "stop",
          "stopwords": "_russian_"
        },
        "russian_keywords": {
          "type": "keyword_marker",
          "keywords": ["пример"]
        },
        "russian_stemmer": {
          "type": "stemmer",
          "language": "russian"
        },
        "graph_synonyms": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      },
      "analyzer": {
        "rebuilt_armenian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "armenian_stop",
            "armenian_keywords",
            "armenian_stemmer",
            "russian_stop",
            "russian_keywords",
            "russian_stemmer",
            "graph_synonyms"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "age": { "type": "integer" },
      "email": { "type": "keyword" },
      "name": { "type": "text", "analyzer": "rebuilt_armenian" },
      "location": { "type": "geo_point" }
    }
  }
}

CodePudding user response：

I handle multilingual text differently. It seems that in your case you don't know the language of the text before indexing. In my current setup, I create one subfield per language using "fields", and each subfield uses its language-specific analyzer.

{
  "settings": {
    "analysis": {
      "filter": {
        "armenian_stop": {
          "type": "stop",
          "stopwords": "_armenian_"
        },
        "armenian_keywords": {
          "type": "keyword_marker",
          "keywords": [
            "օրինակ"
          ]
        },
        "armenian_stemmer": {
          "type": "stemmer",
          "language": "armenian"
        },
        "russian_stop": {
          "type": "stop",
          "stopwords": "_russian_"
        },
        "russian_keywords": {
          "type": "keyword_marker",
          "keywords": [
            "пример"
          ]
        },
        "russian_stemmer": {
          "type": "stemmer",
          "language": "russian"
        },
        "graph_synonyms": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      },
      "analyzer": {
        "rebuilt_armenian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "armenian_stop",
            "armenian_keywords",
            "armenian_stemmer"
          ]
        },
        "rebuilt_russian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "russian_stop",
            "russian_keywords",
            "russian_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "age": {
        "type": "integer"
      },
      "email": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "fields": {
          "ar": {
            "type": "text",
            "analyzer": "rebuilt_armenian"
          },
          "ru": {
            "type": "text",
            "analyzer": "rebuilt_russian"
          }
        }
      },
      "location": {
        "type": "geo_point"
      }
    }
  }
}
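
To illustrate how the "fields" mapping behaves: a document only needs to supply "name" once, and Elasticsearch analyzes that same value separately for each subfield (name.ar and name.ru) with that subfield's analyzer. A minimal sketch, assuming the settings and mapping above were used to create an index called example (the index name comes from the question; the document ID and all field values are made up for illustration):

```json
PUT /example/_doc/1
{
  "age": 30,
  "email": "user@example.com",
  "name": "օրինակ пример phone",
  "location": { "lat": 40.7942, "lon": 43.84528 }
}
```

Nothing language-specific has to be decided at index time with this layout; the choice of analyzer only shows up in which subfields a query targets.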

CodePudding user response：

I don't know what language the text is in either at index time or at search time. As far as I understand, the search has to target the specific subfields; if I search only "name", for example, the standard analyzer will be used.

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "fields": [ "name.ar", "name.ru"],
            "query": "phone"
          }
        }
      ],
      "filter": [
        {
          "geo_distance": {
            "distance": "25km",
            "location": {
              "lat": 40.79420000,
              "lon": 43.84528000
            }
          }
        }
      ]
    }
  }
}
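
If the root "name" field (standard analyzer) should take part in the search too, it can simply be listed next to the subfields. The following is a sketch rather than a tuned query, assuming the multi-field mapping from the earlier answer and an index called example. It uses multi_match with type most_fields, which matches on any of the listed fields and combines their scores:

```json
GET /example/_search
{
  "query": {
    "multi_match": {
      "query": "phone",
      "type": "most_fields",
      "fields": ["name", "name.ar", "name.ru"]
    }
  }
}
```

The geo_distance filter from the query above can be kept by wrapping this multi_match in the same bool query.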

CodePudding user response：

You can check your analyzer with the Analyze API: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html

Enter some mixed-language text and see whether the resulting tokens are what you want.
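
As a rough example of such a check, the request below runs a mixed Armenian/Russian/English string through the rebuilt_armenian analyzer defined in the question. It assumes the index example already exists with those settings; the sample text is made up. The response lists the tokens that would actually be indexed.

```json
GET /example/_analyze
{
  "analyzer": "rebuilt_armenian",
  "text": "օրինակ пример example"
}
```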

Sometimes it is also fine to just use the standard analyzer and do without language-specific stopword removal and stemming.
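
To see what that simpler option produces, sample text can be run through the built-in standard analyzer, which needs no index or custom settings. It splits on Unicode word boundaries and lowercases the tokens, applies no stemming, and has stop-word removal disabled by default. The sample text here is made up:

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "օրինակ пример example"
}
```

Comparing these tokens with the output of a language-specific analyzer on the same text gives a quick sense of whether the extra configuration is worth it for your data.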