Elasticsearch: Problem with Italian analyzer

Time:11-02

I noticed that the Elasticsearch Italian analyzer does not stem words shorter than 6 characters, which is a problem for my work. I tried to solve it by customizing the analyzer, but did not succeed. I then added a Hunspell analyzer to the index, but it isn't very scalable, so I would like to keep the analyzer algorithmic. Does anyone have a suggestion on how to solve this problem?

CodePudding user response:

The default Italian stemmer in Elasticsearch is not the regular Snowball stemmer but a light version called light_italian. I was able to reproduce that it doesn't stem some tokens shorter than 6 characters, as you described:

POST /_analyze
{
  "analyzer": "italian",
  "text": "pronto propio logie logia morte"
}

However, the stemmer token filter also supports a second Italian language option, italian (the Snowball-based stemmer), which does stem these tokens. You can test it with this code:

PUT /my-italian-stemmer-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "language": "italian"
        }
      }
    }
  }
}

POST /my-italian-stemmer-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "pronto propio logie logia morte"
}
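
To compare the two stemmers side by side without creating an index, the _analyze API also accepts an inline filter definition. The request below is a minimal sketch of the same test with light_italian as the stemmer language; change it to italian and rerun to see which tokens differ:

POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "stemmer",
      "language": "light_italian"
    }
  ],
  "text": "pronto propio logie logia morte"
}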

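If you want to use it, rebuild the built-in italian analyzer as a custom analyzer and replace its light_italian stemmer with italian. In the example below, the italian_keywords filter is only a placeholder; remove it unless there are words you want to exclude from stemming: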

PUT /italian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "italian_elision": {
          "type": "elision",
          "articles": [
                "c", "l", "all", "dall", "dell",
                "nell", "sull", "coll", "pell",
                "gl", "agl", "dagl", "degl", "negl",
                "sugl", "un", "m", "t", "s", "v", "d"
          ],
          "articles_case": true
        },
        "italian_stop": {
          "type":       "stop",
          "stopwords":  "_italian_" 
        },
        "italian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["esempio"] 
        },
        "italian_stemmer": {
          "type":       "stemmer",
          "language":   "italian"
        }
      },
      "analyzer": {
        "rebuilt_italian": {
          "tokenizer":  "standard",
          "filter": [
            "italian_elision",
            "lowercase",
            "italian_stop",
            "italian_keywords",
            "italian_stemmer"
          ]
        }
      }
    }
  }
}
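
Once the index exists, you can check the rebuilt analyzer the same way and then attach it to a text field. This is only a sketch: the field name body is hypothetical, so substitute whichever field you want analyzed this way.

POST /italian_example/_analyze
{
  "analyzer": "rebuilt_italian",
  "text": "pronto propio logie logia morte"
}

PUT /italian_example/_mapping
{
  "properties": {
    "body": {
      "type": "text",
      "analyzer": "rebuilt_italian"
    }
  }
}

Documents indexed before an analyzer change keep their old tokens, so existing data needs to be reindexed for the new stemming to apply.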