I noticed that the Elasticsearch Italian analyzer does not stem words shorter than 6 characters, which creates a problem for my use case. I tried to solve it by customizing the analyzer, but unfortunately did not succeed. I then set up a hunspell analyzer in the index, but it doesn't scale well, so I want to keep the analyzer algorithmic. Does anyone have a suggestion on how to solve this problem?
CodePudding user response:
The default Italian language stemmer in Elasticsearch is not the standard Snowball stemmer, but a light version called light_italian. I was able to reproduce that it leaves some tokens shorter than 6 characters unstemmed, as you described:
POST /_analyze
{
  "analyzer": "italian",
  "text": "pronto propio logie logia morte"
}
However, Elasticsearch also ships the full Snowball Italian stemmer, available through the stemmer token filter with "language": "italian", and that one does stem these tokens. You can test it with this code:
PUT /my-italian-stemmer-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "language": "italian"
        }
      }
    }
  }
}

POST /my-italian-stemmer-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "pronto propio logie logia morte"
}
If you want to use it, you should rebuild the built-in Italian analyzer and swap the light_italian stemmer out for the full one:
PUT /italian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "italian_elision": {
          "type": "elision",
          "articles": [
            "c", "l", "all", "dall", "dell",
            "nell", "sull", "coll", "pell",
            "gl", "agl", "dagl", "degl", "negl",
            "sugl", "un", "m", "t", "s", "v", "d"
          ],
          "articles_case": true
        },
        "italian_stop": {
          "type": "stop",
          "stopwords": "_italian_"
        },
        "italian_keywords": {
          "type": "keyword_marker",
          "keywords": ["esempio"]
        },
        "italian_stemmer": {
          "type": "stemmer",
          "language": "italian"
        }
      },
      "analyzer": {
        "rebuilt_italian": {
          "tokenizer": "standard",
          "filter": [
            "italian_elision",
            "lowercase",
            "italian_stop",
            "italian_keywords",
            "italian_stemmer"
          ]
        }
      }
    }
  }
}
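As a quick sanity check, once the italian_example index exists you can run the same sample text through the rebuilt analyzer and compare the tokens with the output of the built-in italian analyzer from the first example:

POST /italian_example/_analyze
{
  "analyzer": "rebuilt_italian",
  "text": "pronto propio logie logia morte"
}

The short tokens should now come back stemmed by the full Snowball algorithm instead of passing through unchanged.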