How can I remove all numbers from the Elasticsearch term vector? I would like to have only real text/words in the term vector, and not numbers or invalid strings.
Here is my index definition:
{
  "mappings": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "attachment.content": {
        "analyzer": "english_analyzer",
        "term_vector": "yes",
        "type": "text"
      },
      "class": {
        "type": "integer"
      },
      "label": {
        "type": "integer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "english_analyzer": {
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_stemmer"
          ],
          "tokenizer": "standard",
          "type": "custom"
        }
      },
      "filter": {
        "english_possessive_stemmer": {
          "language": "possessive_english",
          "type": "stemmer"
        },
        "english_stemmer": {
          "language": "english",
          "type": "stemmer"
        },
        "english_stop": {
          "stopwords": "_english_",
          "type": "stop"
        }
      }
    },
    "number_of_shards": 1
  }
}
I tried it with a conditional token filter which looks like this:
{
  "type": "condition",
  "filter": [ "remove" ],
  "script": {
    "source": "token.getType() == '<NUM>'"
  }
}
but I don't know how to remove the token if the condition is true. Obviously the "remove" filter does not exist.
Is there a filter to remove the token from the term vector, or is there a better way to do it?
Where can I find documentation about the "source" part of the conditional scripts?
Thanks for your help
CodePudding user response:
You need to add the Keep Types token filter (keep_types) to your analyzer's filter array and keep only <ALPHANUM> tokens.
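As a rough sketch of what that could look like, assuming the filter meant here is Elasticsearch's keep_types token filter: the fragment below shows only the parts of the "settings" block that change, and the filter name "keep_alphanum" is a made-up label for illustration. The existing english_possessive_stemmer, english_stemmer and english_stop definitions would stay as they are in the question.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "keep_alphanum",
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_stemmer"
          ]
        }
      },
      "filter": {
        "keep_alphanum": {
          "type": "keep_types",
          "types": [ "<ALPHANUM>" ]
        }
      }
    }
  }
}
```

Note that the standard tokenizer tags mixed tokens such as "abc123" as <ALPHANUM>, so this should drop purely numeric tokens but will probably not catch every "invalid string". It is worth checking the result with the _analyze API before reindexing.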