How to remove all numbers from the Elasticsearch term vector?

How can I remove all numbers from the Elasticsearch term vector? I would like the term vector to contain only real text/words, with no numbers or invalid strings.

Here is my index definition:

{
    "mappings": {
        "_source": {
            "enabled": true
        },
        "properties": {
            "attachment.content": {
                "analyzer": "english_analyzer",
                "term_vector": "yes",
                "type": "text"
            },
            "class": {
                "type": "integer"
            },
            "label": {
                "type": "integer"
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "english_analyzer": {
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer"
                    ],
                    "tokenizer": "standard",
                    "type": "custom"
                }
            },
            "filter": {
                "english_possessive_stemmer": {
                    "language": "possessive_english",
                    "type": "stemmer"
                },
                "english_stemmer": {
                    "language": "english",
                    "type": "stemmer"
                },
                "english_stop": {
                    "stopwords": "_english_",
                    "type": "stop"
                }
            }
        },
        "number_of_shards": 1
    }
}

I tried it with a conditional token filter which looks like this:

{
  "type": "condition",
  "filter": [ "remove" ],
  "script": {
    "source": "token.getType() == '<NUM>'"
  }
}

but I don't know how to remove the token if the condition is true. Obviously the "remove" filter does not exist.

Is there a filter to remove the token from the term vector, or is there a better way to do it?
Where can I find documentation about the "source" part of the conditional scripts?

Thanks for your help

User response:

You need to add a keep types token filter (type "keep_types") to your analyzer's filter array and keep only <ALPHANUM> tokens.
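
A minimal sketch of that suggestion, assuming the filter meant here is Elasticsearch's keep types token filter (type "keep_types"). The filter name "keep_alphanum" is made up for illustration; everything else mirrors the analysis settings from the question:

```json
{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_analyzer": {
                    "filter": [
                        "keep_alphanum",
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer"
                    ],
                    "tokenizer": "standard",
                    "type": "custom"
                }
            },
            "filter": {
                "keep_alphanum": {
                    "type": "keep_types",
                    "types": [ "<ALPHANUM>" ]
                },
                "english_possessive_stemmer": {
                    "language": "possessive_english",
                    "type": "stemmer"
                },
                "english_stemmer": {
                    "language": "english",
                    "type": "stemmer"
                },
                "english_stop": {
                    "stopwords": "_english_",
                    "type": "stop"
                }
            }
        },
        "number_of_shards": 1
    }
}
```

Note that the standard tokenizer tags only purely numeric tokens as <NUM>; a mixed token such as abc123 is still typed <ALPHANUM> and would survive this filter. Keeping only <ALPHANUM> also drops any other token types the tokenizer emits (for example <IDEOGRAPHIC>). If that is not wanted, setting "mode": "exclude" together with "types": [ "<NUM>" ] should remove just the numeric tokens instead. Term vectors are built at index time, so existing documents would most likely have to be reindexed before the change shows up.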