How to remove all numbers from the Elasticsearch term vector?

Time:11-24

How can I remove all numbers from the Elasticsearch term vector? I would like the term vector to contain only real text/words, with no numbers or invalid strings.

Here is my index definition:

{
    "mappings": {
        "_source": {
            "enabled": true
        },
        "properties": {
            "attachment.content": {
                "analyzer": "english_analyzer",
                "term_vector": "yes",
                "type": "text"
            },
            "class": {
                "type": "integer"
            },
            "label": {
                "type": "integer"
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "english_analyzer": {
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer"
                    ],
                    "tokenizer": "standard",
                    "type": "custom"
                }
            },
            "filter": {
                "english_possessive_stemmer": {
                    "language": "possessive_english",
                    "type": "stemmer"
                },
                "english_stemmer": {
                    "language": "english",
                    "type": "stemmer"
                },
                "english_stop": {
                    "stopwords": "_english_",
                    "type": "stop"
                }
            }
        },
        "number_of_shards": 1
    }
}

I tried it with a conditional token filter which looks like this:

{
  "type": "condition",
  "filter": [ "remove" ],
  "script": {
    "source": "token.getType() == '<NUM>'"
  }
}

but I don't know how to remove the token if the condition is true. Obviously the "remove" filter does not exist.

  1. Is there a filter to remove the token from the term vector, or is there a better way to do it?

  2. Where can I find documentation about the "source" part of the conditional scripts?

Thanks for your help

CodePudding user response:

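You need to include a Keep Types token filter (type "keep_types") in your filter array and keep only <ALPHANUM> tokens. Unlike the "condition" filter, which only applies other filters to matching tokens, "keep_types" drops every token whose type is not in its list.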
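
As a rough sketch of how that could look with the analyzer from the question: the filter name "words_only" is made up for this example, and placing it first in the chain is an assumption rather than a requirement. The rest of the "analysis" section is copied from the question unchanged.

```json
{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "words_only",
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer"
                    ]
                }
            },
            "filter": {
                "words_only": {
                    "type": "keep_types",
                    "types": [ "<ALPHANUM>" ]
                },
                "english_possessive_stemmer": {
                    "language": "possessive_english",
                    "type": "stemmer"
                },
                "english_stemmer": {
                    "language": "english",
                    "type": "stemmer"
                },
                "english_stop": {
                    "stopwords": "_english_",
                    "type": "stop"
                }
            }
        }
    }
}
```

Note that the standard tokenizer should tag mixed tokens such as "abc123" as <ALPHANUM>, so this only drops purely numeric tokens (type <NUM>) and other non-alphanumeric types. If the goal is just to drop numbers while keeping every other token type, "keep_types" also accepts "mode": "exclude" together with "types": [ "<NUM>" ]. Since analysis settings cannot be changed on an open index, you will probably need to recreate the index (or close it, update the settings, and reindex) before the term vectors reflect the new filter.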
