Pattern Replace token filter with regex in Elasticsearch

Time:12-17

What I am trying to create is an analyzer that identifies all the hashtags in a text and indexes only the hashtags, nothing else. For that I started with a pattern_replace token filter. The regex below matches every word that begins with a hash and replaces it with an empty string; what I want is the exact opposite of this.

PUT analyzer_test
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "hashtag" : {
               "type" : "pattern_replace",
               "preserve_original" : true,
               "pattern" :"[#]\\w+",
               "replacement": ""
            }
         },
         "analyzer" : {
            "email" : {
               "tokenizer" : "whitespace",
               "filter" : [ "hashtag", "lowercase",  "unique" ]
            }
         }
      }
   }
}
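To make the behaviour of this filter chain concrete, here is a minimal Python sketch that imitates it outside Elasticsearch. It is only an approximation under stated assumptions: the pattern is taken to be `#\w+` (a hash followed by one or more word characters), `analyze` is a hypothetical helper name, and plain `str.split`, `re.sub`, `str.lower`, and a seen-set stand in for the whitespace tokenizer and the pattern_replace, lowercase, and unique filters.

```python
import re

# Assumed pattern: '#' followed by one or more word characters.
HASHTAG = re.compile(r"#\w+")


def analyze(text, pattern=HASHTAG, replacement=""):
    """Rough stand-in for: whitespace tokenizer -> pattern_replace -> lowercase -> unique."""
    seen = set()
    out = []
    for token in text.split():                    # whitespace tokenizer
        token = pattern.sub(replacement, token)   # pattern_replace
        token = token.lower()                     # lowercase
        if token not in seen:                     # unique
            seen.add(token)
            out.append(token)
    return out


tokens = analyze("5 underrated #BTC and realistic crypto accounts")
print(tokens)
# ['5', 'underrated', '', 'and', 'realistic', 'crypto', 'accounts']
```

In this imitation the hashtag token is blanked out and every other word survives, which is the reverse of the goal.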

I tried something like "[^#]\w+", expecting it to negate the behaviour; however, I had no luck with that.
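A possible reason the negated class does not help, sketched in Python under the assumption that the intended pattern is `[^#]\w+`: `[^#]` means "any single character other than `#`", so the pattern matches the word part of every token, hashtag or not, instead of matching "tokens that are not hashtags". The names `NEGATED` and `CAPTURE` are illustrative only, and Python's `re` is used here in place of the Java regex engine that Elasticsearch uses; the two agree on patterns this simple.

```python
import re

NEGATED = re.compile(r"[^#]\w+")   # one non-'#' character, then word characters
CAPTURE = re.compile(r"#(\w+)")    # '#' followed by a captured word

text = "#bulls gonna overtake the #bears soon #ATH coming"

# Replacing matches of the negated pattern token by token:
stripped = [NEGATED.sub("", tok) for tok in text.split()]
print(stripped)
# ['#', '', '', '', '#', '', '#', '']

# Extracting the captured group instead of replacing anything:
hashtags = CAPTURE.findall(text)
print(hashtags)
# ['bulls', 'bears', 'ATH']
```

The second half is extraction rather than replacement. In Elasticsearch, the `pattern` tokenizer's `group` setting works in that extracting style, so it may be a closer fit for this goal than pattern_replace, though that is a suggestion to verify rather than a tested configuration.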

A few examples:

    {
      "id" : 26930655,
      "status" : 1,
      "title" : "Here’s 5 underrated #BTC and realistic crypto accounts that everyone should follow:  @Quinnvestments , @JacobOracle , @jevauniedaye , @ginsbergonomics , @InspoCrypto",
      "hashtags" : BTC,
      "created_at" : 1622390229,
      "category" : null,
      "language" : 50
    },
    {
      "id" : 22521897,
      "status" : 1,
      "title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan