What I am trying to create is an analyzer that identifies all the hashtags in a text and indexes only those hashtags, nothing else. To do that, I started with a pattern_replace token filter. The regex below matches every word that begins with a hash and replaces it with an empty string; what I want is the exact opposite of this.
PUT analyzer_test
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "hashtag" : {
          "type" : "pattern_replace",
          "preserve_original" : true,
          "pattern" : "[#]\\w+",
          "replacement" : ""
        }
      },
      "analyzer" : {
        "email" : {
          "tokenizer" : "whitespace",
          "filter" : [ "hashtag", "lowercase", "unique" ]
        }
      }
    }
  }
}
I tried something like "[^#]\w+", expecting it to negate the behaviour, but had no luck with it.
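To make the behaviour concrete, here is a rough Python approximation of what I am seeing. It is only a sketch under assumptions: Python's re module stands in for the Java regex engine that Elasticsearch uses, replace_per_token is a hypothetical helper (not an Elasticsearch API) that mimics a whitespace tokenizer followed by a replacement with an empty string, and the title is a shortened copy of one of my documents.

```python
import re

# Shortened title from one of the example documents.
title = ("Here’s 5 underrated #BTC and realistic crypto accounts "
         "that everyone should follow: @Quinnvestments , @JacobOracle")


def replace_per_token(pattern, text):
    """Hypothetical helper: split on whitespace, then blank out every
    regex match inside each token (roughly what the filter does)."""
    return [re.sub(pattern, "", token) for token in text.split()]


# Current filter: the hashtag token is emptied, everything else survives.
current = replace_per_token(r"#\w+", title)

# Attempted negation: "[^#]" only means "one character that is not #",
# so it also matches inside the hashtag and leaves a bare "#" behind.
negated = replace_per_token(r"[^#]\w+", title)

# What I actually want to end up with in the index.
wanted = [tag.lower() for tag in re.findall(r"#\w+", title)]

print(current)
print(negated)
print(wanted)
```

If this approximation is right, the character class does not invert the whole pattern; it just changes which single character is allowed at the start of each match, which would explain why the second attempt strips the hashtag text as well.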
A few examples:
{
  "id" : 26930655,
  "status" : 1,
  "title" : "Here’s 5 underrated #BTC and realistic crypto accounts that everyone should follow: @Quinnvestments , @JacobOracle , @jevauniedaye , @ginsbergonomics , @InspoCrypto",
  "hashtags" : BTC,
  "created_at" : 1622390229,
  "category" : null,
  "language" : 50
},
{
  "id" : 22521897,
  "status" : 1,
  "title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan