removing special characters and words from a url elasticsearch


I am looking for a way to generate both words and special characters as tokens from a URL.

For example, I have the URL https://www.google.com/

I want to generate tokens in Elasticsearch such as https, www, google, com, :, /, /, ., ., /

CodePudding user response:

You can define a custom analyzer with the letter tokenizer as shown below:

PUT index3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email": {
          "tokenizer": "letter",
          "filter": [
            "lowercase"       
          ]
        }
      }
    }
  }
}
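
To have this analyzer applied to your documents at index time (not shown in the original request), you can reference it in the field mapping when creating the index. Here is a minimal sketch of the same index creation with a mapping added, assuming a text field named url; the field name is only illustrative:

PUT index3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email": {
          "tokenizer": "letter",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "url": {
        "type": "text",
        "analyzer": "my_email"
      }
    }
  }
}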

Test it with the _analyze API:

POST index3/_analyze
{
  "text": [
    "https://www.google.com/"
  ],
  "analyzer": "my_email"
  
}

Output:

{
  "tokens" : [
    {
      "token" : "https",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "www",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "google",
      "start_offset" : 12,
      "end_offset" : 18,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "com",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "word",
      "position" : 3
    }
  ]
}
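
Note that the letter tokenizer emits only the letter runs, so the :, /, and . characters from the question are discarded rather than returned as tokens. If you also need those symbols as separate tokens, one possible approach (a sketch that is not part of the original answer; the index, tokenizer, and analyzer names here are only illustrative) is a pattern tokenizer in match mode, where the regex captures either a run of letters and digits or a single non-whitespace punctuation character:

PUT index4
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "url_parts": {
          "type": "pattern",
          "pattern": "[a-zA-Z0-9]+|[^a-zA-Z0-9\\s]",
          "group": 0
        }
      },
      "analyzer": {
        "url_with_symbols": {
          "tokenizer": "url_parts",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}

Analyzing https://www.google.com/ with this analyzer should then return https, :, /, /, www, ., google, ., com, and / as individual tokens.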
