Unable to understand elasticsearch analyser regex

Can someone help me figure out where my understanding of an Elasticsearch analyser goes wrong?

I have an index containing various fields, one in particular is:

"categories": {
    "type": "text",
    "analyzer": "words_only_analyser",
    "copy_to": "all",
    "fields": {
         "tokens": {
             "type": "text",
             "analyzer": "words_only_analyser",
             "term_vector": "yes",
             "fielddata" : True
          }
      }
}

The words_only_analyser looks like:

"words_only_analyser":{
    "type":"custom",
    "tokenizer":"words_only_tokenizer",
    "char_filter" : ["html_strip"],
    "filter":[ "lowercase", "asciifolding", "stop_filter", "kstem" ]
},

and the words_only_tokenizer looks like:

"tokenizer":{
    "words_only_tokenizer":{
    "type":"pattern",
    "pattern":"[^\\w-] "
    }
}
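
For completeness, both snippets live under settings.analysis when the index is created. A minimal sketch of the full request might look like the following (the index name contextual is taken from the curl command further down; the definition of stop_filter is not shown in the question, so a standard English stop token filter is assumed here):

PUT contextual
{
    "settings": {
        "analysis": {
            "analyzer": {
                "words_only_analyser": {
                    "type": "custom",
                    "tokenizer": "words_only_tokenizer",
                    "char_filter": ["html_strip"],
                    "filter": ["lowercase", "asciifolding", "stop_filter", "kstem"]
                }
            },
            "tokenizer": {
                "words_only_tokenizer": {
                    "type": "pattern",
                    "pattern": "[^\\w-]"
                }
            },
            "filter": {
                "stop_filter": {
                    "type": "stop",
                    "stopwords": "_english_"
                }
            }
        }
    }
}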

My understanding of the pattern [^\\w-] in the tokenizer is that it will tokenize a sentence by splitting it at any occurrence of \, w, or -. For example, given the pattern, a sentence such as:

seasonal-christmas-halloween this is a description about halloween

I expect to see:

[seasonal, christmas, hallo, een this is a description about hallo, een]

I can confirm the above behaviour at https://regex101.com/

However, when I run words_only_analyser on the sentence above:

curl -XGET localhost:9200/contextual/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer":"words_only_analyser","text":"seasonal-christmas-halloween this is a description about halloween"}'

I get:

{
  "tokens" : [
    {
      "token" : "seasonal-christmas-halloween",
      "start_offset" : 0,
      "end_offset" : 28,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "description",
      "start_offset" : 39,
      "end_offset" : 50,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "halloween",
      "start_offset" : 57,
      "end_offset" : 66,
      "type" : "word",
      "position" : 6
    }
  ]
}

This tells me the sentence gets tokenized to:

[seasonal-christmas-halloween, description, halloween]

It appears to me that the tokenizer pattern is not being applied. Can someone explain where my understanding is incorrect?

CodePudding user response:

There are a few things that change the final tokens produced by your analyzer: first the tokenizer, and after that the token filters (for example, your stop_filter removes stop words like this, is, and a).

You can use the _analyze API to test your tokenizer in isolation as well (note the tokenizer key below instead of analyzer). Because words_only_tokenizer is defined in your index settings, the request has to be made against the index. I recreated your configuration, and it produces the tokens below.

POST contextual/_analyze

{
    "tokenizer": "words_only_tokenizer",
    "text": "seasonal-christmas-halloween this is a description about halloween"
}

Result

{
    "tokens": [
        {
            "token": "seasonal-christmas-halloween",
            "start_offset": 0,
            "end_offset": 28,
            "type": "word",
            "position": 0
        },
        {
            "token": "this",
            "start_offset": 29,
            "end_offset": 33,
            "type": "word",
            "position": 1
        },
        {
            "token": "is",
            "start_offset": 34,
            "end_offset": 36,
            "type": "word",
            "position": 2
        },
        {
            "token": "a",
            "start_offset": 37,
            "end_offset": 38,
            "type": "word",
            "position": 3
        },
        {
            "token": "description",
            "start_offset": 39,
            "end_offset": 50,
            "type": "word",
            "position": 4
        },
        {
            "token": "about",
            "start_offset": 51,
            "end_offset": 56,
            "type": "word",
            "position": 5
        },
        {
            "token": "halloween",
            "start_offset": 57,
            "end_offset": 66,
            "type": "word",
            "position": 6
        }
    ]
}

You can see that the stop words are still present. The pattern [^\\w-] is a negated character class: it matches any character that is neither a word character nor a hyphen (here, the spaces), not the literal characters \, w, or -. So the tokenizer just breaks the text on whitespace and does not split on -.
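
If you want to experiment with the pattern itself without touching the index, the _analyze API also accepts an inline tokenizer definition; a quick sketch using the same pattern:

POST _analyze

{
    "tokenizer": {
        "type": "pattern",
        "pattern": "[^\\w-]"
    },
    "text": "seasonal-christmas-halloween this is a description about halloween"
}

This produces the same whitespace-separated tokens, hyphens intact.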

Now if you run the same text through the full analyzer, which also applies the token filters, the stop words are removed and you get the tokens below.

POST contextual/_analyze

{
    "analyzer": "words_only_analyser",
    "text": "seasonal-christmas-halloween this is a description about halloween"
}

Result

{
    "tokens": [
        {
            "token": "seasonal-christmas-halloween",
            "start_offset": 0,
            "end_offset": 28,
            "type": "word",
            "position": 0
        },
        {
            "token": "description",
            "start_offset": 39,
            "end_offset": 50,
            "type": "word",
            "position": 4
        },
        {
            "token": "about",
            "start_offset": 51,
            "end_offset": 56,
            "type": "word",
            "position": 5
        },
        {
            "token": "halloween",
            "start_offset": 57,
            "end_offset": 66,
            "type": "word",
            "position": 6
        }
    ]
}
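
If you want to see exactly which stage removes or changes each token, the _analyze API also accepts an "explain": true flag, which reports the output of the char filter, the tokenizer, and each token filter separately:

POST contextual/_analyze

{
    "analyzer": "words_only_analyser",
    "text": "seasonal-christmas-halloween this is a description about halloween",
    "explain": true
}

In that output you can see that this, is, and a survive the tokenizer and are only dropped at the stop_filter stage.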