Elastic: Treat symbol and html encoded symbol the same during search

Time:03-29

My goal is to return the same results whether the search uses the symbol itself or its HTML-encoded version (for example, ® versus &reg;).

Example Queries:

# searching with symbol
GET my-test-index/_search
{
  "query": {
    "bool": {
      "must": {
        "simple_query_string": {
          "query": "Hello®",
          "analyzer": "english_syn",
          "fields": [
            "AllContent"
          ]
        }
      }
    }
  }
}

# searching with the HTML-encoded symbol
GET my-test-index/_search
{
  "query": {
    "bool": {
      "must": {
        "simple_query_string": {
          "query": "Hello&reg;",
          "analyzer": "english_syn",
          "fields": [
            "AllContent"
          ]
        }
      }
    }
  }
}

I've tried a couple of different things.

First, I added synonyms, but the two queries still produced different results.

#######################################
# Synonyms
# Symbols
#######################################
™, &trade;
®, &reg;

Second, I created a char_filter that replaces special characters, so that both queries would at least search for "Hello". But that comes with its own set of issues, which are outside the scope of what I am trying to achieve.

"char_filter": {
    "specialCharactersFilter": {
        "type": "pattern_replace",
        "pattern": "[^A-Za-z0-9]",
        "replacement": " "
    }
}

I'd appreciate any suggestions for alternative ways to achieve this goal, ideally a solution that covers more than just ® and ™.

CodePudding user response:

What you are looking for is the html_strip char filter. It replaces HTML entities with their decoded characters, so it works not just for these two symbols but for a broad range of HTML-encoded characters.

Working example

Index settings and mapping with the html_strip char filter

PUT 71622637
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "char_filter": [
                        "html_strip"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }
    }
}
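
One way to check what this analyzer does is the _analyze API, which returns the tokens produced for a given text. The sketch below assumes the index was created under the name 71622637 (the name used in the requests that follow) and that the entity is written in full as &trade;. If html_strip is decoding the entity as expected, both requests should return the same tokens.

```
# analyze the literal symbol
GET 71622637/_analyze
{
  "analyzer": "my_analyzer",
  "text": "™"
}

# analyze the HTML-encoded form
GET 71622637/_analyze
{
  "analyzer": "my_analyzer",
  "text": "&trade;"
}
```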

Index a sample document whose title contains only the trademark symbol (™)

PUT 71622637/_doc/1
{
   "title" : "™"
}

Search using the HTML-encoded version

GET 71622637/_search
{
    "query" :{
        "match" : {
            "title" : "&trade"
        }
    }
}

And the search result

"hits": [
    {
        "_index": "71622637",
        "_id": "1",
        "_score": 0.89701396,
        "_source": {
            "title": "™"
        }
    }
]

Similarly, search using the trademark symbol itself

GET 71622637/_search
{
    "query" :{
        "match" : {
            "title" : "™"
        }
    }
}

And the search result

"hits": [
    {
        "_index": "71622637",
        "_id": "1",
        "_score": 0.89701396,
        "_source": {
            "title": "™"
        }
    }
]
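
Applied to the setup in the question, the same char filter would go into the analyzer used for the AllContent field. The following is only a sketch: the question does not show how english_syn is defined, so the tokenizer and token filters here are placeholders, and my-test-index-v2 is a hypothetical index name. Existing documents would need to be reindexed so that they are analyzed with the new char filter.

```
# hypothetical new index; tokenizer and filter are placeholders,
# keep whatever english_syn already uses (synonyms, stemming, ...)
PUT my-test-index-v2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_syn": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "AllContent": {
        "type": "text",
        "analyzer": "english_syn"
      }
    }
  }
}
```

Because the simple_query_string queries in the question name english_syn as their analyzer, the query text would pass through html_strip as well, so Hello® and Hello&reg; should both reach the tokenizer as Hello®.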