Elastic Search - search the data ignoring periods or

Time:02-18

The Elasticsearch index contains documents with CPF numbers:

{
  "name": "A",
  "cpf": "718.881.683-23"
}

{
  "name": "B",
  "cpf": "404.833.187-60"
}

I want to search the data by the cpf field as follows:

query: 718
output: doc with name "A"
query: 718.881.683-23
output: doc with name "A"

The searches above work, but the following one does not:

query: 71888168323
expected output: doc with name "A"

I want to find the document by its cpf field even when the query contains no periods or hyphen.

CodePudding user response:

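The standard analyzer does not remove the punctuation inside the value. It splits 718.881.683-23 at the hyphen and, because its tokenizer follows the Unicode text segmentation rules, keeps digit groups joined by periods together, so the indexed tokens are 718.881.683 and 23. A prefix such as 718 or the fully formatted value can therefore still match, but 7188 or 71888168323 cannot, because no indexed token starts with those characters. You probably want to specify a different analyzer, for example one using the edge n-gram tokenizer.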

You can create a custom analyzer with a character filter, for example a pattern_replace like the following, which strips everything that is not a digit:

"my_char_filter": {
  "type": "pattern_replace",
  "pattern": "[^\\d]",
  "replacement": ""
}

and an edge n-gram tokenizer:

"my_tokenizer": {
  "type": "edge_ngram",
  "min_gram": 1,
  "max_gram": 11,
  "token_chars": [
    "digit"
  ]
}
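To see roughly what such an analyzer would index, here is a minimal Python sketch that imitates the two steps outside Elasticsearch. It is only an approximation for illustration: the function names strip_non_digits and edge_ngrams are made up here, and the real tokenizer runs inside the cluster.

```python
import re


def strip_non_digits(text):
    # Imitates the pattern_replace character filter: remove every non-digit.
    return re.sub(r"[^\d]", "", text)


def edge_ngrams(text, min_gram=1, max_gram=11):
    # Imitates an edge n-gram tokenizer on a digits-only string:
    # emit every leading prefix with a length from min_gram to max_gram.
    upper = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, upper + 1)]


tokens = edge_ngrams(strip_non_digits("718.881.683-23"))
print(tokens)
```

Under these assumptions the token list contains every prefix of 71888168323, so a query such as 7188 or the full 11 digits has an indexed token to match against.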

CodePudding user response:

You can add a custom analyzer that will remove all characters that are not digits and only index the digits.

The analyzer looks like this:

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "number_only": {
          "type": "pattern_replace",
          "pattern": "\\D"
        }
      },
      "analyzer": {
        "cpf_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "number_only"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "cpf": {
        "type": "text",
        "analyzer": "cpf_analyzer"
      }
    }
  }
}

Then you can index your documents as usual:

POST test/_doc
{
  "name": "A",
  "cpf": "718.881.683-23"
}

POST test/_doc
{
  "name": "B",
  "cpf": "404.833.187-60"
}

Searching for a prefix like 718 can be done like this:

POST test/_search
{
  "query": {
    "prefix": {
      "cpf": "718"
    }
  }
}

Searching for the exact value with non-digit characters can be done like this:

POST test/_search
{
  "query": {
    "match": {
      "cpf": "718.881.683-23"
    }
  }
}

And finally, you can also search with numbers only:

POST test/_search
{
  "query": {
    "match": {
      "cpf": "71888168323"
    }
  }
}

With the given analyzer, all the above queries will return the document you expect.
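Why all three queries match can be sketched in plain Python. This is only a simplified model, not how Elasticsearch is implemented: cpf_analyze, match and prefix are hypothetical helpers that imitate the keyword tokenizer plus the number_only filter, a match query (query text is analyzed first) and a prefix query (query text is used as-is).

```python
import re


def cpf_analyze(value):
    # keyword tokenizer: the whole value is one token;
    # number_only filter: strip every non-digit (\D) from that token.
    return re.sub(r"\D", "", value)


# One indexed token per document, as the analyzer would produce it.
indexed = {
    "A": cpf_analyze("718.881.683-23"),
    "B": cpf_analyze("404.833.187-60"),
}


def match(query):
    # A match query analyzes the query text, then compares tokens.
    wanted = cpf_analyze(query)
    return [name for name, token in indexed.items() if token == wanted]


def prefix(query):
    # A prefix query is not analyzed; it is compared to the token start.
    return [name for name, token in indexed.items() if token.startswith(query)]


print(prefix("718"))
print(match("718.881.683-23"))
print(match("71888168323"))
```

In this model a prefix that still contains a period, such as 718., would find nothing, because the prefix text is not run through the analyzer.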

If you cannot recreate your index, you can add a sub-field that uses the analyzer (which must already be defined in the index settings) and update your data in place:

PUT test/_mapping
{
  "properties": {
    "cpf": {
      "type": "text",
      "fields": {
        "numeric": {
          "type": "text",
          "analyzer": "cpf_analyzer"
        }
      }
    }
  }
}

Then run the following command, which reindexes all the data in place and populates the cpf.numeric field:

POST test/_update_by_query

All your searches will then need to be done on the cpf.numeric field instead of cpf directly.
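As a small illustration of that last point, the sketch below builds the search body in Python with the sub-field as the target. The helper name build_cpf_search is hypothetical, and sending the request to the cluster is left out.

```python
import json


def build_cpf_search(value, field="cpf.numeric"):
    # Build a match query body that targets the sub-field, not cpf itself.
    return {"query": {"match": {field: value}}}


body = build_cpf_search("71888168323")
print("POST test/_search")
print(json.dumps(body, indent=2))
```

The printed body has the same shape as the match queries used earlier, with only the field name changed.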
