My Elasticsearch index contains documents with CPF numbers:
{
"name": "A",
"cpf": "718.881.683-23"
}
{
"name": "B",
"cpf": "404.833.187-60"
}
I want to search the data by the cpf field as follows:
query: 718
output: doc with name "A"
query: 718.881.683-23
output: doc with name "A"
The searches above work, but the following one does not:
query: 71888168323
output: doc with name "A"
I also want to find the document by its CPF when the query contains only the digits, without the periods and hyphen.
CodePudding user response:
718.881.683-23 is tokenized to 718 881 683 23 by the standard analyzer. So by default, you will find document A with 718, with 718 881, or with 718 and 23, but not with 7188, as there is no such token in the field. You probably want to specify a different analyzer, for example one using the edge n-gram tokenizer.
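Before changing the mapping, it can help to check which tokens an analyzer actually emits. A minimal sketch using the _analyze API with the built-in standard analyzer is shown below. The exact split is worth verifying on your own cluster, because the standard tokenizer follows Unicode text segmentation rules and may treat a period between digits differently from a hyphen.

```console
# Lists the tokens the built-in standard analyzer produces for a CPF string
POST _analyze
{
  "analyzer": "standard",
  "text": "718.881.683-23"
}
```

Whatever the split turns out to be, the digits-only string 71888168323 will not appear among the tokens, which is why that query finds nothing.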
You can create a custom analyzer with a character filter, for example a pattern_replace filter like the following, which strips everything that is not a digit (note that the backslash must be escaped inside a JSON string):
"my_char_filter": {
"type": "pattern_replace",
"pattern": "[^\\d]",
"replacement": ""
}
and an edge n-gram tokenizer:
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 11,
"token_chars": [
"digit"
]
}
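The two fragments above could be combined into a complete index definition roughly as follows. This is a sketch under assumptions, not a tested setup: the index name cpf_test and the analyzer names cpf_index_analyzer and cpf_search_analyzer are made up for illustration, and the separate search analyzer is an addition that the fragments above do not include.

```console
# Hypothetical index name and analyzer names, chosen for illustration only
PUT cpf_test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "[^\\d]",
          "replacement": ""
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 11,
          "token_chars": ["digit"]
        }
      },
      "analyzer": {
        "cpf_index_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "my_tokenizer"
        },
        "cpf_search_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "cpf": {
        "type": "text",
        "analyzer": "cpf_index_analyzer",
        "search_analyzer": "cpf_search_analyzer"
      }
    }
  }
}
```

The search analyzer keeps the cleaned query as a single token. Without it, the query would also be split into edge n-grams, so a match query for 718 would also match any CPF that merely starts with 7.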
CodePudding user response:
You can add a custom analyzer that will remove all characters that are not digits and only index the digits.
The analyzer looks like this:
PUT test
{
"settings": {
"analysis": {
"filter": {
"number_only": {
"type": "pattern_replace",
"pattern": "\\D"
}
},
"analyzer": {
"cpf_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"number_only"
]
}
}
}
},
"mappings": {
"properties": {
"cpf": {
"type": "text",
"analyzer": "cpf_analyzer"
}
}
}
}
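To confirm the analyzer behaves as intended, you can run it through the _analyze API on the index created above. This is a small verification sketch; it assumes the index is named test, as in the request above.

```console
# Runs cpf_analyzer (defined in the test index) against a formatted CPF
POST test/_analyze
{
  "analyzer": "cpf_analyzer",
  "text": "718.881.683-23"
}
```

The response should contain a single token, 71888168323, since the keyword tokenizer keeps the whole value together and the number_only filter removes every non-digit character.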
Then you can index your documents as usual:
POST test/_doc
{
"name": "A",
"cpf": "718.881.683-23"
}
POST test/_doc
{
"name": "B",
"cpf": "404.833.187-60"
}
Searching for a prefix like 718 can be done like this:
POST test/_search
{
"query": {
"prefix": {
"cpf": "718"
}
}
}
Searching for the exact value with non-digit characters can be done like this:
POST test/_search
{
"query": {
"match": {
"cpf": "718.881.683-23"
}
}
}
And finally, you can also search with numbers only:
POST test/_search
{
"query": {
"match": {
"cpf": "71888168323"
}
}
}
With the given analyzer, all the above queries will return the document you expect.
If you cannot recreate your index for whatever reason, you can create a sub-field with the right analyzer and update your data in place:
PUT test/_mapping
{
"properties": {
"cpf": {
"type": "text",
"fields": {
"numeric": {
"type": "text",
"analyzer": "cpf_analyzer"
}
}
}
}
}
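The mapping update above assumes that cpf_analyzer is already defined in the settings of the existing index. If it is not, it has to be added first, and analysis settings can only be changed while the index is closed. A sketch of that preliminary step, reusing the analyzer definition from earlier and assuming the index is named test, to be run before the mapping update:

```console
# Close the index; it cannot serve reads or writes until it is reopened
POST test/_close

# Add the analyzer definition to the closed index
PUT test/_settings
{
  "analysis": {
    "filter": {
      "number_only": {
        "type": "pattern_replace",
        "pattern": "\\D"
      }
    },
    "analyzer": {
      "cpf_analyzer": {
        "type": "custom",
        "tokenizer": "keyword",
        "filter": ["number_only"]
      }
    }
  }
}

# Reopen the index so it is searchable again
POST test/_open
```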
Then run the following command, which reindexes all the data in place and populates the cpf.numeric field:
POST test/_update_by_query
All your searches will then need to be done on the cpf.numeric field instead of on cpf directly.
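For example, the digits-only search from earlier would look like this against the sub-field, assuming the same test index:

```console
# Same match query as before, now targeting the cpf.numeric sub-field
POST test/_search
{
  "query": {
    "match": {
      "cpf.numeric": "71888168323"
    }
  }
}
```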