There are some examples available on the internet for customizing the _id field of an Elasticsearch document, but is there a way to generate a composite _id from multiple fields?
Sample data:
{
  "first_name": "john",
  "last_name": "doe",
  "dob": "1987-12-21",
  "phone": "7894456123",
  "so": "on"...
}
How can I configure the ingest pipeline to generate the _id from the join of the first 4 fields, which for this use case are considered the composite primary key?
Things to take care of:
- There is a size limit on _id (512 bytes), and the join of the 4 fields could exceed it at any time.
- Some kind of separator is needed so that two docs with different field values can't end up with the same joined value.
I considered using a hashing algorithm like MD5 or SHA256, which can generate fixed-length _ids from "|".join(first,last,dob,phone), but I was not able to implement it in the ingest pipeline.
This is not a security concern, as we are only trying to define a primary key and the indexes roll over on a monthly basis.
So if we can find a storage-efficient _id value, that is preferred.
If there are other ways to achieve this use case, please suggest them.
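For reference, the hashing idea described above can be sketched on the client side roughly as follows. This is only an illustration of the approach, not an ingest pipeline: the helper name `composite_id`, the `|` separator, and the URL-safe Base64 encoding with padding stripped are all assumptions made for the example, and it assumes `|` never occurs inside the field values themselves.

```python
import base64
import hashlib

# Key fields taken from the sample data above.
KEY_FIELDS = ("first_name", "last_name", "dob", "phone")


def composite_id(doc, fields=KEY_FIELDS, sep="|"):
    """Hypothetical helper: hash the joined key fields into a fixed-length ID.

    The separator keeps ("ab", "c") and ("a", "bc") from producing the same
    joined string, as long as the separator itself is not part of any value.
    """
    joined = sep.join(str(doc[name]) for name in fields)
    digest = hashlib.md5(joined.encode("utf-8")).digest()  # always 16 bytes
    # URL-safe Base64 without padding: 22 characters for a 16-byte digest.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


doc = {
    "first_name": "john",
    "last_name": "doe",
    "dob": "1987-12-21",
    "phone": "7894456123",
    "so": "on",
}

print(composite_id(doc))
```

The resulting ID has the same length no matter how long the joined fields are, and fields outside the key (such as "so") do not affect it.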
CodePudding user response:
Enter the fingerprint ingest processor (available since ES 7.12.0).
You can define an ingest pipeline using that processor and set the _id field as you expect:
PUT _ingest/pipeline/id-fingerprint
{
  "processors": [
    {
      "fingerprint": {
        "fields": ["first_name", "last_name", "dob", "phone"],
        "target_field": "_id",
        "method": "MD5"
      }
    }
  ]
}
Then, when you index your document, you can simply reference that pipeline:
PUT test/_doc/1?pipeline=id-fingerprint
{
  "first_name": "john",
  "last_name": "doe",
  "dob": "1987-12-21",
  "phone": "7894456123",
  "so": "on"
}
Results =>
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "Xu28Onz3lbYCG0DrTTVp6Q==",     <--- the generated ID
  "_source" : {
    "phone" : "7894456123",
    "dob" : "1987-12-21",
    "last_name" : "doe",
    "so" : "on",
    "first_name" : "john"
  }
}
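Since the question asks for a storage-efficient _id, it may help to compare how long the ID gets with each hash method. The generated ID above is 24 characters long, which matches a 16-byte MD5 digest in padded Base64. The sketch below uses only Python's hashlib to measure that encoded length for MD5, SHA-1, SHA-256 and SHA-512. It assumes the processor emits standard padded Base64 for every method (as the `==` suffix above suggests), and the input string is just a placeholder, since the digest length does not depend on the input.

```python
import base64
import hashlib


def encoded_length(algorithm: str) -> int:
    """Length of the padded Base64 encoding of a digest for the given algorithm."""
    # Placeholder input; only the digest size matters for the result.
    digest = hashlib.new(algorithm, b"john|doe|1987-12-21|7894456123").digest()
    return len(base64.b64encode(digest))


for name in ("md5", "sha1", "sha256", "sha512"):
    print(name, encoded_length(name))
# md5 24
# sha1 28
# sha256 44
# sha512 88
```

So among these, MD5 gives the shortest _id, which fits the use case here because the hash is only used as a key and not for security.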