Elasticsearch _id as MD5 hash of document fields

There are some examples available on the internet that customize the _id field of an Elasticsearch document, but is there a way to generate a composite _id from multiple fields?

Sample Data

{
  "first_name": "john",
  "last_name": "doe",
  "dob": "1987-12-21",
  "phone": "7894456123",
  "so": "on"...
}

How can I configure the ingest pipeline to generate the _id from the join of the first 4 fields, which for this use case are considered the composite primary key?

Things to take care of:

There is a character limit on _id, and the join of the 4 fields can exceed it at any time.
The join should use some kind of separator, so that there can't be 2 docs with different field values but the same joined value.
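Both concerns above can be illustrated with a small client-side sketch. The helper names (naive_join, join_escaped, fits_id_limit) are hypothetical, and the 512-byte figure is the _id length limit as I understand it, so treat it as an assumption to verify against your Elasticsearch version.

```python
# Illustrative sketch only; helper names are hypothetical.
MAX_ID_BYTES = 512  # assumed Elasticsearch _id limit, in bytes


def naive_join(fields):
    # Concatenation without a separator: different field values can collide.
    return "".join(fields)


def join_escaped(fields, sep="|"):
    # Escape the escape character first, then the separator, so a separator
    # inside a field value cannot be mistaken for a field boundary.
    escaped = [f.replace("\\", "\\\\").replace(sep, "\\" + sep) for f in fields]
    return sep.join(escaped)


def fits_id_limit(key):
    # The limit is measured in bytes, not characters.
    return len(key.encode("utf-8")) <= MAX_ID_BYTES


doc_a = ["john", "doe", "1987-12-21", "7894456123"]
doc_b = ["johnd", "oe", "1987-12-21", "7894456123"]

print(naive_join(doc_a) == naive_join(doc_b))      # collision without a separator
print(join_escaped(doc_a) == join_escaped(doc_b))  # distinct with a separator
print(fits_id_limit(join_escaped(doc_a)))
```

A joined key like this is unambiguous but still unbounded in length, which is why hashing it down to a fixed size is attractive.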

I considered using a hashing algorithm like MD5 or SHA256, which can generate fixed-length _ids from "|".join(first,last,dob,phone), but I was not able to implement it in the ingestion pipeline.
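For reference, the hashing idea can be sketched client-side in Python. This is only an illustration of the fixed-length property: the function name composite_id is hypothetical, it assumes the field values do not contain "|", and its output is not expected to match what an Elasticsearch ingest processor would produce for the same document.

```python
import base64
import hashlib


def composite_id(first, last, dob, phone, method="md5"):
    # Join the key fields with a separator (assumes no "|" inside the values).
    joined = "|".join([first, last, dob, phone])
    digest = hashlib.new(method, joined.encode("utf-8")).digest()
    # URL-safe base64 without padding is more compact than hex.
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


md5_id = composite_id("john", "doe", "1987-12-21", "7894456123")
sha_id = composite_id("john", "doe", "1987-12-21", "7894456123", method="sha256")

print(len(md5_id))  # 22 characters for a 16-byte MD5 digest
print(len(sha_id))  # 43 characters for a 32-byte SHA-256 digest
```

Because the ID is derived only from the key fields, indexing the same person twice yields the same _id, so the second write replaces the first instead of creating a duplicate.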

This is not a security concern, as we are only trying to define a primary key and the indexes roll over on a monthly basis.

So if we can find a storage-efficient _id value, that is preferred.

If there are other ways to achieve the use case, please suggest them.

CodePudding user response:

Enter the fingerprint ingest processor (since ES 7.12.0).

You can define an ingest pipeline using that processor and set the _id field as you expect:

PUT _ingest/pipeline/id-fingerprint
{
  "processors": [
    {
      "fingerprint": {
        "fields": ["first_name", "last_name", "dob", "phone"],
        "target_field": "_id",
        "method": "MD5"
      }
    }
  ]
}

Then when you index your document, you can simply reference that pipeline:

PUT test/_doc/1?pipeline=id-fingerprint
{
  "first_name": "john",
  "last_name": "doe",
  "dob": "1987-12-21",
  "phone": "7894456123",
  "so": "on"
}
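The same request could be issued from a script. The sketch below uses only the Python standard library and builds the request without sending it. The host http://localhost:9200 and the helper name build_index_request are assumptions for illustration; the index name, pipeline name, and placeholder ID are taken from the request above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured cluster


def build_index_request(index, doc, pipeline, placeholder_id="1"):
    # The ID in the URL is only a placeholder; the pipeline sets the final _id.
    url = f"{ES_URL}/{index}/_doc/{placeholder_id}?pipeline={pipeline}"
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


doc = {
    "first_name": "john",
    "last_name": "doe",
    "dob": "1987-12-21",
    "phone": "7894456123",
    "so": "on",
}
req = build_index_request("test", doc, "id-fingerprint")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```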

Results =>

{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "Xu28Onz3lbYCG0DrTTVp6Q==",      <--- the generated ID
  "_source" : {
    "phone" : "7894456123",
    "dob" : "1987-12-21",
    "last_name" : "doe",
    "so" : "on",
    "first_name" : "john"
  }
}
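On the storage-efficiency point, the generated ID shown above looks like standard base64, and decoding it gives 16 bytes, which is consistent with an MD5 digest. A quick check in Python (a sketch that only inspects the string from the result above):

```python
import base64

generated_id = "Xu28Onz3lbYCG0DrTTVp6Q=="  # copied from the result above

raw = base64.b64decode(generated_id)
hex_form = raw.hex()

# Raw digest size, base64 length, and the length a hex encoding would need.
print(len(raw), len(generated_id), len(hex_form))  # 16 24 32
```

So the base64 form costs 24 characters per _id, compared with 32 for a hex-encoded MD5; a longer digest such as SHA-256 would roughly double both.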