Enriching the Data in Elasticsearch

Time:04-26

We will be ingesting data into an index (Index1). However, one of the fields in the document (field1) is an ENUM value, which needs to be converted into a string value via a lookup through a REST API call. The REST API call returns a JSON response like this, which contains the string values for all the ENUMs:

{
"values" : {
"ENUMVALUE1" : "StringValue1",
"ENUMVALUE2" : "StringValue2"
}
}

I am thinking of creating an index from this response document and using it for the lookup. The incoming document has field1 set to either ENUMVALUE1 or ENUMVALUE2 (only one of them), and we want to eventually store StringValue1 or StringValue2 in field1 instead of the ENUM value.

I went through the documentation of the enrich processor, but I am not sure whether it is the right approach for this scenario. When defining the match enrich policy, I am not sure how match_field and enrich_fields should be configured.

Could you please advise whether this can be done in Elasticsearch and, if so, what other options I have if the approach above is not optimal?

Answer:

OK, with only 150-200 enums an enrich index may be more than you need, but here is a potential solution.

You first need to build the source index containing all enum mappings. It would look like this:

POST enums/_bulk
{"index":{}}
{"enum_id": "ENUMVALUE1", "string_value": "StringValue1"}
{"index":{}}
{"enum_id": "ENUMVALUE2", "string_value": "StringValue2"}
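If the enum mappings come from the REST lookup response shown in the question, the bulk body does not have to be written by hand. The following is a minimal sketch, assuming the response has already been parsed into a Python dict of the shape {"values": {...}}. The function name build_bulk_body is hypothetical, and using the enum as the document _id (so that re-running the load overwrites rather than duplicates entries) is my own choice, not something the request above does.

```python
import json


def build_bulk_body(lookup_response: dict) -> str:
    """Turn {"values": {enum: string}} into an NDJSON body for POST enums/_bulk."""
    lines = []
    for enum_id, string_value in lookup_response["values"].items():
        # Assumption: the enum itself is used as _id, so reloading is idempotent.
        lines.append(json.dumps({"index": {"_id": enum_id}}))
        lines.append(json.dumps({"enum_id": enum_id, "string_value": string_value}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


response = {"values": {"ENUMVALUE1": "StringValue1", "ENUMVALUE2": "StringValue2"}}
bulk_body = build_bulk_body(response)
print(bulk_body, end="")
```

The printed text can be pasted after POST enums/_bulk or sent with any HTTP client using the Content-Type application/x-ndjson.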

Then you need to create an enrich policy out of this index:

PUT /_enrich/policy/enum-policy
{
  "match": {
    "indices": "enums",
    "match_field": "enum_id",
    "enrich_fields": [
      "string_value"
    ]
  }
}
POST /_enrich/policy/enum-policy/_execute

Once the policy has been executed (with 200 values it should take a few seconds), you can start building your ingest pipeline using an enrich processor:

PUT _ingest/pipeline/enum-pipeline
{
  "description": "Enum enriching pipeline",
  "processors": [
    {
      "enrich" : {
        "policy_name": "enum-policy",
        "field" : "field1",
        "target_field": "tmp"
      }
    },
    {
      "set": {
        "if": "ctx.tmp != null",
        "field": "field1",
        "value": "{{tmp.string_value}}"
      }
    },
    {
      "remove": {
        "if": "ctx.tmp != null",
        "field": "tmp"
      }
    }
  ]
}

Testing this pipeline, we get this:

POST _ingest/pipeline/enum-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "field1": "ENUMVALUE1"
      }
    },
    {
      "_source": {
        "field1": "ENUMVALUE4"
      }
    }
  ]
}

Results =>

{
  "docs" : [
    {
      "doc" : {
        "_source" : {
          "field1" : "StringValue1"        <--- value has been replaced
        }
      }
    },
    {
      "doc" : {
        "_source" : {
          "field1" : "ENUMVALUE4"          <--- value has NOT been replaced
        }
      }
    }
  ]
}
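To make the behaviour in these results easier to follow, here is a rough plain-Python model of the three processors (enrich, set, remove). It is only an illustration of the logic, not how Elasticsearch implements it: ENRICH_INDEX and run_pipeline are hypothetical names, and the dict stands in for the enrich index built from the enums index.

```python
# Stand-in for the enrich index: keyed by match_field (enum_id).
ENRICH_INDEX = {
    "ENUMVALUE1": {"enum_id": "ENUMVALUE1", "string_value": "StringValue1"},
    "ENUMVALUE2": {"enum_id": "ENUMVALUE2", "string_value": "StringValue2"},
}


def run_pipeline(doc: dict) -> dict:
    ctx = dict(doc)

    # enrich: look up field1; on a match, store the matched entry under "tmp".
    match = ENRICH_INDEX.get(ctx["field1"])
    if match is not None:
        ctx["tmp"] = match

    # set: only runs when the lookup matched (if: ctx.tmp != null).
    if ctx.get("tmp") is not None:
        ctx["field1"] = ctx["tmp"]["string_value"]

    # remove: drop the temporary field again, under the same condition.
    if ctx.get("tmp") is not None:
        del ctx["tmp"]

    return ctx


print(run_pipeline({"field1": "ENUMVALUE1"}))
print(run_pipeline({"field1": "ENUMVALUE4"}))
```

The first call returns the document with field1 replaced, while the second leaves it untouched because no entry matches ENUMVALUE4.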

For the sake of completeness, I'm sharing the other solution without enrich index, so you can test both and use whichever makes most sense for you.

In this second option, we're simply going to use an ingest pipeline with a script processor whose parameters contain a map of your enums. field1 will be replaced by whatever value is mapped to the enum value it contains, or will keep its value if there's no corresponding enum value.

PUT _ingest/pipeline/enum-pipeline
{
  "description": "Enum enriching pipeline",
  "processors": [
    {
      "script": {
        "source": """
          ctx.field1 = params.getOrDefault(ctx.field1, ctx.field1);
        """,
        "params": {
          "ENUMVALUE1": "StringValue1",
          "ENUMVALUE2": "StringValue2",
          ... // add all your enums here
        }
      }
    }
  ]
}
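With 150-200 enums, typing the params map by hand is error-prone. As a minimal sketch, assuming the REST lookup response from the question is available as a Python dict, the pipeline body could be generated like this. The function name build_script_pipeline and the field argument are hypothetical helpers introduced for illustration.

```python
import json


def build_script_pipeline(lookup_response: dict, field: str = "field1") -> dict:
    """Build the script-processor pipeline body from {"values": {enum: string}}."""
    # Same Painless one-liner as in the pipeline above, parameterised by field name.
    source = f"ctx.{field} = params.getOrDefault(ctx.{field}, ctx.{field});"
    return {
        "description": "Enum enriching pipeline",
        "processors": [
            {
                "script": {
                    "source": source,
                    # The whole enum-to-string map becomes the script params.
                    "params": dict(lookup_response["values"]),
                }
            }
        ],
    }


response = {"values": {"ENUMVALUE1": "StringValue1", "ENUMVALUE2": "StringValue2"}}
pipeline = build_script_pipeline(response)
print(json.dumps(pipeline, indent=2))
```

The printed JSON can be used as the body of PUT _ingest/pipeline/enum-pipeline, and regenerated whenever the lookup response changes.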

Testing this pipeline, we get this:

POST _ingest/pipeline/enum-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "field1": "ENUMVALUE1"
      }
    },
    {
      "_source": {
        "field1": "ENUMVALUE4"
      }
    }
  ]
}

Results =>

{
  "docs" : [
    {
      "doc" : {
        "_source" : {
          "field1" : "StringValue1"        <--- value has been replaced
        }
      }
    },
    {
      "doc" : {
        "_source" : {
          "field1" : "ENUMVALUE4"          <--- value has NOT been replaced
        }
      }
    }
  ]
}

So both solutions would work for your case; you just need to pick the one that fits best. Just know that with the first option, if your enums change, you'll need to update the source index and re-execute the enrich policy, whereas with the second, you only need to modify the parameters map of your pipeline.
