Convert two repeated values in array into a string

I have some old documents where a field holds an array with the same value repeated twice, something like this:

          "task" : [
            "first_task",
            "first_task"
          ],

I'm trying to convert this array into a string, since both elements hold the same value. I've seen the question "Convert array with 2 equal values to single value", but in my case the problem can't be fixed through Logstash because it only affects old documents that are already stored.

I was thinking of doing something like this:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "description": "Change task field from array to first element of this one",
          "lang": "painless",
          "source": """
            if (ctx['task'][0] == ctx['task'][1]) {
                ctx['task'] = ctx['task'][0];
            }
          """
        }
      }
    ]
  },
  "docs": [
    {
        "_index" : "tasks",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "@timestamp" : "2022-05-03T07:33:44.652Z",
          "task" : ["first_task", "first_task"]
        }
    }
  ]
}

The resulting document is the following:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "tasks",
        "_type" : "_doc",
        "_id" : "1",
        "_source" : {
          "@timestamp" : "2022-05-03T07:33:44.652Z",
          "task" : "first_task"
        },
        "_ingest" : {
          "timestamp" : "2022-05-11T09:08:48.150815183Z"
        }
      }
    }
  ]
}

We can see that the task field is reassigned and now holds the first element of the array as its value.

Is there a way to modify the data actually stored in Elasticsearch and convert all the documents with this characteristic using DSL queries?

Thanks.

CodePudding user response:

You can achieve this with the _update_by_query endpoint. Here is an example:

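POST tasks/_update_by_query
{
  "script": {
    "source": """
      def task = ctx._source['task'];
      // Only collapse a two-element array whose values are equal
      if (task instanceof List && task.size() == 2 && task[0] == task[1]) {
          ctx._source['task'] = task[0];
      } else {
          // Skip documents where task is missing, already a string, or differs
          ctx.op = 'noop';
      }
    """,
    "lang": "painless"
  },
  "query": {
    "match_all": {}
  }
}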

Since match_all is the default, you can omit the query when you want to update all documents, or you can restrict the update to a subset by changing the conditions in the query.
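
As a rough sketch of the filtering option, the request below limits the update to documents that have a task field and a timestamp before a cutoff. The cutoff date is a hypothetical placeholder, not a value from the question, and the sketch assumes @timestamp is mapped as a date. The script also checks that task is a two-element list, so documents that already hold a string are skipped instead of causing a script error.

```console
# NOTE: the cutoff "2022-06-01T00:00:00Z" is a made-up example value;
# replace it with the date after which documents are known to be correct.
POST tasks/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": """
      def task = ctx._source['task'];
      // Collapse only two-element arrays with equal values
      if (task instanceof List && task.size() == 2 && task[0] == task[1]) {
        ctx._source['task'] = task[0];
      } else {
        // Leave any other document unchanged
        ctx.op = 'noop';
      }
    """
  },
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "task" } },
        { "range": { "@timestamp": { "lt": "2022-06-01T00:00:00Z" } } }
      ]
    }
  }
}
```

Putting both conditions in a bool filter avoids scoring work, and setting ctx.op to 'noop' means documents that pass the query but do not need the change are not reindexed.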

Keep in mind that every document the script changes is reindexed, so running it against all documents in the index can put noticeable load on the cluster while the update is in progress.
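
One way to soften that impact, sketched here with standard _update_by_query URL parameters, is to throttle the request and run it as a background task. The value 500 is an arbitrary illustration rather than a recommendation, and <task_id> stands for the identifier returned in the response to the first request.

```console
# requests_per_second=500 is an arbitrary example throttle value.
# wait_for_completion=false returns a task id instead of blocking.
# conflicts=proceed keeps going if a document changed in the meantime.
POST tasks/_update_by_query?requests_per_second=500&wait_for_completion=false&conflicts=proceed
{
  "script": {
    "lang": "painless",
    "source": """
      def task = ctx._source['task'];
      if (task instanceof List && task.size() == 2 && task[0] == task[1]) {
        ctx._source['task'] = task[0];
      } else {
        ctx.op = 'noop';
      }
    """
  }
}

# Check progress; replace <task_id> with the "task" value from the response above.
GET _tasks/<task_id>
```

Running the simulated pipeline from the question against a few sample documents first is still a cheap way to confirm the script logic before touching stored data.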