I want to change an Elasticsearch mapped property from the text type to the ip type, but they are incompatible types.
I know the accepted answer is to:
- create a new index with the correct mapping;
- run the _reindex API to copy the data over;
- make a cup of coffee;
- delete the original index;
- create an alias from the original to the new index.
However, I'm dealing with half a billion records in a 363GB index, with potentially hundreds of thousands of new records being written to it every day. The process above would involve at least some downtime while the old index is deleted and the new alias created, and any records written between the end of the reindexing and that switchover would be lost. (And the reindexing itself would take hours, if not days.) It would be much better to do the transform in place, perhaps by creating a new field and copying the old one over. Any ideas?
CodePudding user response:
It's actually much easier than that. The process is as follows:
A. Modify your mapping to add an ip sub-field to your existing field, e.g.:
PUT index/_mapping
{
  "properties": {
    "your_ip_field": {
      "type": "text",
      "fields": {
        "ip": {
          "type": "ip"
        }
      }
    }
  }
}
B. Call index/_update_by_query?wait_for_completion=false&slices=auto on your index.
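Spelled out as a Dev Tools Console request, that's simply (no request body is needed; it touches every document):
POST index/_update_by_query?wait_for_completion=false&slices=auto
Because wait_for_completion=false is set, the call returns a task ID immediately instead of blocking until all documents are rewritten.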
C. Nothing here, that's all there is to it.
The call to _update_by_query will simply take each _source document and reindex it upon itself (i.e. as a new version), and the ip sub-field will then be available for your queries.
All new documents being written to your index every day will already use the new mapping and the old ones will be updated during the update call. No alias switching needed, no reindex, either.
When the update is finished, you'll be able to reference the your_ip_field.ip field to use the ip version of it.
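For example, here is a sketch of a query against the new sub-field (the CIDR range is just an illustration); a term query on an ip field accepts CIDR notation, which the plain text version can't support:
GET index/_search
{
  "query": {
    "term": {
      "your_ip_field.ip": "192.168.0.0/16"
    }
  }
}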
The downside of this is that you'll still have your_ip_field as tokenized text in your inverted index, which you might not need, but you can't have it all. There are more complex solutions to make it right, but this one is easy and allows you to get going.
CodePudding user response:
Indeed, re-indexing such a big index is not a good idea. I would recommend creating a new field, using an ingest pipeline to populate it from the old one, and making the changes on the application side so it reads the new field with the correct data type. (You can use _update_by_query to add the new field to existing documents.)
Another option would be to use runtime fields (introduced in ES 7.11); in this case the parsing is shifted from index time to query time.
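A minimal sketch of the runtime-field approach, assuming the source field is a keyword field named user_ip (runtime scripts read doc values, so a plain text field would first need a keyword sub-field) and the name ip_addr is just illustrative:
PUT my_index/_mapping
{
  "runtime": {
    "ip_addr": {
      "type": "ip",
      "script": {
        "source": "emit(doc['user_ip'].value)"
      }
    }
  }
}
No documents are rewritten; the value is computed by the script on every query that touches ip_addr.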
Regarding re-indexing, you can implement zero-downtime reindexing in ES by using read and write aliases; for more information check here.
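The alias technique boils down to having clients talk to aliases rather than concrete indices, then swapping the aliases once the new index is ready; the index names below are illustrative:
POST _aliases
{
  "actions": [
    { "remove": { "index": "my_index_v1", "alias": "my_index" } },
    { "add": { "index": "my_index_v2", "alias": "my_index" } }
  ]
}
All actions in a single _aliases request are applied atomically, so readers never see a moment where the alias points at nothing.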
CodePudding user response:
Thanks to @Kaveh, I used the ingest pipeline/_update_by_query solution.
I'll describe the solution as an answer for anyone who is looking for it.
0. I'm converting a keyword field named "user_ip" to an IP field named "ip_addr" on "my_index".
You can do most of this stuff through Kibana, but I'll use the Elasticsearch Dev Tools Console style so you can run it all there if you like.
1. Create the ingest pipeline
This was simple to do with Kibana, as per these instructions. The ingest pipeline is named ip_transform. If you want to run it as code, it looks as follows:
PUT _ingest/pipeline/ip_transform
{
  "processors": [
    {
      "convert": {
        "field": "user_ip",
        "type": "ip",
        "target_field": "ip_addr",
        "ignore_missing": true
      }
    }
  ]
}
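Before relying on the pipeline anywhere, you can dry-run it against a sample document with the _simulate API (the IP address below is made up):
POST _ingest/pipeline/ip_transform/_simulate
{
  "docs": [
    {
      "_source": {
        "user_ip": "192.168.1.100"
      }
    }
  ]
}
The response should show the document with both user_ip and the new ip_addr field populated.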
2. Create the field in the index
PUT my_index/_mapping
{
  "properties": {
    "ip_addr": {
      "type": "ip"
    }
  }
}
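You can verify the new field is in place with the field mapping API:
GET my_index/_mapping/field/ip_addr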
3. Set this as the default pipeline for this index
This isn't strictly necessary; you could add ?pipeline=ip_transform to your indexing calls instead, but I didn't want to change my code and I always want this pipeline to run.
PUT my_index/_settings
{
  "index.default_pipeline": "ip_transform"
}
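From now on, any document indexed without an explicit ?pipeline parameter still goes through ip_transform, so an ordinary index request like this one (sample document) ends up with ip_addr populated automatically:
POST my_index/_doc
{
  "user_ip": "10.20.30.40"
}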
4. Run the pipeline on existing records
We use the _update_by_query API to run the pipeline against each existing record.
POST my_index/_update_by_query?pipeline=ip_transform&wait_for_completion=false
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "ip_addr"
        }
      }
    }
  }
}
This is going to take a while with a big data set, but you can take the returned task ID and query whether it's complete:
GET _tasks/<Job ID>
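The task's status includes a completed flag you can poll; a heavily abridged response once the job is done looks something like:
{
  "completed": true,
  "task": {
    "description": "update-by-query [my_index]"
  }
}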