I want to change an Elasticsearch mapped property from the text type to the ip type, but they are incompatible types.
I know the accepted answer is to:
- create a new index with the correct mapping;
- run the _reindex API to copy the data over;
- make a cup of coffee;
- delete the original index;
- create an alias from the original to the new index.
However, I'm dealing with half a billion records in a 363GB index, with potentially hundreds of thousands of new records being written to it every day. The process above would involve at least some downtime while the old index is deleted and the new alias created, and any records written between the end of the reindexing and that switchover would be lost. (And the reindexing itself would take hours, if not days.) It would be much better to do the transform in place, perhaps by creating a new field and copying the old one over. Any ideas?
CodePudding user response:
It's actually much easier than that. The process is as follows:
A. Modify your mapping to add an ip sub-field to your existing field, e.g.:
PUT index/_mapping
{
  "properties": {
    "your_ip_field": {
      "type": "text",
      "fields": {
        "ip": {
          "type": "ip"
        }
      }
    }
  }
}
B. Call index/_update_by_query?wait_for_completion=false&slices=auto on your index.
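Spelled out as a Dev Tools Console request, that's simply (no request body is needed; it touches every document):
POST index/_update_by_query?wait_for_completion=false&slices=auto
Because wait_for_completion=false is set, the call returns a task ID immediately instead of blocking until all documents are rewritten.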
C. Nothing here, that's all there is to it.
The call to _update_by_query will simply take each _source document and reindex it upon itself (i.e. as a new version), and the ip sub-field will then be available for your queries.
All new documents being written to your index every day will already use the new mapping and the old ones will be updated during the update call. No alias switching needed, no reindex, either.
When the update is finished, you'll be able to reference the your_ip_field.ip field to use the ip version of it.
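For example, here is a sketch of a query against the new sub-field (the CIDR range is just an illustration); a term query on an ip field accepts CIDR notation, which the plain text version can't support:
GET index/_search
{
  "query": {
    "term": {
      "your_ip_field.ip": "192.168.0.0/16"
    }
  }
}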
The downside of this is that you'll still have your_ip_field as tokenized text in your inverted index, which you might not need, but you can't have it all. There are more complex solutions to make it right, but this one is easy and allows you to get going.
CodePudding user response:
Indeed, re-indexing such a big index is not a good idea. I would recommend creating a new field, using an ingest pipeline to populate it from the old one, and making the changes on the application side so it reads the new field with the correct data type. (You can use _update_by_query to add the new field to existing documents.)
Another option would be to use runtime fields (introduced in ES 7.11); in this case the parsing is shifted from index time to query time.
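A minimal sketch of the runtime-field approach, assuming the source field is a keyword field named user_ip (runtime scripts read doc values, so a plain text field would first need a keyword sub-field) and the name ip_addr is just illustrative:
PUT my_index/_mapping
{
  "runtime": {
    "ip_addr": {
      "type": "ip",
      "script": {
        "source": "emit(doc['user_ip'].value)"
      }
    }
  }
}
No documents are rewritten; the value is computed by the script on every query that touches ip_addr.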
Regarding re-indexing, you can implement zero-downtime reindexing in ES by using read and write aliases; for more information check here.
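The alias technique boils down to having clients talk to aliases rather than concrete indices, then swapping the aliases once the new index is ready; the index names below are illustrative:
POST _aliases
{
  "actions": [
    { "remove": { "index": "my_index_v1", "alias": "my_index" } },
    { "add": { "index": "my_index_v2", "alias": "my_index" } }
  ]
}
All actions in a single _aliases request are applied atomically, so readers never see a moment where the alias points at nothing.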
CodePudding user response:
Thanks to @Kaveh, I used the ingest pipeline/_update_by_query solution.
I'll describe the solution as an answer for anyone who is looking for it.
0. I'm converting a keyword field named "user_ip" to an IP field named "ip_addr" on "my_index".
You can do most of this stuff through Kibana, but I'll use the Elasticsearch Dev Tools Console style so you can run it all there if you like.
1. Create the ingest pipeline
This was simple to do with Kibana, as per these instructions. The ingest pipeline is named ip_transform. If you want to run it as code, it looks as follows:
PUT _ingest/pipeline/ip_transform
{
  "processors": [
    {
      "convert": {
        "field": "user_ip",
        "type": "ip",
        "target_field": "ip_addr",
        "ignore_missing": true
      }
    }
  ]
}
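Before relying on the pipeline anywhere, you can dry-run it against a sample document with the _simulate API (the IP address below is made up):
POST _ingest/pipeline/ip_transform/_simulate
{
  "docs": [
    {
      "_source": {
        "user_ip": "192.168.1.100"
      }
    }
  ]
}
The response should show the document with both user_ip and the new ip_addr field populated.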
2. Create the field in the index
PUT my_index/_mapping
{
  "properties": {
    "ip_addr": {
      "type": "ip"
    }
  }
}
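You can verify the new field is in place with the field mapping API:
GET my_index/_mapping/field/ip_addr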
3. Set this as the default pipeline for this index
This isn't strictly necessary; you could add ?pipeline=ip_transform to your indexing calls instead, but I didn't want to change my code and I always want this pipeline to run.
PUT my_index/_settings
{
  "index.default_pipeline": "ip_transform"
}
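From now on, any document indexed without an explicit ?pipeline parameter still goes through ip_transform, so an ordinary index request like this one (sample document) ends up with ip_addr populated automatically:
POST my_index/_doc
{
  "user_ip": "10.20.30.40"
}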
4. Run the pipeline on existing records
We use the _update_by_query API to run the pipeline against each existing record.
POST my_index/_update_by_query?pipeline=ip_transform&wait_for_completion=false
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "ip_addr"
        }
      }
    }
  }
}
This is going to take a while with a big data set, but you can take the returned task ID and query whether it's complete:
GET _tasks/<Job ID>
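The task's status includes a completed flag you can poll; a heavily abridged response once the job is done looks something like:
{
  "completed": true,
  "task": {
    "description": "update-by-query [my_index]"
  }
}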