Elastic Search reindex API - how to preserve destination index mapping?

Time:10-22

Let my-index-0 be an ES index with an alias of my-index.

It has the following mapping:

{
    "my-index-0": {
        "aliases": {
            "my-index": {}
        },
        "mappings": {
            "doc": {
                "properties": {
                    "foo": {
                        "properties": {
                            "fizz": {
                                "type": "keyword"
                            },
                            "baz": {
                                "type": "keyword"
                            }
                        }
                    }
                }
            }
        }
    }
}

Let's say I want to remove the baz field from foo. I'm using the following steps:

  1. Create a new index my-index-1 with updated mapping (foo.baz removed) using PUT /my-index-1
{
    "mappings": {
        "doc": {
            "properties": {
                "foo": {
                    "properties": {
                        "fizz": {
                            "type": "keyword"
                        }
                    }
                }
            }
        }
    }
}
  2. Reindex data from my-index-0 to my-index-1 using POST /_reindex
{
  "source": {
    "index": "my-index-0"
  },
  "dest": {
    "index": "my-index-1"
  }
}
  3. Move the my-index alias to the my-index-1 index using POST /_aliases
{
    "actions": [
        {"remove": {"index": "my-index-0", "alias": "my-index"}},
        {"add": {"index": "my-index-1", "alias": "my-index"}},
    ]
}

Expected result

Data in the new index does not have the foo.baz property.

Actual result

When my-index-1 is created, its mapping does not contain the foo.baz field. After reindexing, however, the mapping of my-index-1 has changed back to the old index's mapping.

Note: _source can be used to remove simple (top-level) fields

To remove a top-level field, for example bar from the mapping below,

{
    "mappings": {
        "foo": {
            "type": "text"
        },
        "bar": {
            "type": "text"
        }
    }
}

it is sufficient to pass a _source parameter that omits the bar field in the reindex API request:

{
  "source": {
    "index": "my-index-0",
    "_source": ["foo"]
  },
  "dest": {
    "index": "my-index-1"
  }
}

How to achieve the same with a nested structure?

CodePudding user response:

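When you reindex, ES copies every document in full from the source index to the destination index, and dynamic mapping adds any field that the destination mapping does not define yet. If you want to prevent the destination mapping from being modified, add this line to your mapping: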

"dynamic" : "strict"

Now if you reindex the data, you will get a "strict_dynamic_mapping_exception" error, because "mapping set to strict, dynamic introduction of [baz] within [foo] is not allowed". So you need to remove this field during the reindex with a script, like this:

POST _reindex
{
  "source": {
    "index": "my-index-0"
  },
  "dest": {
    "index": "my-index-1"
  },
  "script": {
    "source": "ctx._source.remove(\"foo.baz\")"
  }
}

Note: adding "dynamic": "strict" is optional; it only prevents the mapping from being changed by dynamically introduced fields. Adding the script to your reindex request is enough to drop the field.
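
To illustrate why the script has to descend into foo before removing baz, here is a minimal Python sketch of the same operation on a plain dictionary. The helper name remove_path is hypothetical and not part of any Elasticsearch client; it only shows that a dotted path has to be walked key by key, because the document source is a nested map rather than a flat one.

```python
from typing import Any


def remove_path(doc: dict[str, Any], path: str) -> None:
    """Remove the value at a dotted path (e.g. "foo.baz") from a nested dict, in place."""
    *parents, leaf = path.split(".")
    node: Any = doc
    for key in parents:
        # Walk down one level at a time; stop if the path does not exist.
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            return
    if isinstance(node, dict):
        node.pop(leaf, None)


doc = {"foo": {"fizz": "a", "baz": "b"}}
doc.pop("foo.baz", None)     # no effect: there is no top-level key "foo.baz"
remove_path(doc, "foo.baz")  # walks into "foo" and removes "baz"
print(doc)
```

After both calls only the second one has changed the document, which is the same reason a flat remove of "foo.baz" on the source map leaves a nested field in place.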

CodePudding user response:

I think I've found the generic solution I was looking for.

In the _source attribute, one can explicitly list every nested field. The _source value for the scenario in the example should therefore be ["foo.fizz"]; note the absence of "foo.baz", which shouldn't be copied.

{
  "source": {
    "index": "my-index-0",
    "_source": ["foo.fizz"]
  },
  "dest": {
    "index": "my-index-1"
  }
}
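
As a rough illustration of what such an include list does to each document, the sketch below mimics the filtering on a plain Python dictionary. It is a simplified model and not the Elasticsearch implementation: the function name filter_source is hypothetical, and wildcards and arrays of objects, which the real source filtering also handles, are left out on purpose.

```python
import copy
from typing import Any

_MISSING = object()  # sentinel for "path not present in this document"


def filter_source(doc: dict[str, Any], includes: list[str]) -> dict[str, Any]:
    """Return a new dict holding only the values reachable via the dotted paths."""
    result: dict[str, Any] = {}
    for path in includes:
        keys = path.split(".")
        # Resolve the value first so missing paths leave no empty parents behind.
        value: Any = doc
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                value = _MISSING
                break
            value = value[key]
        if value is _MISSING:
            continue
        # Rebuild the parent objects in the result, then attach a copy of the value.
        target = result
        for key in keys[:-1]:
            target = target.setdefault(key, {})
        target[keys[-1]] = copy.deepcopy(value)
    return result


doc = {"foo": {"fizz": "a", "baz": "b"}}
print(filter_source(doc, ["foo.fizz"]))
```

The filtered document keeps the foo object but only its fizz member, so nothing named baz reaches the destination index and dynamic mapping has no reason to add it.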

Essentially, the problem of generating the "_source" attribute in the generic case can be reduced to finding the intersection of the sets of all property paths in the old and new mappings.

Python solution

The function below recursively iterates through the properties and yields all property paths.

from typing import Any, Iterator


def get_property_path(properties: dict[str, Any], name: str = "") -> Iterator[str]:
    """Yield the dotted path of every leaf property in a mapping's "properties" dict."""
    for property_name, property_value in properties.items():
        new_name = f"{name}.{property_name}" if name else property_name
        # Object fields carry their own "properties"; recurse into them.
        if nested_properties := property_value.get("properties"):
            yield from get_property_path(nested_properties, new_name)
        else:
            yield new_name

For example:

>>> properties = {
    "a": {
        "properties": {
            "b": {
                "properties": {
                    "c": {"type": "text"},
                },
            },
        },
    },
    "e": {
        "properties": {
            "f": {"type": "text"},
        },
    },
}
>>> list(get_property_path(properties))
['a.b.c', 'e.f']

It can be later used to calculate the set of fields that should be copied (fields that are both in old and new mapping):

_source = list(
    set(get_property_path(old_mapping["properties"]))
    & set(get_property_path(new_mapping["properties"]))
)
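
Putting the pieces together, the following sketch builds a complete reindex request body from two mappings. The name build_reindex_body and its parameters are assumptions made for illustration; the function expects the part of each mapping whose top-level key is "properties" (for the typed mapping in the question, that would be the value under "doc"). The path helper is repeated so the block runs on its own.

```python
import json
from typing import Any, Iterator


def get_property_path(properties: dict[str, Any], name: str = "") -> Iterator[str]:
    """Yield the dotted path of every leaf property."""
    for property_name, property_value in properties.items():
        new_name = f"{name}.{property_name}" if name else property_name
        if nested_properties := property_value.get("properties"):
            yield from get_property_path(nested_properties, new_name)
        else:
            yield new_name


def build_reindex_body(
    old_mapping: dict[str, Any],
    new_mapping: dict[str, Any],
    source_index: str,
    dest_index: str,
) -> dict[str, Any]:
    """Build a _reindex body that copies only fields present in both mappings."""
    common = set(get_property_path(old_mapping["properties"])) & set(
        get_property_path(new_mapping["properties"])
    )
    return {
        # sorted() keeps the generated request stable between runs
        "source": {"index": source_index, "_source": sorted(common)},
        "dest": {"index": dest_index},
    }


old_mapping = {
    "properties": {
        "foo": {"properties": {"fizz": {"type": "keyword"}, "baz": {"type": "keyword"}}}
    }
}
new_mapping = {"properties": {"foo": {"properties": {"fizz": {"type": "keyword"}}}}}

body = build_reindex_body(old_mapping, new_mapping, "my-index-0", "my-index-1")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the JSON body of POST /_reindex with whichever HTTP or Elasticsearch client is already in use.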

I won't accept my answer though, as there might be a simpler solution based on the ES API.
