Moving specific Index Data into a new Index within Elasticsearch

Time:12-16

I have several million docs that I need to move into a new index, but only the docs that meet a condition should go into it. Say I have a field named offset that needs to be queried against. The values I need to query for are [1,7,99,32, ....., 10000432] (a very large list).

Does anyone have thoughts on how I can move the specific docs with those values into a new Elasticsearch index? My first thought was reindexing with a query, but there is no pattern in the offsets list.

Would it be a Python loop appending each doc to a new index? Looking for any guidance. Thanks

CodePudding user response:

Are the documents really large, or can you add them to a JSONL file for bulk ingestion? In what form is the selector list, the one shown as "[1,7,99,32, ....., 10000432]"?

I'd do it in Pandas, but here is an idea in ES parlance. Whatever you do, do use the _bulk API, or the job will never finish.
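For reference, the _bulk API takes newline-delimited JSON: an action line followed by a source line for each document, with a newline after every line. Below is a minimal sketch of building such a payload in Python. The function name to_bulk_ndjson and the destination index name are hypothetical, and each input item is assumed to be a dict shaped like a search hit, with "_id" and "_source" keys.

```python
import json


def to_bulk_ndjson(hits, dest_index):
    """Build a _bulk request body (NDJSON) that indexes each hit into dest_index.

    `hits` is assumed to be an iterable of dicts shaped like search hits,
    i.e. each one has "_id" and "_source" keys.
    """
    lines = []
    for hit in hits:
        # Action line: names the target index and the document id to write.
        lines.append(json.dumps({"index": {"_index": dest_index, "_id": hit["_id"]}}))
        # Source line: the document body itself.
        lines.append(json.dumps(hit["_source"]))
    # Every line, including the last one, must be terminated by a newline.
    return "".join(line + "\n" for line in lines)
```

The resulting string can be written to a file or sent as the body of a POST to _bulk; for several million docs it would presumably be produced and sent in batches rather than as one payload.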

You can run a query based on a file, as per GET my_index/_search?_file="myquery_file"

You can put all the ids into a file, myquery_file, as below:

{
  "query": {
    "ids" : {
      "values" : ["1", "4", "100"]
    }
  },
  "format": "jsonl"
}

and output as jsonl to ingest.
You can do the above for the reindex API.

{
  "source": {
    "index": "source",
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}
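Since the question filters on a field value rather than on the document _id, a terms query on that field may fit better than ids. Here is a sketch of generating reindex request bodies that way, assuming the field is called offset and using placeholder index names; reindex_bodies and chunk_size are hypothetical names. By default Elasticsearch caps a terms query at 65,536 values (the index.max_terms_count setting), so the list is split into batches.

```python
def reindex_bodies(values, source_index, dest_index, field="offset", chunk_size=10000):
    """Yield one _reindex request body per batch of values.

    Each body selects the documents whose `field` equals any value in the
    batch, using a `terms` query.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    values = list(values)
    for start in range(0, len(values), chunk_size):
        batch = values[start:start + chunk_size]
        yield {
            "source": {
                "index": source_index,
                "query": {"terms": {field: batch}},
            },
            "dest": {"index": dest_index},
        }
```

Each yielded body can be sent as a POST to _reindex, so a list of a few million offsets becomes a few hundred requests instead of one request per offset.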

CodePudding user response:

Unfortunately, I was facing a time crunch and had to throw in a custom loop to query a very specific subset of indices.

import pandas as pd
from elasticsearch import Elasticsearch

# Assumed connection details; adjust the URL and credentials for your cluster.
client = Elasticsearch("http://localhost:9200")

df = pd.read_csv('C://code//part_1_final.csv')

# Offsets are the "unique" values I need to identify the docs by. There is no
# pattern in these values, thus I must go one by one.
offsets = df['OFFSET'].tolist()

missedDocs = []

for i in offsets:
    print(i)
    try:
        client.reindex(body={
            "source": {
                "index": "<source_index>",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"<index_field_1>": "1"}},
                            {"match": {"<index_with_that_needs_values_to_match>": i}}
                        ]
                    }
                }
            },
            "dest": {
                "index": "<dest_index>"
            }
        })
    except Exception as exc:
        # Record the offset so it can be retried later.
        missedDocs.append(i)
        print('DOC ERROR', i, exc)

