Moving specific Index Data into a new Index within Elasticsearch

I have several million docs that I need to move into a new index, but only docs that meet a condition should flow into it. Say I have a field named offsets that needs to be queried against. The values I need to query for in the offsets field are: [1,7,99,32, ....., 10000432] (a very large list).

Does anyone have thoughts on how I can move the specific docs with those values into a new Elasticsearch index? My first thought was reindexing with a query, but there is no pattern in the offsets list.

Would it be a Python loop appending each doc to a new index? Looking for any guidance. Thanks

CodePudding user response：

Are the documents really large, or can you add them to a JSONL file for bulk ingestion? In what form is the selector list, the one shown as "[1,7,99,32, ....., 10000432]"?

I'd do it in Pandas, but here is an idea in ES parlance. Whatever you do, do use the _bulk API, or the job will never finish.
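To make the _bulk advice concrete, here is a minimal sketch using the official Python client's scan and bulk helpers. The function names to_bulk_actions and copy_matching_docs are made up for illustration, and the query you pass in is whatever selects your docs; treat it as a starting point under those assumptions rather than a finished tool.

```python
from typing import Any, Dict, Iterable, Iterator


def to_bulk_actions(hits: Iterable[Dict[str, Any]], dest_index: str) -> Iterator[Dict[str, Any]]:
    """Turn search hits into index actions for the bulk helper, keeping each _id."""
    for hit in hits:
        yield {
            "_op_type": "index",
            "_index": dest_index,
            "_id": hit["_id"],
            "_source": hit["_source"],
        }


def copy_matching_docs(client, source_index: str, dest_index: str, query: Dict[str, Any]):
    """Stream every doc matching `query` from source_index into dest_index.

    `client` is assumed to be an elasticsearch.Elasticsearch instance.
    """
    # Imported here so to_bulk_actions stays usable without the client installed.
    from elasticsearch import helpers

    # scan() pages through all matching hits using the scroll API.
    hits = helpers.scan(client, index=source_index, query={"query": query})
    # bulk() groups the actions into _bulk requests; with raise_on_error=False
    # it returns the failures instead of raising on the first one.
    success, errors = helpers.bulk(
        client, to_bulk_actions(hits, dest_index), raise_on_error=False
    )
    return success, errors
```

Calling copy_matching_docs(client, "source", "dest", {"terms": {"offsets": [1, 7, 99]}}) would copy the docs for one batch of values; the index names here are placeholders.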

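You can keep the query in a file and send that file as the request body of GET my_index/_search, for example by passing -d @myquery_file to curl.

You can put all the ids into a file, myquery_file, as below:

{
  "query": {
    "ids" : {
      "values" : ["1", "4", "100"]
    }
  }
}

Note that the ids query matches on _id. If the values live in a regular field such as offsets, a terms query on that field is the equivalent. The search body itself has no output-format option, so write the returned hits out as JSONL yourself and ingest that with _bulk.
The same kind of query can go into the source section of a _reindex request: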

{
  "source": {
    "index": "source",
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}
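For a list with no pattern, one way to apply the reindex request above is to split the values into batches and put each batch into a terms query. The sketch below assumes the values live in a field such as offsets and that client is an Elasticsearch Python client; chunked, build_reindex_body and reindex_by_offsets are illustrative names, not part of any library. A terms query is limited by the index.max_terms_count setting (65,536 by default), so the batch size stays well under that.

```python
from typing import Any, Dict, Iterator, List, Sequence


def chunked(values: Sequence[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of `values` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(values), size):
        yield list(values[start:start + size])


def build_reindex_body(source_index: str, dest_index: str,
                       field: str, values: Sequence[Any]) -> Dict[str, Any]:
    """Build a _reindex body that copies docs whose `field` is in `values`."""
    return {
        "source": {
            "index": source_index,
            "query": {"terms": {field: list(values)}},
        },
        "dest": {"index": dest_index},
    }


def reindex_by_offsets(client, source_index: str, dest_index: str, field: str,
                       offsets: Sequence[Any], batch_size: int = 10_000) -> List[List[Any]]:
    """Run one _reindex call per batch and return the batches that failed."""
    failed = []
    for batch in chunked(offsets, batch_size):
        body = build_reindex_body(source_index, dest_index, field, batch)
        try:
            client.reindex(body=body, wait_for_completion=True)
        except Exception:
            # Keep the batch so it can be retried or inspected afterwards.
            failed.append(batch)
    return failed
```

With several million values this means a few hundred reindex calls instead of one per value. Large batches can take a while to finish, so the client's request timeout may need raising.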

CodePudding user response：

Unfortunately, I was facing a time crunch and had to throw in a custom loop to query a very specific subset of indices.

import pandas as pd
from elasticsearch import Elasticsearch

# Assumption: the URL below is a placeholder; point it at your own cluster.
client = Elasticsearch("http://localhost:9200")

df = pd.read_csv('C://code//part_1_final.csv')

# Offsets are the "unique" values I need to identify the docs by. There is no
# pattern in these values, so I must go one by one.
offsets = df['OFFSET'].tolist()

missedDocs = []

for i in offsets:
    print(i)
    try:
        client.reindex(body={
            "source": {
                "index": "<source_index>",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"<index_field_1>": "1"}},
                            {"match": {"<index_with_that_needs_values_to_match>": i}}
                        ]
                    }
                }
            },
            "dest": {
                "index": "<dest_index>"
            }
        })
    except Exception as exc:
        # Record the offset so the failed reindex can be retried later.
        missedDocs.append(i)
        print('DOC ERROR', i, exc)