I have a MongoDB collection that is sequentially updated using batches of DataFrames:
print(batch_df_0)
id        date  shop  product
 1  28/10/2021     1    apple
 2  28/10/2021     2    apple
 3  28/10/2021     3    apple
##################
# MongoDB Update #
##################
print(batch_df_1)
id        date  shop  product
 1  28/10/2021     1    apple   # not to be uploaded, since already in DB
 1  29/10/2021     1    apple   # OK
 1  29/10/2021     1   banana   # OK, since product is not key
10  29/10/2021     1    apple   # OK
 1  29/10/2021     2   banana   # OK
 1  29/10/2021     3    apple   # OK
print(batch_df_1_to_be_updated)
id        date  shop  product
 1  29/10/2021     1    apple
 1  29/10/2021     1   banana
10  29/10/2021     1    apple
 1  29/10/2021     2   banana
 1  29/10/2021     3    apple
##################
# MongoDB Update #
##################
I would like to ensure that I don't upload the same row twice (e.g. 1 28/10/2021 1 apple from batch_df_1, which was already present in batch_df_0), using "id", "date" and "shop" as the DB keys over which duplicates should be checked.
So far, I have tried setting a compound index as:
compound_index = [('id', 1), ('date', 1), ('shop', 1)]
collection.create_index(compound_index, unique=True)
insert_result = collection.insert_many(batch_df_1.to_dict("records"))
However, it stops the upload as soon as a duplicate is found.
Is there an efficient way to check each row of the DataFrame for duplicates without aborting the upload of the whole DataFrame?
CodePudding user response:
Pass ordered=False to the insert_many operation, so that it keeps inserting the remaining documents and only raises the exception at the end.
from pymongo.errors import BulkWriteError

try:
    insert_result = collection.insert_many(batch_df_1.to_dict("records"), ordered=False)
except BulkWriteError:
    # Duplicate keys are rejected by the unique compound index;
    # all non-duplicate rows are still inserted.
    pass
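If you also want to report how many rows actually went in, or make sure that only duplicate-key violations were swallowed, the BulkWriteError exposes a details dictionary. A minimal sketch reusing collection and batch_df_1 from the question (11000 is MongoDB's duplicate-key error code):

from pymongo.errors import BulkWriteError

try:
    result = collection.insert_many(batch_df_1.to_dict("records"), ordered=False)
    print(f"Inserted {len(result.inserted_ids)} new documents")
except BulkWriteError as bwe:
    details = bwe.details
    # Separate duplicate-key violations (code 11000) from any other failure
    dup_errors = [e for e in details["writeErrors"] if e["code"] == 11000]
    other_errors = [e for e in details["writeErrors"] if e["code"] != 11000]
    print(f"Inserted {details['nInserted']} documents, skipped {len(dup_errors)} duplicates")
    if other_errors:
        raise  # don't silently ignore errors that are not duplicates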