Efficiently check duplicates while uploading batch data with pymongo


I have a MongoDB collection that is sequentially updated using batches of DataFrames:

print(batch_df_0)

id      date     shop  product
1   28/10/2021    1     apple
2   28/10/2021    2     apple
3   28/10/2021    3     apple

##################
# MongoDB Update #
##################

print(batch_df_1)

id      date     shop  product
1   28/10/2021    1     apple  # not to be uploaded, since already in DB
1   29/10/2021    1     apple  # OK
1   29/10/2021    1     banana # OK, since product is not a key
10  29/10/2021    1     apple  # OK
1   29/10/2021    2     banana # OK
1   29/10/2021    3     apple  # OK

print(batch_df_1_to_be_updated)

id      date     shop  product
1   29/10/2021    1     apple
1   29/10/2021    1     banana
10  29/10/2021    1     apple
1   29/10/2021    2     banana
1   29/10/2021    3     apple

##################
# MongoDB Update #
##################

I would like to ensure that I don't upload the same row twice (e.g. 1 28/10/2021 1 apple from batch_df_1, which was already present in batch_df_0), using "id", "date" and "shop" as the DB keys over which duplicates should be checked.

So far, I have tried creating a unique compound index:

compound_index = [('id', 1), ('date', 1), ('shop', 1)]
collection.create_index(compound_index, unique=True)

insert_result = collection.insert_many(batch_df_1.to_dict("records"))

However, it stops the upload as soon as a duplicate is found.

Is there an efficient way to check each row of the DataFrame for duplicates without aborting the upload of the whole DataFrame?

CodePudding user response:

Pass ordered=False to insert_many so that it continues inserting the remaining documents and raises a single exception at the end.

from pymongo.errors import BulkWriteError

try:
    insert_result = collection.insert_many(batch_df_1.to_dict("records"), ordered=False)
except BulkWriteError:
    # Duplicate keys were rejected by the unique index; all other rows were inserted
    pass
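
If you want to make sure that only the expected duplicate-key errors are being swallowed, you can inspect the exception's details. Below is a minimal sketch, assuming the same collection and batch_df_1 as above; 11000 is MongoDB's duplicate-key error code, and nInserted reports how many documents were actually written.

from pymongo.errors import BulkWriteError

try:
    insert_result = collection.insert_many(batch_df_1.to_dict("records"), ordered=False)
    print(f"inserted {len(insert_result.inserted_ids)} rows")
except BulkWriteError as err:
    write_errors = err.details["writeErrors"]
    # 11000 = duplicate key error; anything else should not be ignored
    dupes = [e for e in write_errors if e["code"] == 11000]
    others = [e for e in write_errors if e["code"] != 11000]
    print(f"inserted {err.details['nInserted']} rows, skipped {len(dupes)} duplicates")
    if others:
        raise

With ordered=False the server still attempts every document in the batch, so the rows that violate the unique index are the only ones skipped.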