How to skip mongorestore's `E11000 duplicate key error collection` errors when continuing to restore

I'm trying to restore a 200GB MongoDB dump created by others (the Tuples DB from http://webdatacommons.org/isadb/, to be precise), but at some point the mongod process aborts, so I've only managed to restore about 70GB of it so far. My problem is that when I restart the mongod and mongorestore processes, mongorestore starts by trying to insert all the tuples it has already inserted (continuing through error: E11000 duplicate key error collection: tuplesdb.cco index: _id_ dup key: { _id: 2 } and so on; redirected to a text file, that is over 30GB of these error messages before it crashes).

Now, is there a way to find out which parts of the dump have already been restored and to tell mongorestore to skip those? Or is there another better way to restore a big MongoDB?

I've used the following two commands:

nohup mongorestore tuples-webisadb-april-2016 > mongorestore.out 2> mongorestore.err < /dev/null

nohup mongod --dbpath /data/webisadb/mongodb > mongod.out 2> mongod.err < /dev/null

I've read about mongorestore's --drop parameter, but that's not what I need. Dropping the collections and inserting all the tuples again, rather than skipping the ones that are already there, is not going to solve my problem.

Thanks for the help!

CodePudding user response：

> Now, is there a way to find out which parts of the dump have already been restored and to tell mongorestore to skip those?

No.

> another better way to restore a big MongoDB

Also no. However, I recommend a few things:

1. You did not give any details on why the process "crashes". I would investigate this first: is it running out of memory? Is it a disk-related issue? Is the data corrupted? You need to understand why the process is crashing so you can tend to the actual issue.
2. Make it easier for the process. The hardest part of these restores for Mongo is index building, so I recommend using the --noIndexRestore flag and rebuilding the indexes manually once the restore is done. You'd be quite surprised by the difference in performance once indexes are out of the picture. (Make sure to drop the indexes from the existing collection as well, since you already have partial data inserted.)
3. It's not clear whether this dump contains multiple collections, but if it does, I recommend restoring them separately.
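
The index-free restore and the per-collection restore could be combined roughly as follows. This is only a sketch: it assumes the dump has the usual mongodump layout of one .bson file per collection under <dump>/<database>/, and restore_commands is a made-up helper name. It only prints the mongorestore commands, so nothing is restored until you run the printed lines yourself.

```shell
#!/bin/sh
# Sketch: print one mongorestore command per collection, without index builds.
# Assumption: the dump directory looks like <dump>/<database>/<collection>.bson.
# restore_commands is a hypothetical helper name, not part of MongoDB.

restore_commands() {
    dump_dir=$1
    for bson in "$dump_dir"/*/*.bson; do
        # Skip the unexpanded pattern when no .bson file matches
        [ -e "$bson" ] || continue
        db=$(basename "$(dirname "$bson")")
        coll=$(basename "$bson" .bson)
        printf 'mongorestore --noIndexRestore --db %s --collection %s %s\n' \
            "$db" "$coll" "$bson"
    done
}

# Dry run against the dump directory named in the question
restore_commands tuples-webisadb-april-2016
```

Running the printed commands one at a time makes it easier to tie a crash to a specific collection. Indexes on the partially restored collections can be dropped with db.<collection>.dropIndexes() in the mongo shell and recreated with createIndex() afterwards; mongodump normally records the index definitions in each collection's .metadata.json file.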

If you do decide to investigate the "crash", I'd be happy to help.
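
A first pass at that investigation might look like the sketch below. It assumes a Linux host, the mongod.out and mongod.err files and the dbpath from the question's commands, and a guessed list of search terms; crash_hints is a hypothetical helper name. Treat the matches as hints about memory or disk problems, not as a diagnosis.

```shell
#!/bin/sh
# Sketch: collect hints about why mongod aborted.
# Assumptions: Linux host; log file names and dbpath taken from the question;
# the search terms are a guess at typical failure messages.

crash_hints() {
    # Print the last matching lines of a log file (case-insensitive match)
    grep -i -E 'out of memory|no space left on device|fatal|abort' "$1" | tail -n 20
}

for log in mongod.out mongod.err; do
    if [ -f "$log" ]; then
        echo "== $log =="
        crash_hints "$log"
    fi
done

# Kernel log: did the OOM killer terminate a process? (may need root)
dmesg 2>/dev/null | grep -i -E 'out of memory|killed process' | tail -n 5 || true

# Free space on the volume that holds the data files
if [ -d /data/webisadb/mongodb ]; then
    df -h /data/webisadb/mongodb
fi
```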