I have a huge 1TB file on my local machine and a copy of it on a remote server. I need to calculate their MD5 checksums to check whether they are exactly the same. Since calculating the MD5 of files this large will take a long time, I want to do some research on MD5 speed first. I can calculate the MD5 directly against the whole file, or split it into 100 10GB pieces and calculate MD5s on each of them. I want to know which one is faster, or will they take the same time?
CodePudding user response:
As I was trying to say in the comments, it will depend on many things, such as the speed of your disk subsystem, your CPU performance and so on.
Here is an example. Create a 120GB file and check its size (this is BSD/macOS dd; GNU dd on Linux wants bs=1G):
dd if=/dev/random of=junk bs=1g count=120
ls -lh junk
-rw-r--r-- 1 mark staff 120G 5 Oct 13:34 junk
Checksum in one go:
time md5sum junk
3c8fb0d5397be5a8b996239f1f5ce2f0 junk
real 3m55.713s <--- 4 minutes
user 3m28.441s
sys 0m24.871s
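As an aside, you don't have to physically split the file to checksum it in pieces; dd can read one slice at a time and pipe it to md5sum. A minimal sketch, checksumming just the second 10GB slice of the same junk file (bs=1g is the BSD/macOS spelling used above):

# checksum the 10GB slice starting at offset 10GB, without creating any split files
dd if=junk bs=1g skip=10 count=10 2>/dev/null | md5sum

GNU parallel does exactly this slicing for you across every chunk at once, which is what the next test uses.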
Checksum in 10GB chunks, with 12 CPU cores in parallel:
time parallel -k --pipepart --recend '' --recstart '' --block 10G -a junk md5sum
29010b411a251ff467a325bfbb665b0d -
793f02bb52407415b2bfb752827e3845 -
bf8f724d63f972251c2973c5bc73b68f -
d227dcb00f981012527fdfe12b0a9e0e -
5d16440053f78a56f6233b1a6849bb8a -
dacb9fb1ef2b564e9f6373a4c2a90219 -
ba40d6e7d6a32e03fabb61bb0d21843a -
5a5ee62d91266d9a02a37b59c3e2d581 -
95463c030b73c61d8d4f0e9c5be645de -
4bcd7d43849b65d98d9619df27c37679 -
92bc1f80d35596191d915af907f4d951 -
44f3cb8a0196ce37c323e8c6215c7771 -
real 1m0.046s <--- 1 minute
user 4m51.073s
sys 3m51.335s
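A quick note on the flags, since they matter here: --pipepart reads block-sized parts straight from the file given with -a instead of copying everything through a pipe, --block 10G sets the part size, --recend '' --recstart '' disable record splitting so the cuts fall at exact 10GB byte offsets, and -k keeps the output in chunk order so the list of checksums is stable between runs.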
It takes about a quarter of the time on my machine, but your mileage will vary depending on your disk subsystem, your CPU and so on. One caveat: the 12 chunk checksums are not the same thing as the single whole-file MD5, so if you go the chunked route you must use the identical block size and order on both machines and compare the two lists.
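To apply this to your actual problem, you could run the identical chunked command on both machines and compare the two lists; if every line matches, the files are identical. A minimal sketch, assuming GNU parallel is installed on both ends and using placeholder names (user@server and /data/bigfile are hypothetical):

# local copy, 10G chunks, checksums in chunk order
parallel -k --pipepart --recend '' --recstart '' --block 10G -a bigfile md5sum > local.md5

# same command on the remote copy, run over ssh
ssh user@server "parallel -k --pipepart --recend '' --recstart '' --block 10G -a /data/bigfile md5sum" > remote.md5

# no output means every chunk matches
diff local.md5 remote.md5

Because -k fixes the output order, diff is a valid comparison; and as a bonus, if the files do differ, the first mismatching line tells you roughly where.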