I have a huge 1TB file on my local machine and a copy of it on a remote server. I need to calculate their MD5 checksums to check whether they are exactly the same. Since calculating the MD5 of files this large will take a long time, I want to do some research on MD5 speed first. I can calculate the MD5 directly against the whole file, or split it into 100 10GB pieces and calculate MD5s on each of them. I want to know which one is faster, or will they take the same time?
CodePudding user response:
As I was trying to say in the comments, it will depend on many things, such as the speed of your disk subsystem, your CPU performance and so on.
Here is an example. Create a 120GB file and check its size (this is BSD/macOS dd; GNU dd on Linux wants bs=1G):
dd if=/dev/random of=junk bs=1g count=120
ls -lh junk
-rw-r--r-- 1 mark staff 120G 5 Oct 13:34 junk
Checksum in one go:
time md5sum junk
3c8fb0d5397be5a8b996239f1f5ce2f0 junk
real 3m55.713s <--- 4 minutes
user 3m28.441s
sys 0m24.871s
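As an aside, you don't have to physically split the file to checksum it in pieces; dd can read one slice at a time and pipe it to md5sum. A minimal sketch, checksumming just the second 10GB slice of the same junk file (bs=1g is the BSD/macOS spelling used above):

# checksum the 10GB slice starting at offset 10GB, without creating any split files
dd if=junk bs=1g skip=10 count=10 2>/dev/null | md5sum

GNU parallel does exactly this slicing for you across every chunk at once, which is what the next test uses.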
Checksum in 10GB chunks, with 12 CPU cores in parallel:
time parallel -k --pipepart --recend '' --recstart '' --block 10G -a junk md5sum
29010b411a251ff467a325bfbb665b0d -
793f02bb52407415b2bfb752827e3845 -
bf8f724d63f972251c2973c5bc73b68f -
d227dcb00f981012527fdfe12b0a9e0e -
5d16440053f78a56f6233b1a6849bb8a -
dacb9fb1ef2b564e9f6373a4c2a90219 -
ba40d6e7d6a32e03fabb61bb0d21843a -
5a5ee62d91266d9a02a37b59c3e2d581 -
95463c030b73c61d8d4f0e9c5be645de -
4bcd7d43849b65d98d9619df27c37679 -
92bc1f80d35596191d915af907f4d951 -
44f3cb8a0196ce37c323e8c6215c7771 -
real 1m0.046s <--- 1 minute
user 4m51.073s
sys 3m51.335s
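A quick note on the flags, since they matter here: --pipepart reads block-sized parts straight from the file given with -a instead of copying everything through a pipe, --block 10G sets the part size, --recend '' --recstart '' disable record splitting so the cuts fall at exact 10GB byte offsets, and -k keeps the output in chunk order so the list of checksums is stable between runs.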
It takes about a quarter of the time on my machine, but your mileage will vary depending on your disk subsystem, your CPU and so on. One caveat: the 12 chunk checksums are not the same thing as the single whole-file MD5, so if you go the chunked route you must use the identical block size and order on both machines and compare the two lists.
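To apply this to your actual problem, you could run the identical chunked command on both machines and compare the two lists; if every line matches, the files are identical. A minimal sketch, assuming GNU parallel is installed on both ends and using placeholder names (user@server and /data/bigfile are hypothetical):

# local copy, 10G chunks, checksums in chunk order
parallel -k --pipepart --recend '' --recstart '' --block 10G -a bigfile md5sum > local.md5

# same command on the remote copy, run over ssh
ssh user@server "parallel -k --pipepart --recend '' --recstart '' --block 10G -a /data/bigfile md5sum" > remote.md5

# no output means every chunk matches
diff local.md5 remote.md5

Because -k fixes the output order, diff is a valid comparison; and as a bonus, if the files do differ, the first mismatching line tells you roughly where.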