I have two large files, with 200k and 100k lines respectively. Each file contains two columns: a checksum and the path of the file it was taken from. Half of the paths in the first file appear in the second file and half do not. My goal is to compare the checksums of files that share the same path.
I tried to use diff, but it doesn't work correctly on all lines. I then wrote a script that compares the file paths first and the checksums only when the paths match. But with this many lines, the script takes an incredibly long time to finish.
#!/bin/bash
# Usage: ./compare.sh file1 file2
# Each input line is "<checksum> <path>". For every path that appears in
# both files, write both lines to ./diff.txt when the checksums differ.
IFS=$'\n'
del=' '
while read -r LineG; do
    # file2 is rescanned from the top for every line of file1,
    # which is what makes this quadratic and therefore slow.
    while read -r LineA; do
        # Compare the paths (everything after the first space).
        if [ "${LineG#*$del}" = "${LineA#*$del}" ]; then
            # Paths match: compare the checksums (everything before it).
            if [ "${LineG%%$del*}" != "${LineA%%$del*}" ]; then
                printf '%s\n%s\n\n' "$LineG" "$LineA" >> ./diff.txt
            fi
            break
        fi
    done < "$2"
done < "$1"
How can I solve this problem, and how can the process be made to run faster?
CodePudding user response:
Your script is slow because it rescans file2 for every line of file1, on the order of 200k x 100k comparisons. I would instead sort both files by path once, let join pair up the matching lines, and compare the checksums with awk:

# Join the two files on the path column (join needs its inputs sorted
# on the join field), emitting checksum1, checksum2, path.
join -j2 -o 1.1,2.1,1.2 <(sort -k2 file1) <(sort -k2 file2) |
# Print the paths whose checksums differ.
awk '$1 != $2 { print $3 }'
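
If you would rather avoid sorting altogether, a hash lookup can do the whole job in a single pass. This is a minimal sketch, assuming space-separated columns with the checksum first and no spaces in the paths:

# Remember file2's checksum for each path (NR == FNR is true only while
# reading the first file argument), then, for each line of file1, print
# the path if it exists in file2 with a different checksum.
awk 'NR == FNR { sum[$2] = $1; next }
     ($2 in sum) && $1 != sum[$2] { print $2 }' file2 file1

Unlike the join pipeline, this also preserves the original line order of file1.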