I have two large files, with 200k and 100k lines respectively. Each file contains two columns: a checksum and the path of the file it was taken from. Half of the paths in the first file appear in the second file and half do not. My goal is to compare the checksums of files that share the same path.
I tried to use diff, but it doesn't work correctly on all lines. I then wrote a script that compares the file paths first and the checksums only when the paths match. But with this many lines, the script takes an incredibly long time to finish.
#!/bin/bash
# Usage: ./compare.sh file1 file2
# Each input line is "<checksum> <path>". For every path that appears in
# both files, write both lines to ./diff.txt when the checksums differ.
IFS=$'\n'
del=' '
while read -r LineG; do
    # file2 is rescanned from the top for every line of file1,
    # which is what makes this quadratic and therefore slow.
    while read -r LineA; do
        # Compare the paths (everything after the first space).
        if [ "${LineG#*$del}" = "${LineA#*$del}" ]; then
            # Paths match: compare the checksums (everything before it).
            if [ "${LineG%%$del*}" != "${LineA%%$del*}" ]; then
                printf '%s\n%s\n\n' "$LineG" "$LineA" >> ./diff.txt
            fi
            break
        fi
    done < "$2"
done < "$1"
How can I solve this problem, and how can the process be made to run faster?
CodePudding user response:
Your script is slow because it rescans file2 for every line of file1, on the order of 200k x 100k comparisons. I would instead sort both files by path once, let join pair up the matching lines, and compare the checksums with awk:

# Join the two files on the path column (join needs its inputs sorted
# on the join field), emitting checksum1, checksum2, path.
join -j2 -o 1.1,2.1,1.2 <(sort -k2 file1) <(sort -k2 file2) |
# Print the paths whose checksums differ.
awk '$1 != $2 { print $3 }'
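
If you would rather avoid sorting altogether, a hash lookup can do the whole job in a single pass. This is a minimal sketch, assuming space-separated columns with the checksum first and no spaces in the paths:

# Remember file2's checksum for each path (NR == FNR is true only while
# reading the first file argument), then, for each line of file1, print
# the path if it exists in file2 with a different checksum.
awk 'NR == FNR { sum[$2] = $1; next }
     ($2 in sum) && $1 != sum[$2] { print $2 }' file2 file1

Unlike the join pipeline, this also preserves the original line order of file1.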