Is there a way to iterate over values of a column then check if it's present elsewhere?

I have generated 2 .csv files, one containing the original md5sums of some files in a directory and one containing the md5sums calculated at a specific moment.

md5_original.csv
----------

        $1                      $2  $3
7815696ecbf1c96e6894b779456d330e,,s1.txt
912ec803b2ce49e4a541068d495ab570,,s2.txt
040b7cf4a55014e185813e0644502ea9,,s64.txt
8a0b67188083b924d48ea72cb187b168,,b43.txt

etc.

md5_$current_date.csv
----------

        $1                      $2  $3
7815696ecbf1c96e6894b779456d330e,,s1.txt
4d4046cae9e9bf9218fa653e51cadb08,,s2.txt
3ff22b3585a0d3759f9195b310635c29,,b43.txt

etc.
* Some files may have been deleted by the time the current md5sums are calculated.

I want to iterate over the values of column $3 in md5_$current_date.csv and, for each value, check whether it exists in md5_original.csv. If it does, I then want to compare the corresponding values in column $1.

Output should be:

s2.txt hash changed from 912ec803b2ce49e4a541068d495ab570 to 4d4046cae9e9bf9218fa653e51cadb08.
b43.txt hash changed from 8a0b67188083b924d48ea72cb187b168 to 3ff22b3585a0d3759f9195b310635c29.

I have written the script that builds these two .csv files, but I am struggling with the awk part that has to do what I described above. I don't know if there is a better way to do this; I am a newbie.

Answer:

I would use GNU AWK for this task in the following way. Let the content of md5_original.csv be

7815696ecbf1c96e6894b779456d330e {BLANK_COLUMN} s1.txt
912ec803b2ce49e4a541068d495ab570 {BLANK_COLUMN} s2.txt
040b7cf4a55014e185813e0644502ea9 {BLANK_COLUMN} s64.txt
8a0b67188083b924d48ea72cb187b168 {BLANK_COLUMN} b43.txt

and let the content of md5_current.csv be

7815696ecbf1c96e6894b779456d330e {BLANK_COLUMN} s1.txt
4d4046cae9e9bf9218fa653e51cadb08 {BLANK_COLUMN} s2.txt
3ff22b3585a0d3759f9195b310635c29 {BLANK_COLUMN} b43.txt

then

awk 'FNR==NR{arr[$3]=$1;next}($3 in arr)&&($1 != arr[$3]){print $3 " hash changed from " arr[$3] " to " $1}' md5_original.csv md5_current.csv

output

s2.txt hash changed from 912ec803b2ce49e4a541068d495ab570 to 4d4046cae9e9bf9218fa653e51cadb08
b43.txt hash changed from 8a0b67188083b924d48ea72cb187b168 to 3ff22b3585a0d3759f9195b310635c29

Explanation: FNR is the row number within the current file and NR is the row number across all files, so the two are equal only while the first file is being processed. While processing the first file I build the array arr, whose keys are filenames and whose values are the corresponding hashes; next makes GNU AWK move on to the next line without taking any other action, so the rest of the program applies only to files after the first. ($3 in arr) is a condition: is the current $3 one of the keys of arr? ($1 != arr[$3]) additionally requires that the current hash differs from the stored one. If both hold, I print the concatenation of the current $3 (the filename), the string " hash changed from ", the value stored under key $3 in arr (the old hash), the string " to ", and $1 (the current hash). If the filename is not present in arr, no action is taken. Edit: added an exclusion for hashes that have not changed, as suggested in a comment.
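To make the FNR==NR test easier to see, here is a small illustration that is separate from the command above. It only prints both counters for two throwaway files; the names demo_dir, first.txt, second.txt and trace are made up for this sketch.

```shell
#!/bin/sh
# Hypothetical scratch files, used only to show how NR and FNR behave.
demo_dir=$(mktemp -d)
printf 'a\nb\n' > "$demo_dir/first.txt"
printf 'c\nd\n' > "$demo_dir/second.txt"

# NR keeps counting across files; FNR restarts at 1 for each new file.
trace=$(awk '{
    print $0, "NR=" NR, "FNR=" FNR, (FNR == NR ? "first file" : "later file")
}' "$demo_dir/first.txt" "$demo_dir/second.txt")

printf '%s\n' "$trace"
rm -rf "$demo_dir"
```

This should print "a NR=1 FNR=1 first file", "b NR=2 FNR=2 first file", "c NR=3 FNR=1 later file" and "d NR=4 FNR=2 later file". Note that the test assumes the first file is not empty; if it were, the first line of the second file would also have FNR equal to NR.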

(tested in gawk 4.2.1)
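One caveat: the command above relies on awk's default whitespace splitting, which matches the sample content shown in this answer. The files in the question appear to be comma-separated (hash,,filename), so for those the field separator would have to be set, for example with -F,. Below is a minimal sketch under that assumption, written over several lines with comments; the scratch directory and the names workdir, original and report are made up for the example, and the trailing period is added only to match the output requested in the question.

```shell
#!/bin/sh
# Hypothetical scratch directory holding copies of the sample data from the question.
workdir=$(mktemp -d)

cat > "$workdir/md5_original.csv" <<'EOF'
7815696ecbf1c96e6894b779456d330e,,s1.txt
912ec803b2ce49e4a541068d495ab570,,s2.txt
040b7cf4a55014e185813e0644502ea9,,s64.txt
8a0b67188083b924d48ea72cb187b168,,b43.txt
EOF

cat > "$workdir/md5_current.csv" <<'EOF'
7815696ecbf1c96e6894b779456d330e,,s1.txt
4d4046cae9e9bf9218fa653e51cadb08,,s2.txt
3ff22b3585a0d3759f9195b310635c29,,b43.txt
EOF

# -F, splits each line on commas, so $1 is the hash, $2 is empty, $3 is the filename.
report=$(awk -F, '
    # First file: remember the original hash under each filename.
    FNR == NR { original[$3] = $1; next }
    # Later file: report filenames seen before whose hash now differs.
    ($3 in original) && ($1 != original[$3]) {
        print $3 " hash changed from " original[$3] " to " $1 "."
    }
' "$workdir/md5_original.csv" "$workdir/md5_current.csv")

printf '%s\n' "$report"
rm -rf "$workdir"
```

With the sample data this reports s2.txt and b43.txt as changed; s1.txt is skipped because its hash is unchanged, and s64.txt is never looked up because it does not appear in the current file.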