I have two files:
$hashfile: hashes and ./relative/path/to/file/names, one pair per line
$badfiles: ./relative/path/to/file/names that I need to find in $hashfile.
Here is an extract of $hashfile:
c2c99b59f3303cafac85c2c6df6653cc ./vm-mount.sh
058a8fb0b9366f248be32b7390e94595 ./Jerusalem_Canon EOS R5_20210601_031.jpg~
23eba1c54846de5244312047e2709f9a ./rsync-back.sh
ff3f08f7bf45f8e9ef8b33192db3ce9a ./vm-backup.sh
11e0d980f3b2219f65da97a0318e7dce ./Jerusalem_Canon EOS R5_20210601_031.jpg
49fb1fb660dce09acd87861a228c899d ./vm-test.sh
Here is an example of $badfiles containing the search patterns:
./Jerusalem_Canon EOS R5_20210601_031.jpg
./file.txt
I need to search $hashfile for the patterns in $badfiles and write the matching lines, including the hash, to a third file, $new.
So far I've used the following:
grep -Ff "$badfiles" "$hashfile" > "$new"
However, this will match both:
058a8fb0b9366f248be32b7390e94595 ./Jerusalem_Canon EOS R5_20210601_031.jpg~
11e0d980f3b2219f65da97a0318e7dce ./Jerusalem_Canon EOS R5_20210601_031.jpg
I then added a $ at the end of each line in $badfiles and changed the grep command to:
grep -f "$badfiles" "$hashfile" > "$new"
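For reference, I appended the anchors with a quick sed pass (a minimal sketch; the .anchored file name is only for illustration):

sed 's/$/$/' "$badfiles" > "$badfiles.anchored"    # append a literal $ to every pattern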
This worked on a small test folder, but without -F the patterns are now interpreted as regular expressions rather than fixed strings, and I'm concerned that could create havoc on a large file system. I have some 300,000 file names and hashes, some of which use special characters like "':,;<>()[]? - in short, any character that a Linux ext4 and/or Windows NTFS file system will accept.
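For example (with a made-up name), an unescaped . in a pattern is a regex wildcard, so even the anchored search can match the wrong file:

printf '%s\n' 'abc123 ./fileXtxt' | grep './file.txt$'    # prints the line: '.' matches any character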
Any ideas?
CodePudding user response:
This awk solution should work for you:
awk 'FNR == NR {srch[$0]; next}
{s = $0; sub(/^[^[:blank:]]+[[:blank:]]+/, "", s)}
s in srch' badfiles hashfile
11e0d980f3b2219f65da97a0318e7dce ./Jerusalem_Canon EOS R5_20210601_031.jpg
This solution first stores all lines from badfiles in an array srch. Then, for each line of hashfile, it strips the leading non-whitespace field and the whitespace after it (the hash and its separator) and prints the line if the remaining part, i.e. the file name, is found in the srch array.
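Applied to the variables from the question (quoted, since the file names contain spaces), this becomes:

awk 'FNR == NR {srch[$0]; next}
{s = $0; sub(/^[^[:blank:]]+[[:blank:]]+/, "", s)}
s in srch' "$badfiles" "$hashfile" > "$new"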