Match string at end of line in file

Time:10-01

I have two files:

  1. $hashfile: a hash and a ./relative/path/to/file/name, one pair per line

  2. $badfiles: ./relative/path/to/file/names that I need to find in $hashfile.

Here is an extract of a $hashfile:

c2c99b59f3303cafac85c2c6df6653cc  ./vm-mount.sh
058a8fb0b9366f248be32b7390e94595  ./Jerusalem_Canon EOS R5_20210601_031.jpg~
23eba1c54846de5244312047e2709f9a  ./rsync-back.sh
ff3f08f7bf45f8e9ef8b33192db3ce9a  ./vm-backup.sh
11e0d980f3b2219f65da97a0318e7dce  ./Jerusalem_Canon EOS R5_20210601_031.jpg
49fb1fb660dce09acd87861a228c899d  ./vm-test.sh

Here is an example of $badfiles containing the search patterns:

./Jerusalem_Canon EOS R5_20210601_031.jpg
./file.txt

I need to search the $hashfile for the patterns inside $badfiles and write the matching lines containing the hash to a third file $new.

So far I've used the following:

grep -Ff "$badfiles" "$hashfile" > "$new"

However, this will match both:

058a8fb0b9366f248be32b7390e94595  ./Jerusalem_Canon EOS R5_20210601_031.jpg~
11e0d980f3b2219f65da97a0318e7dce  ./Jerusalem_Canon EOS R5_20210601_031.jpg

I then added a $ at the end of each line in $badfiles and changed the grep command to:

grep -f "$badfiles" "$hashfile" > "$new"

This worked on a small test folder, but I'm concerned that, without -F, the patterns are interpreted as regular expressions rather than fixed strings, which could create havoc on large file systems. I have some 300,000 file names and hashes, some of which use special characters like "':,;<>()[]? - in short, any character that a Linux ext4 and/or Windows NTFS file system will accept.
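To make the concern concrete, here is a minimal sketch using two made-up file names (./photo.jpg and ./photoXjpg are hypothetical, as are the short placeholder hashes). It shows that once the patterns are read as regular expressions, the dot in a file name matches any character:

```shell
# Hypothetical two-line hash file; the second name differs from the first
# only at the position where the pattern has a dot.
tmp=$(mktemp -d)
printf '%s\n' \
  'aaaa  ./photo.jpg' \
  'bbbb  ./photoXjpg' > "$tmp/hashfile"

# Pattern file with "$" appended to anchor the match at end of line.
printf '%s\n' './photo.jpg$' > "$tmp/badfiles"

# Without -F, every "." is a regex wildcard, so both lines are reported.
regex_hits=$(grep -f "$tmp/badfiles" "$tmp/hashfile")
printf '%s\n' "$regex_hits"
```

Characters such as [ ] * and ? in real file names would raise similar problems, which is why an approach that compares whole strings is safer here.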

Any ideas?

CodePudding user response:

This awk solution should work for you:

awk 'FNR == NR {srch[$0]; next}
{s = $0; sub(/^[^[:blank:]]+[[:blank:]]+/, "", s)}
s in srch' badfiles hashfile

11e0d980f3b2219f65da97a0318e7dce  ./Jerusalem_Canon EOS R5_20210601_031.jpg

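This solution first stores every line of badfiles as a key in the array srch. Then, for each line of hashfile, it strips the text up to and including the first run of whitespace and prints the line if the remaining part is a key in srch. Because the comparison is an exact string match, no character in the file names is treated as a pattern.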
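As a rough end-to-end sketch of the steps described above, the following builds small sample files from the extract in the question and writes the matches to $new. The temporary file names (hashes.md5, bad.txt, matches.md5) are made up for illustration, and [ \t] is used in place of [[:blank:]] only as a precaution, on the assumption that some older awk builds lack POSIX character classes:

```shell
# Temporary sample files; the names are hypothetical.
tmp=$(mktemp -d)
hashfile=$tmp/hashes.md5
badfiles=$tmp/bad.txt
new=$tmp/matches.md5

cat > "$hashfile" <<'EOF'
c2c99b59f3303cafac85c2c6df6653cc  ./vm-mount.sh
058a8fb0b9366f248be32b7390e94595  ./Jerusalem_Canon EOS R5_20210601_031.jpg~
11e0d980f3b2219f65da97a0318e7dce  ./Jerusalem_Canon EOS R5_20210601_031.jpg
EOF

cat > "$badfiles" <<'EOF'
./Jerusalem_Canon EOS R5_20210601_031.jpg
./file.txt
EOF

awk '
  # First file: remember each bad file name as an array key.
  FNR == NR { srch[$0]; next }
  # Second file: drop the hash and the whitespace that follows it.
  { s = $0; sub(/^[^ \t]+[ \t]+/, "", s) }
  # Print the full line only if the remaining name is an exact key.
  s in srch
' "$badfiles" "$hashfile" > "$new"

cat "$new"
```

Only the line for the .jpg file should end up in $new; the .jpg~ line is skipped because its name is a different string, and ./file.txt simply has no counterpart in the hash file.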
