The code below searches for a set of patterns (contained in the $snplist variable) within multiple files (the $file variable, for files ending in snp_search.txt) and outputs a long list of whether or not each SNP is in each file.
The purpose is to find several SNPs that are in all of the files.
Is there a way to embed the code below in a while loop so that it keeps running until it finds a SNP that is in all of the files, and breaks when it does? Otherwise I have to check the log file manually.
for snp in $snplist; do
    for file in *snp_search.txt; do
        if grep -wq "$snp" "$file"; then
            echo "${snp} was found in $file" >> "${date}_snp_search.log"
        else
            echo "${snp} was NOT found in $file" >> "${date}_snp_search.log"
        fi
    done
done
CodePudding user response:
You can use grep to search all the files at once. If the file names don't contain newlines, you can count the matching files directly (grep -l prints one matching file name per line):
#! /bin/bash
files=(*snp_search.txt)
count_files=${#files[@]}
for snp in $snplist ; do
    count=$(grep -wl "$snp" "${files[@]}" | wc -l)
    if ((count == count_files)) ; then
        break
    fi
done
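For file names that may contain newlines, you can instead output the first matching line from each file without the file name (with GNU grep, -m1 stops after the first match in each file and -h suppresses the file name) and count the lines:
    count=$(grep -m1 -hw "$snp" *snp_search.txt | wc -l)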
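If you also want to know which SNP was found, or want to stop scanning as soon as one file lacks the SNP, a nested loop with grep -q is another option. The following is only a minimal sketch, assuming the SNPs are whitespace-separated in a single string as in the question; the function name first_common_snp is made up for illustration.

```shell
# Print the first SNP that occurs (as a whole word) in every given file.
# Hypothetical helper, not part of the original question or answer.
# Usage: first_common_snp "SNP1 SNP2 ..." file1 file2 ...
# Returns 0 and prints the SNP if one is found, 1 otherwise.
first_common_snp() {
    local snplist="$1" snp file found
    shift
    [ "$#" -gt 0 ] || return 1          # no files to search
    for snp in $snplist; do             # unquoted on purpose: split the list on whitespace
        found=1
        for file in "$@"; do
            if ! grep -wq -- "$snp" "$file"; then
                found=0
                break                   # missing from one file: try the next SNP
            fi
        done
        if [ "$found" -eq 1 ]; then
            printf '%s\n' "$snp"
            return 0
        fi
    done
    return 1
}
```

It could be called as: if snp=$(first_common_snp "$snplist" *snp_search.txt); then echo "$snp is in all files"; fi. Because each file is passed to grep separately, unusual file names are not a problem, at the cost of one grep process per SNP and file.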
CodePudding user response:
Assumptions:
- multiple SNPs may exist in a single line of an input file
- will print a list of all SNPs that exist in all files (OP has made contradictory statements: "find several SNPs that are in all of the files" vs "break when one SNP is found in all files")
Sample inputs (will update if OP updates question with sample data):
$ cat snp.dat
ABC
DEF
XYZZ
$ cat 1.snp.search.txt
ABCD-XABC
someABC_stuff
ABC-
de-ABC-
de-ABC
DEFG
zDEFG
.DEF-xyz
abc-DEF
abc-DEF-ABC-xyz
$ cat 2.snp.search.txt
ABC
One GNU awk idea that requires a single pass through each input file:
awk '
FNR==NR { snps[$1]=0; next }                # load 1st file into array; initialize counter (of files containing this snp) to 0
FNR==1  { filecount++                       # 1st line of 2nd-nth files: increment counter of number of files
          delete to_find                    # delete our to_find[] array
          for (snp in snps)                 # make a copy of our master snps[] array ...
              to_find[snp]                  # storing copy in to_find[] array
        }
        { for (snp in to_find) {            # loop through list of snps
              if ($0 ~ "\\y" snp "\\y") {   # if current line contains a "word" match on the current snp ...
                 snps[snp]++                # increment our snp counter (ie, number of files containing this snp)
                 delete to_find[snp]        # no longer need to search current file for this particular snp
                 # break                    # if line can only contain 1 snp then uncomment this line
              }
          }
          for (snp in to_find)              # if we still have an snp to find then ...
              next                          # skip to next line else ...
          nextfile                          # skip to next file
        }
END     { PROCINFO["sorted_in"]="@ind_str_asc"
          for (snp in snps)
              if (snps[snp] == filecount)
                 printf "The SNP %s was found in all files\n", snp
        }
' snp.dat *.snp.search.txt
NOTES:
- GNU awk is required for the PROCINFO["sorted_in"]="@ind_str_asc" option to sort the snps[] array indices; if GNU awk is not available, or ordering of output messages is not important, then this command can be removed from the code (note that the \y word-boundary operator is also a GNU awk extension)
- since we only process each input file once we will print all SNPs that show up in all files (ie, we won't know if a SNP exists in all files until we've processed the last file so might as well print all SNPs that exist in all files)
- should be faster than processes that require multiple scans of each input file (especially for larger files and/or a large number of SNPs)
This generates:
The SNP ABC was found in all files
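If GNU awk is not available, the same single-pass idea can be approximated with portable constructs. The following is only a sketch under some assumptions: SNP names contain no regex metacharacters, a "word" character is a letter, digit or underscore, and the function name common_snps_awk is made up for illustration. It replaces \y with an explicit character-class test, drops nextfile (so every line is read), and sorts the output with sort instead of PROCINFO.

```shell
# Print, one per line and sorted, every SNP that occurs as a whole word in all files.
# Hypothetical helper: $1 = file with one SNP per line, remaining args = files to search.
common_snps_awk() {
    awk '
    FNR == NR { snps[$1] = 0; next }                   # 1st file: load the SNP list
    FNR == 1  { filecount++                            # new search file: count it ...
                for (s in seen) delete seen[s] }       # ... and forget what the previous file matched
    {
        for (snp in snps)
            # whole-word test without the gawk-only \y operator
            if (!(snp in seen) && $0 ~ ("(^|[^A-Za-z0-9_])" snp "([^A-Za-z0-9_]|$)")) {
                seen[snp]                              # count each SNP at most once per file
                snps[snp]++
            }
    }
    END { for (snp in snps) if (snps[snp] == filecount) print snp }
    ' "$@" | sort
}
```

It could be called as: common_snps_awk snp.dat *.snp.search.txt. Without nextfile this variant keeps reading a file even after all SNPs have been found in it, so it trades some speed for portability.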