While loop to break when pattern is found in all files?

The code below searches for a set of patterns (held in the $snplist variable) in multiple files (every file whose name ends in snp_search.txt) and writes a long log stating whether or not each SNP is in each file.

The purpose is to find several SNPs that are in all of the files.

Is there a way to embed the code below in a while loop so that it keeps running until it finds a SNP that is in all of the files, and breaks when it does? Otherwise I have to check the log file manually.

for snp in $snplist; do
   for file in *snp_search.txt; do
     if grep -wq "$snp" "$file"; then
       echo "${snp} was found in $file" >> "${date}_snp_search.log"
     else
       echo "${snp} was NOT found in $file" >> "${date}_snp_search.log"
     fi
   done
done

CodePudding user response：

You can use grep to search all the files. If the file names don't contain newlines, you can just count the number of matching files directly:

#! /bin/bash
files=(*snp_search.txt)
count_files=${#files[@]}
for snp in $snplist ; do
    # -l prints each matching file name once, so the line count is the number of files containing $snp
    count=$(grep -wl "$snp" "${files[@]}" | wc -l)
    if ((count == count_files)) ; then
        break
    fi
done
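
The loop above stops silently, so here is a minimal sketch of the same counting idea wrapped in a function that also prints the SNP it stopped on. The function name first_common_snp and its calling convention are made up for illustration, and it still assumes file names without newlines:

```shell
#!/bin/bash
# Hypothetical helper: print the first SNP that occurs in every file, then stop.
# Usage: first_common_snp "SNP1 SNP2 ..." file1 file2 ...
first_common_snp() {
    local snplist=$1; shift
    (( $# > 0 )) || return 1          # no files given, nothing to search
    local count_files=$#
    local snp count
    for snp in $snplist; do
        # -l prints each matching file name once; count those lines
        count=$(grep -wl -e "$snp" -- "$@" | wc -l)
        if ((count == count_files)); then
            printf '%s\n' "$snp"
            return 0
        fi
    done
    return 1                          # no SNP was found in every file
}
```

Called as first_common_snp "$snplist" *snp_search.txt, it should print the first SNP that every file contains and return non-zero when there is none, so the result can be tested directly in an if.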

For file names containing newlines, you can output the first matching line for each $snp without the file name and count the lines:

count=$(grep -m1 -hw "$snp" *snp_search.txt | wc -l)
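
If counting lines feels fragile, a different technique is to skip the counting and test the files one by one with grep -q, giving up on a SNP at the first file that lacks it. This is only a sketch, not the method above; the function name in_all_files is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the SNP occurs as a whole word in every file given.
# Usage: in_all_files SNP file1 file2 ...
in_all_files() {
    local snp=$1; shift
    local file
    for file in "$@"; do
        # -q exits at the first match; stop at the first file that lacks the SNP
        grep -wq -e "$snp" -- "$file" || return 1
    done
    return 0
}
```

Inside the question's outer loop this would be used as: if in_all_files "$snp" *snp_search.txt; then echo "$snp was found in all files"; break; fi. File names are never parsed from grep's output, so newlines in names do not matter here.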

CodePudding user response：

Assumptions:

multiple SNPs may exist in a single line of an input file
will print a list of all SNPs that exist in all files (OP has mentioned contradicting statements: find several SNPs that are in all of the files vs break when one SNP is found in all files)

Sample inputs (will update if OP updates question with sample data):

$ cat snp.dat
ABC
DEF
XYZZ

$ cat 1.snp.search.txt

ABCD-XABC
someABC_stuff
ABC-
de-ABC-
de-ABC
DEFG
zDEFG
.DEF-xyz
abc-DEF
abc-DEF-ABC-xyz

$ cat 2.snp.search.txt
ABC

One GNU awk idea that requires a single pass through each input file:

awk '
FNR==NR { snps[$1]=0; next }                        # load 1st file into array; initialize counter (of files containing this snp) to 0

FNR==1  { filecount++                               # 1st line of 2nd-nth files: increment counter of number of files
          delete to_find                            # delete our to_find[] array
          for (snp in snps)                         # make a copy of our master snps[] array ...
               to_find[snp]                         # storing copy in to_find[] array
        }

        { for (snp in to_find) {                    # loop through list of snps 
              if ($0 ~ "\\y" snp "\\y") {           # if current line contains a "word" match on the current snp ...
                 snps[snp]++                        # increment our snp counter (ie, number of files containing this snp)
                 delete to_find[snp]                # no longer need to search current file for this particular snp
#                break                              # if line can only contain 1 snp then uncomment this line
              }
          }

          for (snp in to_find)                      # if we still have an snp to find then ...
              next                                  # skip to next line else ...
          nextfile                                  # skip to next file
        }

END     { PROCINFO["sorted_in"]="@ind_str_asc"
          for (snp in snps)
              if (snps[snp] == filecount)
                 printf "The SNP %s was found in all files\n", snp
        }
' snp.dat *.snp.search.txt

NOTES:

GNU awk is required for the PROCINFO["sorted_in"]="@ind_str_asc" option, which sorts the snps[] array indices; if the ordering of the output messages is not important then this line can be removed from the code. Note that the \y word-boundary operator used in the match is also a GNU awk extension.
since we only process each input file once, we print all SNPs that show up in all files (ie, we won't know if a SNP exists in all files until we've processed the last file, so we might as well print all SNPs that exist in all files)
should be faster than processes that require multiple scans of each input file (especially for larger files and/or a large number of SNPs)

This generates:

The SNP ABC was found in all files