Find files that contain a string and do not contain another

Time:07-21

How do I find the files in the current directory that contain one specific string but do not contain another (using grep)? I have most of the work done with this:

grep -L --include=\*.tf --include=file.hcl --exclude=versions.tf -rnw '.' -e 'string\s*\"string\"\s*'

and I have a list of files that don't contain this string. What should I add?

CodePudding user response:

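You can make use of a grep pipeline. The following two commands produce the same list of files and are roughly equivalent in speed; which one is slightly faster depends on how many files survive the first stage: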

$ grep -lZ have_string * | xargs -0 grep -L avoid_string
$ grep -LZ avoid_string * | xargs -0 grep -l have_string
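
To see the pipeline at work on the recursive case from the question, here is a small self-contained sketch. The file names (a.tf, b.tf, c.tf) and their contents are made up for illustration; have_string and avoid_string are the placeholders from above. It assumes GNU grep and an xargs that supports -0 and -r (-r keeps the second grep from running when the first stage finds nothing).

```shell
#!/bin/sh
# Hypothetical sample data: names and contents are invented for this demo.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'have_string\n'               > a.tf   # wanted: good string only
printf 'have_string\navoid_string\n' > b.tf   # rejected: contains both
printf 'something else\n'            > c.tf   # rejected: contains neither

# Stage 1 lists the files that contain have_string (NUL-separated via -Z).
# Stage 2 keeps only those that do NOT contain avoid_string.
result=$(grep -rlZ --include='*.tf' -e 'have_string' . |
         xargs -0 -r grep -L -e 'avoid_string')

echo "$result"   # prints ./a.tf
```

The -Z and -0 pair is there so that file names containing spaces or newlines pass through the pipe intact.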

However, when you process a lot of files, it might be better to avoid reading files twice as this might be time-consuming. A single awk program might be beneficial in this case as it will only read the files a single time and perform early terminations if needed:

$ awk 'BEGINFILE{f=0}
       /avoid_string/{f=0; nextfile}
       /have_string/{f=1}
       ENDFILE{if(f) print FILENAME}
      ' *

This uses BEGINFILE and ENDFILE, which are GNU awk extensions and not part of POSIX. nextfile is more widely available (mawk and BWK awk also provide it), though older awks may lack it. The awk lines do the following:

  • BEGINFILE{f=0}: when a file is opened for reading, set the flag f to zero. This flag keeps track of whether the string have_string has been found in the file.
  • /avoid_string/{f=0; nextfile}: for every line, check if the string "avoid_string" is part of the line. If it is, clear the flag and move to the next file. The flag has to be cleared because GNU awk still runs the ENDFILE rule after nextfile, so a file in which have_string appeared before avoid_string would otherwise be printed.
  • /have_string/{f=1}: for every line, check if the string "have_string" is part of the line. If so, update the flag f to indicate we found the string.
  • ENDFILE{if(f) print FILENAME}: at the end of the file, check if we found the string have_string. If we did, print the filename.

CodePudding user response:

Or you can make it much simpler, without all the extra blocks, and in a much more portable fashion:

 mawk '
 NF >= /[.][0-9]+$/ {

    if (NF == !_) {
         print FILENAME, $0
    }
    nextfile }' FS='N$' OFS='\f' \
           <(  date -j  '+%T %D %s.%N' ) \
           <( gdate     '+%T %D %s.%N' )
/dev/fd/12
          08:16:47 07/21/22 1658405807.574681000

Because BSD date lacks nanoseconds, placing the GNU date %N format specifier there results in hideous epochs like 1658405807.N

So in this example, the unwanted string is the BSD date one with the N at the tail (which I set as FS). The overall condition is true when either NF is greater than 1 (meaning the unwanted string was found), or when the regex on the right-hand side matches (the desired string).

When either hits, it enters the printing section. This is obviously a simplified scenario. In general, you can use an array to save all the filenames in which you've found the desirable string, and delete them if the bad one suddenly shows up.

Instead of checking for FNR == 1 all the time (which, emulated properly, has a similar effect to BEGINFILE { }; it is slightly more hassle because of the workarounds for proper NR/FNR tracking, but much more portable), simply add to the array when you've found the good string(s), delete from it and immediately nextfile once the bad one shows up, and only once all the files have been processed, in an END { } block, print out all the filenames still left inside the array.
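
A minimal sketch of that array approach, under a few assumptions: the sample files and their contents are invented for the demo, have_string and avoid_string are the placeholder patterns from the first answer, and the awk in use provides nextfile (gawk, recent mawk and BWK awk do). The array name good is arbitrary.

```shell
#!/bin/sh
# Hypothetical sample data: names and contents are invented for this demo.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'have_string\n'               > a.txt  # good string only
printf 'have_string\navoid_string\n' > b.txt  # good first, then bad
printf 'avoid_string\nhave_string\n' > c.txt  # bad first, then good
printf 'nothing relevant\n'          > d.txt  # neither string
printf 'x\nhave_string\n'            > e.txt  # good string only

result=$(awk '
    # Bad string: forget this file and skip the rest of it.
    /avoid_string/ { delete good[FILENAME]; nextfile }
    # Good string: remember the file name.
    /have_string/  { good[FILENAME] = 1 }
    # Whatever is still in the array had the good string and never the bad one.
    END { for (name in good) print name }
' a.txt b.txt c.txt d.txt e.txt | sort)

echo "$result"   # prints a.txt and e.txt, one per line
```

The sort at the end is only there because awk does not guarantee any order for a for-in loop over an array.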

My example used the shell <( ) notation (process substitution) to emulate files coming in; that's why the printed filename looks awkward, but you can see it did capture the GNU date one instead of the BSD one.
