grep: Search for multiple strings in files recursively, to find source code

Time:02-11

I am fairly confident this can't be done with grep, unless there are some features that I don't know about. However I am hoping that if this is the case there might be some other Linux/Unix command line tool which will do the job I want.

This is a frequent problem when working with source code, so I am pretty sure there must be an adequate solution.

Problem:

I am working with some C++ source code, and I want to be able to grep for objects in my code to find the files containing the relevant information.

Here is a simple example:

  • Search for all files which contain matches for "MyClass" in the namespace "MYNAMESPACE".
  • Assume that although MyClass and MYNAMESPACE appear to be likely to be unique strings, in general they might not be.
  • In my case, the namespace "MYNAMESPACE" appears in hundreds of source files.
  • The actual name of the class I am searching for is "Parameter", which is such a generic word that it too appears in hundreds of files.

Here is what I want a grep-like tool to do:

  • Specify a list of words to search for
  • Return the list of files found where ALL words in the list of search words are found in the same file
  • Do this recursively to obtain all results in all files in a directory

Surely there is a way to do this? This is essentially a filtering problem: Take all the files found (recursively) inside a directory, and apply a filter to them for each of the words in the input list. Files pass the filter if they contain at least one instance of each word.
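
The filter described above can be sketched directly in shell by chaining one grep per word. This is only a minimal sketch, assuming GNU grep, find and xargs (the -Z, -print0, -0 and -r options); the function names filter_all and grep_all are made up for illustration, and each word is matched as a fixed string (-F), not as a regular expression.

```shell
# filter_all WORD... (hypothetical helper)
# Reads NUL-terminated file names on stdin and passes on only those files
# that contain every WORD as a fixed string.
filter_all() {
    if [ "$#" -eq 0 ]; then
        cat    # no words left: every file that got this far passes
        return
    fi
    # Keep the files containing the first word, then recurse on the rest.
    xargs -0 -r grep -lZF -e "$1" -- | { shift; filter_all "$@"; }
}

# grep_all DIR WORD... (hypothetical helper)
# Prints, one per line, the files under DIR that contain all WORDs.
grep_all() {
    dir=$1
    shift
    find "$dir" -type f -print0 | filter_all "$@" | tr '\0' '\n'
}
```

With these definitions, something like `grep_all . MYNAMESPACE Parameter` would list the files under the current directory that contain both strings. Because every stage only sees the files that survived the previous one, putting the rarest word first keeps the later greps cheap.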

CodePudding user response:

Maybe grep piped to xargs grep?

]$ grep -rl "NAMESPACE" | xargs grep -l "Parameter"
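
One caveat with this pipeline: plain xargs splits its input on whitespace, so a path containing a space would be broken apart. A more defensive variant, sketched here under the assumption of GNU grep and xargs (-Z, -0 and -r are GNU options), passes the file names NUL-terminated instead. The function name find_both is hypothetical, and -F makes both words fixed strings rather than regular expressions.

```shell
# find_both DIR WORD1 WORD2 (hypothetical helper name)
# Lists the files under DIR that contain both fixed strings.
#   first grep:  -r recurse, -l print file names only, -Z end each name with NUL
#   xargs:       -0 read NUL-terminated names, -r run nothing on empty input
#   second grep: -l print the names of the files that also contain WORD2
find_both() {
    grep -rlZF -e "$2" -- "$1" | xargs -0 -r grep -lF -e "$3" --
}
```

For the example above this would be called as `find_both . NAMESPACE Parameter`.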

With these four files:

]$ tail -n +1 *.txt
==> needle_1.txt <==
...NAMESPACE...
...
...Parameter...

==> needle_2.txt <==
...Parameter...
...
...NAMESPACE...

==> not_needle_1.txt <==
...
...NAMESPACE...
...

==> not_needle_2.txt <==
...
...Parameter...
...

placed in each sub-directory (including .) of:

.
├── dir_1
│   ├── dir_1
│   └── dir_2
└── dir_2
    ├── dir_1
    └── dir_2

the result is:

]$ grep -rl "NAMESPACE" | xargs grep -l "Parameter" | sort
dir_1/dir_1/needle_1.txt
dir_1/dir_1/needle_2.txt
dir_1/dir_2/needle_1.txt
dir_1/dir_2/needle_2.txt
dir_1/needle_1.txt
dir_1/needle_2.txt
dir_2/dir_1/needle_1.txt
dir_2/dir_1/needle_2.txt
dir_2/dir_2/needle_1.txt
dir_2/dir_2/needle_2.txt
dir_2/needle_1.txt
dir_2/needle_2.txt
needle_1.txt
needle_2.txt
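
To try this out, the test tree can be rebuilt with a few lines of shell. This is just a sketch that recreates the layout and file contents shown above in a scratch directory from mktemp (an arbitrary choice); it assumes GNU grep, which searches the working directory when -r is given without a file argument.

```shell
# Build the test tree in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p dir_1/dir_1 dir_1/dir_2 dir_2/dir_1 dir_2/dir_2

# Place the four sample files in every directory, including the top level.
for d in . dir_1 dir_1/dir_1 dir_1/dir_2 dir_2 dir_2/dir_1 dir_2/dir_2; do
    printf '...NAMESPACE...\n...\n...Parameter...\n' > "$d/needle_1.txt"
    printf '...Parameter...\n...\n...NAMESPACE...\n' > "$d/needle_2.txt"
    printf '...\n...NAMESPACE...\n...\n'             > "$d/not_needle_1.txt"
    printf '...\n...Parameter...\n...\n'             > "$d/not_needle_2.txt"
done

# Same pipeline as above; only the needle_*.txt files should be listed.
result=$(grep -rl "NAMESPACE" | xargs grep -l "Parameter" | sort)
printf '%s\n' "$result"
```

Seven directories with two matching files each give the 14 paths listed above.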

CodePudding user response:

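I suggest using a gawk script (gawk is the default awk on many Linux distributions; the ENDFILE rule used below is a gawk extension). It scans each file once for all the words, reading each file as a single record.

It counts the matched words in each file and prints the file name only if all words matched.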

script.awk

BEGIN {
  RS = "!@!@!@!@!@!@!@"; # set record separator to a string unlikely to occur, so each file is read entirely as a single record
  getline wordsListStr < wordsListFile; # read all of wordsListFile into the single string wordsListStr
  close(wordsListFile);
  sub(/\n+$/, "", wordsListStr); # drop trailing newlines so split() does not produce an empty last word
  wordsListCount = split(wordsListStr, wordsListArr, "\n"); # split wordsListStr on newlines into array wordsListArr, save array length in wordsListCount
  for (i in wordsListArr) wordsMatchArr[i] = 0; # reset array wordsMatchArr to 0
}
{ # for each file (read as a single record)
  for (i in wordsListArr) { # for each search word (i is the array index, wordsListArr[i] the word)
    if ($0 ~ wordsListArr[i]) wordsMatchArr[i] = 1; # if the word (used as a regular expression) matched, mark it in wordsMatchArr
  }
}
ENDFILE { # post-processing for each file
  for (i in wordsListArr) { # scan wordsListArr
    wordsMatchCountInFile += wordsMatchArr[i]; # count number of matched words
    wordsMatchArr[i] = 0; # reset wordsMatchArr for next file
  }
  if (wordsMatchCountInFile == wordsListCount) print FILENAME; # print current file if all words matched
  wordsMatchCountInFile = 0; # reset words counter in file
}

Testing files

input.1.txt

word1
word2
word3

input.2.txt

word1
word2

input.3.txt

word3
word7
word8

input.4.txt

word3
word3
word7
word8
word1
word1
word7
word2
word8

testing output:

awk -v wordsListFile=input.1.txt -f script.awk input.{2,3,4}.txt
input.4.txt

scanning all C++ files under the current directory

awk -v wordsListFile=nameSpacesListFile.txt -f script.awk $(find . -type f -name "*.cpp")