How to display all the files in a directory that contain a word a certain number of times?

I currently use the command find -name "*.xml" | xargs grep -c -H -w "word" to list all the XML files in a directory that contain a specific word, along with how many times it appears. It works quite well (the word I search for appears at most once per line; otherwise I would need another solution, because grep -c counts matching lines, not occurrences):

./file1.xml:2
./file2.xml:3
./file3.xml:6

but what really interests me is displaying only the files with the most matches.

Does anyone know how to filter this output down to the files with the highest counts?

thx!

CodePudding user response：

Pipe the output to awk and keep only the lines whose count (the last colon-separated field) exceeds a threshold, here 3:

find -name "*.xml" | xargs grep -c -H -w "word" | awk -F: '$NF > 3'
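
The threshold above has to be chosen by hand. If the goal is instead the files tied for the highest count, a second awk stage can work the maximum out itself. A minimal sketch, assuming made-up sample files in a temporary directory (the names file1.xml to file3.xml and their contents are only for illustration):

```shell
# Throwaway sample data (hypothetical file names and contents).
dir=$(mktemp -d)
printf 'word\nword\n'          > "$dir/file1.xml"   # 2 matching lines
printf 'word\nword\nword\n'    > "$dir/file2.xml"   # 3 matching lines
printf 'word\nx\nword\nword\n' > "$dir/file3.xml"   # 3 matching lines

# Remember every "name:count" line, track the largest count,
# then print only the lines whose count equals that maximum.
top=$(cd "$dir" && grep -c -H -w "word" ./*.xml |
    awk -F: '
        { count[$0] = $NF + 0; if (count[$0] > max) max = count[$0] }
        END { for (line in count) if (count[line] == max) print line }
    ' | sort)

printf '%s\n' "$top"
rm -rf "$dir"
```

The final sort is only there to make the order predictable, since awk does not guarantee any order for "for (line in count)".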

CodePudding user response：

With GNU grep, this can be done without resorting to find and xargs:

grep -rHcw --include='*.xml' "word" . | awk -F: '$NF>2'

It is also possible to order the output by occurrence count :

grep -rHcw --include='*.xml' "word" . |
    sed 's/.*:\(.*\)/\1 &/' |
    sort -n |
    sed 's/[^ ]* //'
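
To see what the two sed stages are doing, here is a small sketch that feeds the same pipeline some hand-written "name:count" lines (the file names are invented for illustration). The first sed copies the text after the last colon to the front of the line, sort -n orders on that number, and the second sed strips it again. Because the count is taken from after the last colon, a file name that itself contains a colon does not confuse the sort:

```shell
# Hand-written sample lines standing in for grep -c output (hypothetical names).
sorted=$(printf '%s\n' './b.xml:10' './a:1.xml:2' './c.xml:0' |
    sed 's/.*:\(.*\)/\1 &/' |   # "./b.xml:10" becomes "10 ./b.xml:10"
    sort -n |                   # numeric sort on the leading count
    sed 's/[^ ]* //')           # drop the leading count again

printf '%s\n' "$sorted"
```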

CodePudding user response：

Cleaner to do it all in one if you can. awk can handle that alone, more efficiently.

$: grep -cHw word ?.xml */?.xml # these are the files we'll scan for "word"
a.xml:4
b.xml:2
c.xml:0
foo/d.xml:1

First, you could use shopt -s globstar and pass your file list as **/*.xml, but there are some possible issues - then again, you aren't telling find to only give you files, either...

If you are just trying to find the file with the MOST hits,

$: awk '/\<word\>/{hits[FILENAME]++} END{ for (f in hits) { if (hits[f] > max) { max = hits[f]; file=f } } print file":"max }' ?.xml */?.xml
a.xml:4

...but if you just want the most hits, and don't want to use awk -

find -name "*.xml" | xargs grep -cHw "word" | sort -t: -k2,2nr | head -n 1

If you're doing that, you should probably use -type f -

find -name "*.xml" -type f | xargs grep -cHw word | sort -t: -k2,2nr | head -n 1

You could use

find -name "*.xml" -type f -exec grep -cHw word {} \; | sort -t: -k2,2nr | head -n 1

but -exec ... \; starts one grep per file, whereas xargs passes many file names to each grep, so xargs should spawn fewer processes overall. It may not matter much unless there are a LOT of files.
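
A middle ground is the -exec ... {} + form that POSIX find provides: it batches file names into as few grep invocations as possible, much as xargs does, and it is not tripped up by spaces in file names. A minimal sketch, assuming a made-up directory tree (the names below are hypothetical):

```shell
# Throwaway sample tree, including a path with spaces (hypothetical names).
dir=$(mktemp -d)
mkdir "$dir/sub dir"
printf 'word\n'             > "$dir/a.xml"             # 1 matching line
printf 'word\nword\nword\n' > "$dir/sub dir/b c.xml"   # 3 matching lines

# find hands batches of files to grep; sort numerically, descending, on the count.
best=$(cd "$dir" && find . -name '*.xml' -type f -exec grep -cHw word {} + |
    sort -t: -k2,2nr | head -n 1)

printf '%s\n' "$best"
rm -rf "$dir"
```

Note that sort -t: -k2 still assumes the file names contain no colons.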

If what you want is files with more than a certain minimum number of hits -

$: awk -v min=3 'FNR==1{cnt=0} /\<word\>/{ if (++cnt >= min) { print FILENAME": "min" or more"; nextfile } }' ?.xml */?.xml
a.xml: 3 or more

$: awk -v min=2 'FNR==1{cnt=0} /\<word\>/{ if (++cnt >= min) { print FILENAME": "min" or more"; nextfile } }' ?.xml */?.xml
a.xml: 2 or more
b.xml: 2 or more

This also lets you easily check only specific columns, control exactly what output you want, short-circuit out early on large files if you like, scan multiple patterns and give each one differing behavior, do BEGIN setup or END summary processing, and a lot more. You just have to understand awk a bit and write the logic in as much or as little detail as you need.
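
As one example of that flexibility, awk can count every occurrence of the word rather than every matching line, which covers the case where the word appears more than once per line. A minimal sketch, assuming made-up sample files; it splits each line on non-word characters and compares fields, so it does not rely on the \< and \> operators (GNU awk extensions, as far as I know):

```shell
# Throwaway sample data (hypothetical file names and contents).
dir=$(mktemp -d)
printf 'word word\nother word\n' > "$dir/a.xml"   # 3 occurrences on 2 lines
printf 'word\nwordy\n'           > "$dir/b.xml"   # 1 occurrence; "wordy" does not count

# Every field that is exactly "word" is one occurrence for the current file.
result=$(cd "$dir" && awk -F'[^A-Za-z0-9_]+' '
    { for (i = 1; i <= NF; i++) if ($i == "word") hits[FILENAME]++ }
    END { for (f in hits) print f ":" hits[f] }
' a.xml b.xml | sort)

printf '%s\n' "$result"
rm -rf "$dir"
```

Files with no occurrences at all are simply not listed, unlike with grep -c, which reports them with a count of 0.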

If -type f doesn't really matter, try shopt -s globstar so that ** matches any depth of subdirectories. To get exactly the same file names in the output as the find solution, use ./**/*.xml

$: shopt -s globstar
$: awk -v min=2 'FNR==1{cnt=0} /\<word\>/{ if (++cnt >= min) { print FILENAME": "min" or more"; nextfile } }' ./**/*.xml
./a.xml: 2 or more
./b.xml: 2 or more

$: awk '/\<word\>/{hits[FILENAME]++} END{ for (f in hits) { if (hits[f] > max) { max = hits[f]; file=f } } print file":"max }' ./**/*.xml
./a.xml:4

One process, instead of at least five (find, xargs, grep, sort, head) for the find|xargs|sort|head pipeline.