How to count frequency of a word without counting compound words in bash?

Time:05-23

I am using this command to count how often a word occurs in a text file in bash:

grep -ow -i "and" "$1" | wc -l

It counts every "and" in the file, including those that are part of hyphenated compound words such as jerry-and-jeorge. I want to ignore those and count only the standalone occurrences of "and".
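
The behaviour described above can be reproduced with a short sketch. The sample string is made up for illustration; the point is that -w treats a hyphen as a non-word character, so the "and" inside the compound still qualifies as a whole word.

```shell
#!/bin/bash
# Illustrative sample: one hyphenated compound plus three standalone "and"s.
s='jerry-and-jeorge, and, aNd, And.'

# -o prints each match on its own line, so wc -l counts matches.
# -w accepts the "and" in "jerry-and-jeorge" because "-" is not a word character.
naive_count=$(grep -owi 'and' <<< "$s" | wc -l)

# Four matches are counted, one more than the three standalone words.
echo "$naive_count"
```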

CodePudding user response:

With GNU grep, you can use the following command to count the occurrences of "and" that have no hyphen immediately before or after them:

grep -ioP '\b(?<!-)and\b(?!-)' "$1" | wc -l

Details:

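  • -i makes the match case-insensitive, and -o prints each match on its own line so that wc -l counts matches rather than matching lines
  • -P enables the PCRE regex syntax, which is needed for the lookarounds
  • \b(?<!-)and\b(?!-) matches: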
    • \b - a word boundary
    • (?<!-) - a negative lookbehind that fails the match if there is a hyphen immediately to the left of the current location
    • and - a fixed string
    • \b - a word boundary
    • (?!-) - a negative lookahead that fails the match if there is a hyphen immediately to the right of the current location.
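
To see what each lookaround contributes, the pattern can be built up one piece at a time. This is only a sketch: the sample string and variable names are made up for illustration, and it assumes a GNU grep built with -P support.

```shell
#!/bin/bash
# Made-up sample: a hyphen before, a hyphen after, and a standalone "and".
s='x-and and-y and'

# Word boundaries only: all three occurrences match.
boundaries_only=$(grep -oP '\band\b' <<< "$s" | wc -l)

# Adding the lookbehind rejects "x-and" (hyphen on the left).
with_lookbehind=$(grep -oP '\b(?<!-)and\b' <<< "$s" | wc -l)

# Adding the lookahead as well rejects "and-y" (hyphen on the right).
with_both=$(grep -oP '\b(?<!-)and\b(?!-)' <<< "$s" | wc -l)

# The three counts, from least to most restrictive pattern.
echo "$boundaries_only" "$with_lookbehind" "$with_both"
```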

A quick demo:

#!/bin/bash
s='jerry-and-jeorge, and, aNd, And.'
grep -ioP '\b(?<!-)and\b(?!-)' <<< "$s" | wc -l
# => 3 (not 4)
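
If grep -P is not available (for example with a non-GNU grep), one possible fallback is to split the text into tokens first and then count exact matches. This is a different technique from the lookaround pattern above, not an equivalent of it: it assumes words consist only of ASCII letters and hyphens, so digits and underscores act as separators here, whereas \b treats them as word characters.

```shell
#!/bin/bash
s='jerry-and-jeorge, and, aNd, And.'

# Turn every run of characters other than letters and hyphens into a newline,
# so a hyphenated compound stays together as a single token.
# Then count the lines that are exactly "and", ignoring case (-x = whole line).
portable_count=$(tr -cs 'A-Za-z-' '\n' <<< "$s" | grep -cix 'and')

echo "$portable_count"
```

For a file, the same pipeline can read from it directly: tr -cs 'A-Za-z-' '\n' < "$1" | grep -cix 'and'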