Counting the appearances of a word in a text file in bash

I want to count the number of appearances of each word in a file. I don't want to count a word when it only occurs as a sub-string inside another word. For example, if the word is "in", I don't want to count "inside" as an appearance of the word.

To get the words from the text file I use:

#Stores the words of the file without duplicates
WORDS=`grep -o -E '\w+' "$1" | sort -u -f`

This gets all the words of the file that was passed as a command-line argument.

Then I use the following command to count how many times each word appears:

#Accessing all the words from the file
for WORD in $WORDS
do
    #Number of appearances of the WORD in the text file ($1)
    APEARENCES=`grep -o -i "$WORD" $1 | wc -l`

    ***Code***
done

My problem here is that APEARENCES=`grep -o -i "$WORD" $1 | wc -l` also counts sub-strings of other words when they match "$WORD". For example, it counts "inside" as an appearance of the word "in".

EDIT: I found the solution. Turns out I just needed to add -w to the expression.

APEARENCES=`grep -o -i -w "$WORD" $1 | wc -l`
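Putting the pieces of the question together, the loop with -w might look like the sketch below. It assumes a grep that supports -o, -w and \w (GNU grep does), and count_words is a hypothetical helper name that is not part of the original script.

```shell
#!/usr/bin/env bash
# Sketch only: count_words is a hypothetical helper, not from the original post.
# Prints "<word> <count>" for every distinct word in the file given as $1.
count_words() {
    local file=$1 word count
    # One word per line, then drop duplicates while ignoring case.
    grep -o -E '\w+' "$file" | sort -f -u | while IFS= read -r word; do
        # -w restricts matches to whole words, -o prints one match per line,
        # so wc -l yields the number of occurrences rather than matching lines.
        # The arithmetic expansion strips any padding some wc versions emit.
        count=$(( $(grep -o -i -w -- "$word" "$file" | wc -l) ))
        printf '%s %d\n' "$word" "$count"
    done
}
```

Inside a script it could be called as count_words "$1". Note that it rescans the file once per distinct word, so it is slower on large files than a single sort and uniq -c pass.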

CodePudding user response：

grep has -w to match whole words. uniq has -c to print a count.

grep -Eow '\w+' myfile | sort | uniq -c | sort -nk 1,1

Prints a sorted list of word frequencies in myfile.
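As a rough illustration, the pipeline from this answer (with '\w+' as the word pattern) can be wrapped in a function and run on a small file. The name word_freq and the sample text are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: word_freq is a hypothetical wrapper around the answer's pipeline.
word_freq() {
    # Split into words, group identical words, count each group,
    # then sort numerically by the count column.
    grep -Eow '\w+' "$1" | sort | uniq -c | sort -nk 1,1
}

# Hypothetical sample input, written to a temporary file.
sample=$(mktemp)
printf 'the cat in the hat\ninside the box\n' > "$sample"
word_freq "$sample"
rm -f "$sample"
```

With this input, "in" and "inside" are listed as separate words with a count of 1 each, and "the" comes last with a count of 3.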

Using uniq -i (not POSIX), case can also be ignored:

grep -Eow '\w+' myfile | sort -f | uniq -ic | sort -nk 1,1
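Where uniq -i is not available, one possible workaround is to lowercase the text with tr before counting. This is a sketch rather than part of the original answer: word_freq_ci is a hypothetical name, and grep -o and \w are still assumed to be supported, as in the commands above.

```shell
#!/usr/bin/env bash
# Sketch only: word_freq_ci is a hypothetical helper.
# Case-insensitive word frequencies without relying on uniq -i.
word_freq_ci() {
    # Fold everything to lowercase first so that plain sort and uniq -c
    # treat "The", "the" and "THE" as the same word.
    tr '[:upper:]' '[:lower:]' < "$1" \
        | grep -Eow '\w+' \
        | sort \
        | uniq -c \
        | sort -nk 1,1
}
```

Unlike uniq -i, which prints the first spelling it sees in each group, this variant always reports words in lowercase.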

CodePudding user response：

You can use a word boundary (\b, written as \\b when the pattern is left unquoted in the shell):

$ cat text.txt
"inside as an appearance of the word "in"
$ grep in text.txt
"inside as an appearance of the word "in"
$ grep \\bin\\b text.txt
"inside as an appearance of the word "in"