I want to count the number of appearances of each word in a file. I don't want to count a word when it only appears as a sub-string inside another word. For example, if the word is "in", I don't want to count "inside" as an appearance of the word.
To get the words from the text file I use:
#Stores the words of the file without duplicates
WORDS=`grep -o -E '\w+' "$1" | sort -u -f`
This gets all the words of the file that has been passed as a command-line argument.
Then I use the following command to count how many times each word appears:
#Accessing all the words from the file
for WORD in $WORDS
do
#Number of appearances of the WORD in the text file ($1)
APEARENCES=`grep -o -i "$WORD" $1 | wc -l`
***Code***
done
My problem here is that APEARENCES=`grep -o -i "$WORD" $1 | wc -l` also counts sub-strings of other words if they match "$WORD". It counts "inside" as an appearance of the word "in".
EDIT: I found the solution. Turns out I just needed to add -w to the expression.
APEARENCES=`grep -o -i -w "$WORD" $1 | wc -l`
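For reference, the whole loop with the -w fix can be sketched as a small function. This is only a sketch: count_words and the sample file are hypothetical names used for illustration, and it assumes a grep that supports -o, -w and \w (GNU grep does).

```shell
#!/bin/bash
# Sketch only: count_words is a hypothetical helper, not part of the original script.
count_words() {
    local file=$1 word count
    # Unique words of the file, ignoring case when removing duplicates.
    grep -o -E '\w+' "$file" | sort -u -f | while read -r word; do
        # -w restricts matches to whole words, so "in" does not match "inside".
        count=$(grep -o -i -w -- "$word" "$file" | wc -l)
        printf '%s %d\n' "$word" "$count"
    done
}

# Small demo on a temporary file (made-up sample text).
sample=$(mktemp)
printf 'in the box\ninside the box\n' > "$sample"
result=$(count_words "$sample")
echo "$result"
rm -f "$sample"
```

Reading the word list through a pipe instead of storing it in a variable avoids word-splitting surprises, but the counting command is the same as in the EDIT above.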
CodePudding user response:
grep has -w to match whole words. uniq has -c to print a count.
grep -Eow '\w+' myfile | sort | uniq -c | sort -nk 1,1
Prints a sorted list of word frequencies in myfile.
Using uniq -i (not POSIX), case can also be ignored:
grep -Eow '\w+' myfile | sort -f | uniq -ic | sort -nk 1,1
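As a quick check of the case-insensitive pipeline, here is a sketch that feeds it a few made-up lines on stdin instead of myfile. The sample text and the freq variable are assumptions for illustration, and uniq -i needs a uniq that supports it (GNU and BSD do).

```shell
#!/bin/bash
# Made-up sample input; "In"/"in" and "Box"/"box" differ only in case.
freq=$(printf 'In the box\ninside the Box\nin\n' \
    | grep -Eow '\w+' \
    | sort -f \
    | uniq -ic \
    | sort -nk 1,1)

# One line per distinct word (ignoring case), least frequent first.
echo "$freq"
```

Because uniq only merges adjacent lines, the sort -f step before it is what groups "In" with "in"; "inside" stays a separate entry.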
CodePudding user response:
You can use a word boundary (\b, written \\b when unquoted in the shell):
$ cat text.txt
"inside as an appearance of the word "in"
$ grep in text.txt
"inside as an appearance of the word "in"
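$ grep \\bin\\b text.txt
"inside as an appearance of the word "in"
Both commands print the same line, but the first also matches the "in" at the start of "inside", while the second matches only the standalone word "in".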
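To make the difference visible as numbers, a small sketch can count the individual matches with -o. The variable names are hypothetical, and \b is assumed to be supported by the grep in use (it is a GNU extension).

```shell
#!/bin/bash
# Same sample line as in text.txt above.
line='"inside as an appearance of the word "in"'

# Plain pattern: every occurrence of "in", including the one in "inside".
substring_hits=$(printf '%s\n' "$line" | grep -o 'in' | wc -l)

# Word boundaries on both sides: only the standalone word "in".
whole_word_hits=$(printf '%s\n' "$line" | grep -o '\bin\b' | wc -l)

echo "substring: $substring_hits, whole word: $whole_word_hits"
```

Quoting the pattern in single quotes means a single backslash is enough, which is why the sketch uses '\bin\b' rather than \\bin\\b.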