Update: -
This is the code I used to find the 5 most frequent 7-letter words in my file:
cat uly.txt | tr -cs "[:alnum:]" "\n"| tr "[:lower:]" "[:upper:]" | awk '{h[$1]++}END{for (i in h){print h[i]" "i}}'| grep -w "\w\{7\}" -w | sort -nr | cat -n | head -n 5
This is the output: [image: the top 5 words]
This is the file link: https://www.gutenberg.org/cache/epub/996/pg996.txt
CodePudding user response:
Would you please try the following:
tr '[:lower:]' '[:upper:]' < uly.txt | awk '
{
    for (i=1; i<=NF; i++) {
        # strip trailing punctuation from the word
        sub(/[^[:alnum:]]+$/, "", $i)
        # count words of 7 or more characters
        if (length($i) >= 7) h[$i]++
    }
}
END {
    for (i in h) {print h[i]" "i}
}' | sort -nr | cat -n | head -n 5
Output:
1 3 QUIXOTE
2 3 PROJECT
3 3 GUTENBERG
4 2 LOCATED
5 2 HISTORY
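To see what the sub() call and the length test do, here is a minimal sketch on a made-up sentence. The sample text and the variable names sample and counts are assumptions for illustration, not taken from the file in the question:

```shell
# Hypothetical sample line; not taken from uly.txt.
sample='History, history! A PROJECT; project.'

# Uppercase, strip trailing non-alphanumerics from each field,
# then count only words of 7 or more characters.
counts=$(printf '%s\n' "$sample" | tr '[:lower:]' '[:upper:]' | awk '
{
    for (i = 1; i <= NF; i++) {
        sub(/[^[:alnum:]]+$/, "", $i)
        if (length($i) >= 7) h[$i]++
    }
}
END {
    for (w in h) print h[w] " " w
}' | sort)

printf '%s\n' "$counts"
```

Without the sub() call, "HISTORY," and "HISTORY!" would be counted as two different 8-character words.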
CodePudding user response:
If you are using GNU AWK, you can often do inside AWK what other text-processing commands do. I will present successive steps, starting from
cat uly.txt | tr -cs "[:alnum:]" "\n"| tr "[:lower:]" "[:upper:]" | awk '{h[$1]++}END{for (i in h){print h[i]" "i}}'| grep -w "\w\{7\}" -w | sort -nr | cat -n | head -n 5
First, GNU AWK has tolower among its String Functions, so we can use it instead of tr for case normalization (use toupper instead if you want to keep the uppercase output of the original command), as follows
cat uly.txt | tr -cs "[:alnum:]" "\n"| awk '{h[tolower($1)]++}END{for (i in h){print h[i]" "i}}'| grep -w "\w\{7\}" -w | sort -nr | cat -n | head -n 5
There is also the length function, which returns the number of characters, so we can use it instead of grep, as follows
cat uly.txt | tr -cs "[:alnum:]" "\n"| awk 'length($1)==7{h[tolower($1)]++}END{for (i in h){print h[i]" "i}}'| sort -nr | cat -n | head -n 5
A condition before an action means the action runs only if the condition is met. Note that this also reduces memory usage, since counts for words of other lengths are not kept.
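As a small sketch of the pattern-before-action form, the following runs the same filter on a made-up word list (the list and the variable names words and seven are assumptions for illustration):

```shell
# Made-up word list, one word per line.
words='quixote
Quixote
sancho
windmill
project'

# The pattern length($1)==7 guards the action, so words of any
# other length never enter the array h.
seven=$(printf '%s\n' "$words" |
    awk 'length($1)==7 { h[tolower($1)]++ } END { for (w in h) print h[w] " " w }' |
    sort -nr)

printf '%s\n' "$seven"
```

Here "sancho" (6 letters) and "windmill" (8 letters) are skipped, and the two spellings of "quixote" are merged by tolower.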
GNU AWK is able to sort arrays in various ways, which are described in Using Predefined Array Scanning Orders with gawk. As you use the values for storing numbers and want descending order (most common to least common), you should use @val_num_desc. This allows replacing sort
cat uly.txt | tr -cs "[:alnum:]" "\n"| awk 'BEGIN{PROCINFO["sorted_in"]="@val_num_desc"}length($1)==7{h[tolower($1)]++}END{for (i in h){print h[i]" "i}}'| cat -n | head -n 5
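The effect of the scanning order can be checked in isolation. This is a minimal sketch with made-up keys and counts; since PROCINFO["sorted_in"] is a gawk extension, it assumes gawk is installed under that name and falls back to a message otherwise:

```shell
# PROCINFO["sorted_in"] only works in gawk, so check for it first.
if command -v gawk >/dev/null 2>&1; then
    ordered=$(gawk 'BEGIN {
        PROCINFO["sorted_in"] = "@val_num_desc"
        # made-up counts, inserted in no particular order
        h["rare"] = 1; h["common"] = 9; h["middle"] = 4
        for (w in h) print h[w] " " w
    }')
else
    ordered='gawk not installed'
fi

printf '%s\n' "$ordered"
```

With gawk, the loop visits the elements from the largest value to the smallest, so no external sort is needed.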
Getting consecutive numbers inside the for loop can be done using one of the increment operators. I would use increment-then-return (++ before the variable) rather than return-then-increment (++ after the variable), as you want the numbering to start at 1 rather than 0. This will be used in place of cat -n, that is
cat uly.txt | tr -cs "[:alnum:]" "\n"| awk 'BEGIN{PROCINFO["sorted_in"]="@val_num_desc"}length($1)==7{h[tolower($1)]++}END{for (i in h){print ++j" "h[i]" "i}}'| head -n 5
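The difference between the two increment forms can be seen with a tiny sketch (the variable names pre, post and k are chosen here only for illustration):

```shell
# j and k both start unset, which awk treats as 0 in numeric context.
pre=$(awk 'BEGIN { print ++j }')    # increment first, then return the value
post=$(awk 'BEGIN { print k++ }')   # return the value first, then increment

printf 'pre=%s post=%s\n' "$pre" "$post"
```

The first form yields 1 on first use and the second yields 0, which is why the prefix form gives numbering that starts at 1.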
I used the variable j as it was not used already. Now that we have access to the line number, it is easy to end processing after outputting 5 words, which will replace head -n 5, that is
cat uly.txt | tr -cs "[:alnum:]" "\n"| awk 'BEGIN{PROCINFO["sorted_in"]="@val_num_desc"}length($1)==7{h[tolower($1)]++}END{for (i in h){print ++j" "h[i]" "i;if(j>=5){exit}}}'
The exit statement allows you to end the program. This is especially handy if you have a huge file but get the answer you need after processing only a relatively small part of it.
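In the command above, exit sits in the END block and simply stops the loop. The huge-file case can be sketched with exit in a main rule instead; the six-line input and the variable name first3 are made up for illustration:

```shell
# Made-up input of six lines; awk stops reading after the third.
first3=$(printf '%s\n' a b c d e f | awk '{ print NR " " $0 } NR >= 3 { exit }')

printf '%s\n' "$first3"
```

Lines d, e and f are never processed, because exit ends the program as soon as the third record has been printed.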
Rather than preprocessing the file using tr, we can tell GNU AWK what it should consider a row by providing a row separator (RS), in this case
cat uly.txt | awk 'BEGIN{RS="[^[:alnum:]]+";PROCINFO["sorted_in"]="@val_num_desc"}length($1)==7{h[tolower($1)]++}END{for (i in h){print ++j" "h[i]" "i;if(j>=5){exit}}}'
[^[:alnum:]]+ means 1 or more (+) of anything but (^) alphanumeric characters ([:alnum:]). Finally, GNU AWK is not limited to using standard input: you can pass the file(s) to process as arguments, so we can use that to replace cat and get
awk 'BEGIN{RS="[^[:alnum:]]+";PROCINFO["sorted_in"]="@val_num_desc"}length($1)==7{h[tolower($1)]++}END{for (i in h){print ++j" "h[i]" "i;if(j>=5){exit}}}' uly.txt
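To check how the regular-expression row separator splits text, here is a minimal sketch on a made-up sentence. A multi-character regex RS is not guaranteed by POSIX awk, so the sketch assumes gawk is installed under that name and falls back to a message otherwise; the variable name records is chosen for illustration:

```shell
# A regex RS is an extension, so check for gawk first.
if command -v gawk >/dev/null 2>&1; then
    # Every run of non-alphanumeric characters ends a record,
    # so each word becomes its own row.
    records=$(printf 'Don Quixote, by Cervantes!\n' |
        gawk 'BEGIN { RS = "[^[:alnum:]]+" } { print NR ": " $0 }')
else
    records='gawk not installed'
fi

printf '%s\n' "$records"
```

The comma-plus-space after "Quixote" counts as a single separator because of the +, so no empty rows appear between the words.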
Note that piping the output of cat into a tool which can itself read a file is generally considered an antipattern and even has its own name: Useless use of cat