Using awk to print top 5 words (that are 7 letters or more) in a text file in linux

Update: -

This is the code I used to find the top 5 most frequent 7-letter words in my file:

cat uly.txt | tr -cs "[:alnum:]" "\n"| tr "[:lower:]" "[:upper:]" | awk '{h[$1]++}END{for (i in h){print h[i]" "i}}'| grep -w "\w\{7\}" -w | sort -nr | cat -n | head -n 5

This is the output: (image: the top 5 words as printed)

This is the file link: https://www.gutenberg.org/cache/epub/996/pg996.txt

CodePudding user response：

Would you please try the following:

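tr '[:lower:]' '[:upper:]' < uly.txt | awk '
{
    for (i=1; i<=NF; i++) {
        # strip trailing non-alphanumeric characters from the field
        sub(/[^[:alnum:]]+$/, "", $i)
        # count only words of 7 or more characters
        if (length($i) >= 7) h[$i]++
    }
}
END {
    for (i in h) {print h[i]" "i}
}' | sort -nr | cat -n | head -n 5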

Output:

     1  3 QUIXOTE
     2  3 PROJECT
     3  3 GUTENBERG
     4  2 LOCATED
     5  2 HISTORY
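
To see what the sub() call does on its own, here is a minimal sketch. The function name strip_trailing and the sample line are made up for illustration; they are not part of the answer above.

```shell
# strip_trailing is a hypothetical helper; the sample text is invented.
strip_trailing() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i <= NF; i++) {
            sub(/[^[:alnum:]]+$/, "", $i)   # remove non-alphanumerics at the END of the field only
            print $i
        }
    }'
}

strip_trailing 'QUIXOTE, "HISTORY." (LOCATED)'
```

Because the regex is anchored with $, only trailing punctuation is removed. A leading quote or parenthesis stays attached to the word, so such a word would be counted separately from its bare form.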

CodePudding user response：

If you are using GNU AWK, it can often do what the other text-processing commands in the pipeline do. I will present the steps one at a time, starting from

cat uly.txt | tr -cs "[:alnum:]" "\n"| tr "[:lower:]" "[:upper:]" | awk '{h[$1]++}END{for (i in h){print h[i]" "i}}'| grep -w "\w\{7\}" -w | sort -nr | cat -n | head -n 5

First, GNU AWK has tolower among its String Functions, so we can use it instead of tr to normalize case (lowercasing here, where the original uppercased) as follows

cat uly.txt | tr -cs "[:alnum:]" "\n"| awk '{h[tolower($1)]++}END{for (i in h){print h[i]" "i}}'| grep -w "\w\{7\}" -w | sort -nr | cat -n | head -n 5
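
As a small illustration of why the key is passed through tolower, the sketch below counts one word written in three casings. The function name count_case_insensitive and the input are invented for this example.

```shell
# count_case_insensitive is a hypothetical name; it reads one word per line.
count_case_insensitive() {
    # tolower($1) makes all casings of a word share one array key
    awk '{ h[tolower($1)]++ } END { for (w in h) print h[w], w }'
}

printf '%s\n' Quixote QUIXOTE quixote | count_case_insensitive
```

All three spellings land under the key quixote, so the single output line carries a count of 3.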

There is also the length function, which returns the number of characters, so we can use it instead of grep as follows

cat uly.txt | tr -cs "[:alnum:]" "\n"| awk 'length($1)==7{h[tolower($1)]++}END{for (i in h){print h[i]" "i}}'| sort -nr | cat -n | head -n 5

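A condition before an action means the action runs only if the condition is met. Note that this also reduces memory usage, as counts for words of other lengths are not kept. Like the original grep, length($1)==7 keeps words of exactly 7 characters; use length($1)>=7 for 7 or more.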
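
The condition-before-action form can be tried in isolation. This is only a sketch: keep_seven and the word list are made-up names and data.

```shell
# keep_seven is a hypothetical helper; it reads one word per line.
keep_seven() {
    # the action { print $1 } runs only for records where the condition holds
    awk 'length($1) == 7 { print $1 }'
}

printf '%s\n' the quixote windmill sancho history | keep_seven
```

Only quixote and history are printed, since the other words have 3, 8 and 6 characters.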

GNU AWK is able to sort arrays in various ways, which are described in Using Predefined Array Scanning Orders with gawk. As you use the values for storing counts and want descending order (most common to least common), you should use @val_num_desc. This allows replacing sort

cat uly.txt | tr -cs "[:alnum:]" "\n"| awk 'BEGIN{PROCINFO["sorted_in"]="@val_num_desc"}length($1)==7{h[tolower($1)]++}END{for (i in h){print h[i]" "i}}'| cat -n | head -n 5
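
Here is a minimal sketch of the traversal order on a tiny input. It assumes GNU AWK is installed under the name gawk, since PROCINFO["sorted_in"] is a gawk extension; top_by_count is a hypothetical name.

```shell
# top_by_count is a hypothetical helper; requires gawk (not plain POSIX awk).
top_by_count() {
    gawk 'BEGIN { PROCINFO["sorted_in"] = "@val_num_desc" }   # visit array by value, numerically, descending
          { h[$1]++ }
          END { for (w in h) print h[w], w }'
}

printf '%s\n' b a c a b a | top_by_count
```

With this input the for loop visits a (3), then b (2), then c (1), with no external sort involved.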

Numbering the lines inside the for loop can be done using one of the increment operators. I would use increment-then-return (pluses before the variable, ++j) rather than return-then-increment (pluses after the variable, j++), as you want the numbering to start at 1 rather than 0. This will be used in place of cat -n, that is

cat uly.txt | tr -cs "[:alnum:]" "\n"| awk 'BEGIN{PROCINFO["sorted_in"]="@val_num_desc"}length($1)==7{h[tolower($1)]++}END{for (i in h){print ++j" "h[i]" "i}}'| head -n 5

I used the variable j as it was not already in use. Now that we have access to the line number, it is easy to end processing after outputting 5 words, which will replace head -n 5, that is

cat uly.txt | tr -cs "[:alnum:]" "\n"| awk 'BEGIN{PROCINFO["sorted_in"]="@val_num_desc"}length($1)==7{h[tolower($1)]++}END{for (i in h){print ++j" "h[i]" "i;if(j>=5){exit}}}'

The exit statement allows you to end the program. This is especially handy if you have a huge file but get the answer you need after processing only a relatively small part of it.
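
The counter and the exit statement can be combined in a main rule as well, which is where the early stop saves reading the rest of the input. This is a sketch under that assumption; first_n_numbered and its argument are hypothetical.

```shell
# first_n_numbered is a hypothetical helper: number the lines, stop after $1 of them.
first_n_numbered() {
    # ++j increments before the value is used, so numbering starts at 1;
    # exit stops awk without reading any further input
    awk -v n="$1" '{ print ++j " " $0; if (j >= n) exit }'
}

printf '%s\n' alpha beta gamma delta epsilon | first_n_numbered 3
```

Here only the first three lines are printed, numbered 1 to 3, and delta and epsilon are never processed.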

Rather than preprocessing the file using tr, we can tell GNU AWK what it should consider a record (row) by providing the record separator (RS), in this case

cat uly.txt | awk 'BEGIN{RS="[^[:alnum:]]+";PROCINFO["sorted_in"]="@val_num_desc"}length($1)==7{h[tolower($1)]++}END{for (i in h){print ++j" "h[i]" "i;if(j>=5){exit}}}'

[^[:alnum:]]+ means 1 or more (+) of anything but (^) alphanumeric characters ([:alnum:]). Finally, GNU AWK is not limited to reading standard input: you can pass the file(s) to process as arguments, so we can use that to replace cat and get

awk 'BEGIN{RS="[^[:alnum:]]+";PROCINFO["sorted_in"]="@val_num_desc"}length($1)==7{h[tolower($1)]++}END{for (i in h){print ++j" "h[i]" "i;if(j>=5){exit}}}' uly.txt

Note that piping the output of cat into a tool that can itself read a file is generally considered an antipattern and even has its own name: Useless use of cat
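
For the "7 letters or more" reading of the question, the same approach might look like the sketch below. It assumes gawk is installed under that name; the function top_long_words, the temporary file and the sample text are invented stand-ins for uly.txt.

```shell
# Hypothetical sample text written to a temporary file, standing in for uly.txt.
sample=$(mktemp)
printf '%s\n' 'Quixote rode out; QUIXOTE returned. Windmills, windmills!' \
              'History remembers Quixote.' > "$sample"

# top_long_words is a hypothetical helper: $1 = file to read, $2 = number of words to print.
top_long_words() {
    gawk -v n="$2" '
        BEGIN { RS = "[^[:alnum:]]+"                       # each run of non-alphanumerics ends a record
                PROCINFO["sorted_in"] = "@val_num_desc" }  # traverse by count, descending
        length($1) >= 7 { h[tolower($1)]++ }               # 7 or more characters, case-insensitive
        END { for (w in h) { print ++j " " h[w] " " w; if (j >= n) exit } }
    ' "$1"
}

top_long_words "$sample" 2
rm -f "$sample"
```

On this sample the two lines printed are quixote with a count of 3 and windmills with a count of 2. The file name is passed as an argument, so no cat is needed.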