awk FS vs FPAT puzzle and counting words

Suppose I have the file:

$ cat file
This, that;
this-that or this.

Now I want to count words (with a word defined as one or more ASCII letters, compared case-insensitively). On a typical POSIX *nix you could do:

sed -nE 's/[^[:alpha:]]+/ /g; s/ $//p' file | tr ' ' "\n" | tr '[:upper:]' '[:lower:]' | sort | uniq -c
   1 or
   2 that
   3 this

With grep you can shorten that a bit to only match what you define as a word:

grep -oE '[[:alpha:]]+' file | tr '[:upper:]' '[:lower:]' | sort | uniq -c
# same output

With GNU awk, you can use FPAT to replicate matching only what you want (ignore sorting...):

gawk -v FPAT="[[:alpha:]]+" '
{for (i=1;i<=NF;i++) {seen[tolower($i)]++}}
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
   3 this
   1 or
   2 that

Now trying to replicate in POSIX awk I tried:

awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
   2 
   3 this
   1 or
   2 that

Note the 2 with blank at top. This is from having blank fields from ; at the end of line 1 and . at the end of line 2. If you delete the punctuation at line's end, this issue goes away.
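A minimal sketch of where the blank entry comes from, using only the first sample line. The helper name `show_fields` is made up for illustration. It prints NF and then each field in brackets, so the empty field after the trailing `;` becomes visible:

```shell
# Hypothetical helper: show how a regex FS splits each input line.
show_fields() {
    awk -F'[^[:alpha:]]+' '{
        print "NF=" NF
        for (i = 1; i <= NF; i++) printf "[%s]\n", $i
    }'
}

# The trailing ";" is a separator with nothing after it,
# so awk reports a third, empty field.
printf 'This, that;\n' | show_fields
# prints:
# NF=3
# [This]
# [that]
# []
```

A separator at the end of a record always leaves an empty field behind it, which is what gets counted under the blank key.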

You can partially fix it (for all but the last line) by setting RS="" in the awk, but still get a blank field with the last (only) line.
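For reference, a sketch of that RS="" variant. The function name `count_paragraph_mode` is only illustrative, and the keys are printed in brackets so the empty one is visible. In paragraph mode the whole sample is one record, so the `;` plus newline in the middle acts as an ordinary separator and only the final `.` leaves an empty field:

```shell
# Hypothetical helper: count fields in paragraph mode (RS="").
count_paragraph_mode() {
    awk 'BEGIN { RS = ""; FS = "[^[:alpha:]]+" }
    { for (i = 1; i <= NF; i++) seen[tolower($i)]++ }
    END { for (e in seen) printf "%d [%s]\n", seen[e], e }' | LC_ALL=C sort
}

# The two sample lines form a single record here.
printf 'This, that;\nthis-that or this.\n' | count_paragraph_mode
```

With the sample input this should report the empty key once instead of twice, alongside the expected counts for `or`, `that` and `this`.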

I can also fix it this way:

awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) if ($i) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file

Which seems a little less than straightforward.

Is there an idiomatic fix I am missing to make POSIX awk act similarly to GNU awk's FPAT solution here?

CodePudding user response：

With POSIX awk, I'd use match and the builtin RSTART and RLENGTH variables:

#!awk
{
    s = $0
    while (match(s, /[[:alpha:]]+/)) {
        word = substr(s, RSTART, RLENGTH)
        count[tolower(word)]++
        s = substr(s, RSTART+RLENGTH)
    }
    }
}
END {
    for (word in count) print count[word], word
}

$ awk -f countwords.awk file
1 or
3 this
2 that

Works with the default BSD awk on my Mac.
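Another possible POSIX-only variant, offered as a sketch rather than a tested recommendation: replace every run of non-letters with a single space and let the default field splitting do the rest, since the default FS ignores leading and trailing blanks. The function name `count_words` and the final `sort` are additions for illustration only:

```shell
# Hypothetical helper: count words using gsub plus default field splitting.
count_words() {
    awk '{
        # Assigning to $0 via gsub makes awk re-split the record.
        gsub(/[^[:alpha:]]+/, " ")
        for (i = 1; i <= NF; i++) seen[tolower($i)]++
    }
    END { for (w in seen) printf "%4d %s\n", seen[w], w }' | LC_ALL=C sort -k2
}

printf 'This, that;\nthis-that or this.\n' | count_words
# prints:
#    1 or
#    2 that
#    3 this
```

Because the trailing punctuation becomes a trailing space, no empty field is created and no `if ($i)` guard is needed.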

CodePudding user response：

With your shown samples, please try the following awk code. It is written and tested in GNU awk, in case you are OK with an RS-based approach.

awk -v RS='[[:alpha:]]+' '
RT{
  val[tolower(RT)]++
}
END{
  for(word in val){
    print val[word], word
  }
}
' Input_file

Explanation: RS is set to the regex [[:alpha:]]+, so each word acts as a record separator, and GNU awk stores the text that matched RS in the RT variable. The main block runs only when RT is non-empty and increments val[tolower(RT)], so each word is counted case-insensitively. The END block then traverses the array and prints each count followed by its word.