awk FS vs FPAT puzzle and counting words but not blank fields

Suppose I have the file:

$ cat file
This, that;
this-that or this.

(Punctuation at the line end is not always there...)

Now I want to count words, with a word defined as one or more ASCII letters, compared case-insensitively. With typical POSIX *nix tools you could do:

sed -nE 's/[^[:alpha:]]+/ /g; s/ $//p' file | tr ' ' "\n" | tr '[:upper:]' '[:lower:]' | sort | uniq -c
   1 or
   2 that
   3 this

With grep you can shorten that a bit to only match what you define as a word:

grep -oE '[[:alpha:]]+' file | tr '[:upper:]' '[:lower:]' | sort | uniq -c
# same output

With GNU awk, you can use FPAT to replicate matching only what you want (ignore sorting...):

gawk -v FPAT="[[:alpha:]]+" '
{for (i=1;i<=NF;i++) {seen[tolower($i)]++}}
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
   3 this
   1 or
   2 that

Now trying to replicate in POSIX awk I tried:

awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
   2 
   3 this
   1 or
   2 that

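Note the count of 2 for an empty word at the top. It comes from the empty trailing fields created by the ; at the end of line 1 and the . at the end of line 2: a separator at the end of a record leaves an empty field after it. If you delete the punctuation at the end of each line, this issue goes away.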
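To see where the empty field comes from, a small sketch can print NF and every field for each line. The helper name show_fields is hypothetical, and [^A-Za-z]+ is used here as an assumed ASCII-only stand-in for [^[:alpha:]]+ so that it also runs on awks without POSIX character classes:

```shell
# show_fields (hypothetical helper): print NF and each field in brackets,
# splitting every input line on runs of non-letters.
show_fields() {
    awk -F '[^A-Za-z]+' '{
        printf "NF=%d:", NF
        for (i = 1; i <= NF; i++) printf " [%s]", $i
        print ""
    }'
}

printf 'This, that;\nthis-that or this.\n' | show_fields
```

For the sample file this should print NF=3: [This] [that] [] and NF=5: [this] [that] [or] [this] [], showing the empty last field left by the trailing punctuation.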

You can partially fix it (for all but the last line) by setting RS="" in the awk script, but you still get a blank field from the end of the last line, since the whole file is then read as a single record.

I can also fix it this way:

awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) if ($i) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file

Which seems a little less than straightforward.

Is there an idiomatic fix I am missing to make POSIX awk act similarly to GNU awk's FPAT solution here?

CodePudding user response：

This should work in POSIX/BSD or any version of awk:

awk -F '[^[:alpha:]]+' '
{for (i=1; i<=NF; ++i) ($i != "") && ++count[tolower($i)]}
END {for (e in count) printf "%4s %s\n", count[e], e}' file

   1 or
   3 this
   2 that

By using -F '[^[:alpha:]]+' we are splitting fields on any run of non-alpha characters.
The ($i != "") condition makes sure that only non-empty fields are counted in count.
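As an alternative sketch under the same splitting rule, one could trim the non-letters from both ends of each record before the loop, so that no empty field is produced in the first place. The helper name count_words_trimmed is hypothetical, [^A-Za-z]+ is an assumed ASCII-only stand-in for [^[:alpha:]]+, and the trailing sort is only there to make the output order deterministic:

```shell
# count_words_trimmed (hypothetical helper): count words case-insensitively,
# trimming separators at both ends of each record before the fields are used.
count_words_trimmed() {
    awk -F '[^A-Za-z]+' '
    {
        sub(/^[^A-Za-z]+/, "")      # drop leading separators; $0 is re-split
        sub(/[^A-Za-z]+$/, "")      # drop trailing separators
        for (i = 1; i <= NF; i++) count[tolower($i)]++
    }
    END { for (e in count) printf "%4s %s\n", count[e], e }' | sort
}

printf 'This, that;\nthis-that or this.\n' | count_words_trimmed
```

A line made up only of punctuation becomes an empty record with NF equal to 0, so the loop body never runs for it.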

CodePudding user response：

Using RS instead:

$ gawk -v RS="[^[:alpha:]]+" '  # [^a-zA-Z]+ or something for some awks
$0 {                            # skip a possible leading null string
    a[tolower($0)]++
}
END {
    for(i in a)
        print i,a[i]
}' file

Output:

this 3
or 1
that 2

Tested successfully on gawk and Mac awk (version 20200816), and on mawk and busybox awk using [^a-zA-Z]+.

CodePudding user response：

With POSIX awk, I'd use match and the builtin RSTART and RLENGTH variables:

#!awk
{
    s = $0
    while (match(s, /[[:alpha:]]+/)) {
        word = substr(s, RSTART, RLENGTH)
        count[tolower(word)]++
        s = substr(s, RSTART + RLENGTH)
    }
}
END {
    for (word in count) print count[word], word
}

$ awk -f countwords.awk file
1 or
3 this
2 that

Works with the default BSD awk on my Mac.
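To illustrate what match() reports, the following sketch prints RSTART, RLENGTH and the matched text for the first word on a line. The helper name first_word is hypothetical, and [A-Za-z]+ is an assumed ASCII-only stand-in for [[:alpha:]]+:

```shell
# first_word (hypothetical helper): show where the first run of letters
# starts, how long it is, and what it contains.
first_word() {
    awk '{
        # match() returns the 1-based start of the leftmost longest match
        # (0 if there is none) and sets RSTART and RLENGTH as a side effect.
        if (match($0, /[A-Za-z]+/))
            print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
        else
            print "no match"
    }'
}

printf '%s\n' '  --this-that' | first_word
```

Here the output should be 5 4 this. In the loop above, substr(s, RSTART + RLENGTH) would then leave -that as the remainder to scan on the next iteration.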

CodePudding user response：

With your shown samples, please try the following awk code. It is written and tested in GNU awk, and uses the RS approach together with the gawk-specific RT variable.

awk -v RS='[[:alpha:]]+' '
RT{
  val[tolower(RT)]++
}
END{
  for(word in val){
    print val[word], word
  }
}
' Input_file

Explanation: RS is set to the regex [[:alpha:]]+, so each run of letters acts as a record separator. In the main block, RT holds the text that matched RS for the current record, which is the word itself; the val array is indexed by the lowercased RT and counts its occurrences. The END block traverses the array and prints each count with its word.

CodePudding user response：

With GNU awk using patsplit() and a second array for counting, you can try this:

awk 'patsplit($0, a, /[[:alpha:]]+/) {for (i in a) b[ tolower(a[i]) ]++} END {for (j in b) print b[j], j}' file
3 this
1 or
2 that