stop condition for emulating "grep -oE" with awk

I'm trying to emulate GNU grep -Eo with a standard awk call.

The man page describes the -o option as follows:

-o --only-matching
Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.

For now I have this code:

#!/bin/sh

regextract() {
    [ "$#" -ge 2 ] || return 1
    __regextract_ere=$1
    shift
    awk -v FS='^$' -v ERE="$__regextract_ere" '
        {
            while ( match($0,ERE) && RLENGTH > 0 ) {
                print substr($0,RSTART,RLENGTH)
                $0 = substr($0,RSTART+1)
            }
        }
    ' "$@"
}

My question is: when the match is zero-length, do I need to keep trying to match the rest of the line, or should I move on to the next line (as I do now)? I can't find a regex and input that would require the former, but I suspect one exists. Any ideas?

CodePudding user response：

Here's a POSIX awk version, which works with a* (or any POSIX awk regex):

echo abcaaaca |
awk -v regex='a*' '
{
    while (match($0, regex)) {
        if (RLENGTH) print substr($0, RSTART, RLENGTH)
        $0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
        if ($0 == "") break
    }
}'

Prints:

a
aaa
a

POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.
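
To illustrate the awk side of that difference, here is a minimal sketch. The function name tab_pos is made up for this example; it only shows that an awk regex understands the C escape \t as a tab character.

```shell
# Hypothetical helper: feed awk a line containing a real tab and ask where
# the regex /\t/ matches. awk interprets \t inside the regex as a tab.
tab_pos() {
    printf 'a\tb\n' | awk '{ print match($0, /\t/) }'
}

tab_pos   # prints 2, the position of the tab
```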

CodePudding user response：

Your code malfunctions when the regex can match zero characters. Consider the following simple example. Let file.txt contain

1A2A3

then

grep -Eo 'A*' file.txt

gives output

A
A

The condition of your while loop is match($0,ERE) && RLENGTH > 0. Here the first part is true, but the second is false, because the first match found is the zero-length match before the first character (RSTART is set to 1 and RLENGTH to 0). The loop body therefore runs zero times and nothing is printed.
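
A small sketch makes this visible. The helper name first_match_info is hypothetical; it just prints what match() stores in RSTART and RLENGTH for the first match on one line.

```shell
# Hypothetical helper: report RSTART and RLENGTH for the first match of the
# ERE in $1 against the single line of text in $2.
first_match_info() {
    printf '%s\n' "$2" |
    awk -v ERE="$1" '{ match($0, ERE); print RSTART, RLENGTH }'
}

first_match_info 'A*' '1A2A3'   # prints "1 0": an empty match at position 1
first_match_info 'A+' '1A2A3'   # prints "2 1": the first non-empty match
```

With A* the first match is empty, so a loop guarded by RLENGTH > 0 stops before it reaches either A.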

CodePudding user response：

If you can consider a GNU awk solution, then using RS and RT can give behavior identical to grep -Eo.

# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20

Using grep -Eo:

grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20

Using gnu-awk with RS and RT using same regex:

awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20

More examples:

grep -Eo '\<[[:digit:]]+' file
11
39
20

awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20

CodePudding user response：

Thanks to the comments and the various answers, I think I now have working code (tested on AIX/Solaris/FreeBSD/macOS/Linux).

#!/bin/sh
  
regextract() {

    [ "$#" -ge 1 ] || return 1
    [ "$#" -eq 1 ] && set -- "$1" -

    awk -v FS='^$' '
        BEGIN {
            ere = ARGV[1]
            delete ARGV[1]
        }
        {
            while ( NF && match($0,ere) ) {
                if (RLENGTH)
                    print substr($0,RSTART,RLENGTH)
                $0 = substr($0,RSTART+(RLENGTH ? RLENGTH : 1))
            }
        }
    ' "$@"
}

notes:

I pass the ERE string as a file argument so that awk doesn't pre-process it; C-style escape sequences are still supported inside the regex, though (see @dan's answer).
Because assigning to $0 re-splits the record into fields, I chose FS='^$' to limit that overhead.
With this FS, NF is non-zero unless $0 is empty.
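
The first note can be checked with a short sketch. The helper names via_v and via_argv are made up for this illustration; they print the same shell string after it reaches awk through a -v assignment and through ARGV.

```shell
# -v assignment: awk processes escape sequences in the value,
# so the two backslashes collapse into one.
via_v() { awk -v s="$1" 'BEGIN { print s }'; }

# Operand read through ARGV: the string arrives unchanged.
# Only a BEGIN block is used, so awk never tries to open it as a file.
via_argv() { awk 'BEGIN { print ARGV[1] }' "$1"; }

via_v    'a\\b'   # prints a\b
via_argv 'a\\b'   # prints a\\b
```

This is why the function reads the ERE from ARGV[1] and then deletes that element so awk does not treat it as an input file.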

simple examples:

echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX

printf '%s\n' '\a[a]' | regextract '\[a]'
[a]

printf '%s\n' '.\t.' | regextract '\\t'
\t