stop condition for emulating "grep -oE" with awk

Time:07-26

I'm trying to emulate GNU grep -Eo with a standard awk call.

The man page says this about the -o option:

-o --only-matching
     Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.

For now I have this code:

#!/bin/sh

regextract() {
    [ "$#" -ge 2 ] || return 1
    __regextract_ere=$1
    shift
    awk -v FS='^$' -v ERE="$__regextract_ere" '
        {
            while ( match($0,ERE) && RLENGTH > 0 ) {
                print substr($0,RSTART,RLENGTH)
                $0 = substr($0,RSTART+1)
            }
        }
    ' "$@"
}

My question is: In the case that the matching part is 0-length, do I need to continue trying to match the rest of the line or should I move to the next line (like I already do)? I can't find a sample of input regex that would need the former but I feel like it might exist. Any idea?

CodePudding user response:

Here's a POSIX awk version, which works with a* (or any POSIX awk regex):

echo abcaaaca |
awk -v regex='a*' '
{
    while (match($0, regex)) {
        if (RLENGTH) print substr($0, RSTART, RLENGTH)
        $0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
        if ($0 == "") break
    }
}'

Prints:

a
aaa
a

POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.
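
To make the escape difference concrete, here is a minimal sketch. The helper name show_match is made up for this illustration and is not part of any answer. It passes the two-character string \t as the regex; awk treats it as a tab, which is the behavior you would have to neutralize for strict grep -E compatibility.

```shell
#!/bin/sh

# Hypothetical helper: report where the ERE given as $1 first matches
# in each line read from stdin.
show_match() {
    awk -v re="$1" '{
        if (match($0, re))
            printf "RSTART=%d RLENGTH=%d\n", RSTART, RLENGTH
        else
            print "no match"
    }'
}

# The input line is: a, a literal tab, b.
printf 'a\tb\n' | show_match '\t'
```

awk reports RSTART=2 RLENGTH=1 here, i.e. it matched the tab character rather than a backslash or the letter t.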

CodePudding user response:

Your code will malfunction for a regex that can match zero or more characters. Consider the following simple example; let the content of file.txt be

1A2A3

then

grep -Eo 'A*' file.txt

gives output

A
A

Your while condition is match($0,ERE) && RLENGTH > 0. In this case the first part is true, but the second is false, because the match found is the zero-length one before the first character (RSTART is set to 1 and RLENGTH to 0). The body of the while loop therefore runs zero times.
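
The values involved can be checked directly with a small sketch. The helper name probe_match is hypothetical; it calls match() once per line and prints what it reports, without any loop.

```shell
#!/bin/sh

# Hypothetical helper: print what match() reports for the ERE in $1
# on each input line, without looping.
probe_match() {
    awk -v ere="$1" '{
        found = match($0, ere)
        printf "match=%d RSTART=%d RLENGTH=%d\n", found, RSTART, RLENGTH
    }'
}

echo 1A2A3 | probe_match 'A*'
```

This prints match=1 RSTART=1 RLENGTH=0: the match succeeds, but with zero length at position 1, so a loop guarded by RLENGTH > 0 stops before reaching either A.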

CodePudding user response:

If you can consider a GNU awk solution, then using RS and RT may give behavior identical to grep -Eo.

# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20

Using grep -Eo:

grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20

Using GNU awk with RS and RT and the same regex:

awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20

More examples:

grep -Eo '\<[[:digit:]]+' file
11
39
20

awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20

CodePudding user response:

Thanks to the comments and the various answers, I think I now have working code (tested on AIX/Solaris/FreeBSD/macOS/Linux).

#!/bin/sh
  
regextract() {

    [ "$#" -ge 1 ] || return 1
    [ "$#" -eq 1 ] && set -- "$1" -

    awk -v FS='^$' '
        BEGIN {
            ere = ARGV[1]
            delete ARGV[1]
        }
        {
            while ( NF && match($0,ere) ) {
                if (RLENGTH)
                    print substr($0,RSTART,RLENGTH)
                $0 = substr($0,RSTART+(RLENGTH ? RLENGTH : 1))
            }
        }
    ' "$@"
}

notes:

  • I pass the ERE string as a file argument so that awk doesn't pre-process it; C-style escape sequences will still be supported in the regex, though (see @dan's answer).

  • Because assigning to $0 resets the values of all fields, I chose FS='^$' to limit this overhead.

  • NF is always true unless $0 is empty
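
The first note can be checked with a short sketch. The helper name escape_lengths is made up for this illustration; it hands the same string to awk once through -v and once as an operand, then prints the length awk sees in each case. Only a BEGIN block is used, so awk never tries to open the operand as a file.

```shell
#!/bin/sh

# Hypothetical helper: print the length of the same string as awk sees it
# through a -v assignment and through ARGV.
escape_lengths() {
    awk -v viaopt="$1" 'BEGIN {
        printf "-v: %d chars, ARGV: %d chars\n", length(viaopt), length(ARGV[1])
    }' "$1"
}

# Three characters on the shell side: backslash, backslash, t.
escape_lengths '\\t'
```

This prints "-v: 2 chars, ARGV: 3 chars": the -v assignment collapses the doubled backslash, while the ARGV element arrives untouched.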

simple examples:
echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX

printf '%s\n' '\a[a]' | regextract '\[a]'
[a]

printf '%s\n' '.\t.' | regextract '\\t'
\t