I'm trying to emulate GNU grep -Eo with a standard awk call.
The man page describes the -o option as follows:
-o, --only-matching
Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.
For now I have this code:
#!/bin/sh
regextract() {
    [ "$#" -ge 2 ] || return 1
    __regextract_ere=$1
    shift
    awk -v FS='^$' -v ERE="$__regextract_ere" '
        {
            while ( match($0,ERE) && RLENGTH > 0 ) {
                print substr($0,RSTART,RLENGTH)
                $0 = substr($0,RSTART + 1)
            }
        }
    ' "$@"
}
My question is: when the matched part is zero-length, do I need to keep trying to match the rest of the line, or should I move on to the next line (as I already do)? I can't find an input/regex pair that would need the former, but I feel like one might exist. Any ideas?
CodePudding user response:
Here's a POSIX awk version, which works with a* (or any POSIX awk regex):
echo abcaaaca |
awk -v regex='a*' '
{
    while (match($0, regex)) {
        # print only non-empty matches
        if (RLENGTH) print substr($0, RSTART, RLENGTH)
        # skip past the match, or one character after a zero-length match
        $0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
        if ($0 == "") break
    }
}'
Prints:
a
aaa
a
POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.
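To make the escape point concrete, here is a minimal sketch (the helper name tab_position is made up for illustration). It assumes a POSIX awk, where a -v assignment has its C escapes processed, so the regex variable holds a real tab by the time match() sees it:

```shell
#!/bin/sh
# Hypothetical helper: reports where the regex '\t' matches in "a<TAB>b".
# The -v assignment turns \t into a literal tab before awk uses it as a regex.
tab_position() {
    printf 'a\tb\n' | awk -v re='\t' '{ print match($0, re) }'
}

tab_position   # prints 2: the tab is the second character
```

A grep -E pattern written as '\t' is not guaranteed to mean the same thing, which is the incompatibility mentioned above.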
CodePudding user response:
Your code will malfunction for a regex that can match zero characters. Consider the following simple example. Let the content of file.txt be
1A2A3
then
grep -Eo 'A*' file.txt
gives the output
A
A
The condition of your while loop is match($0,ERE) && RLENGTH > 0. In this case the first part is true but the second is false, because the match found is the zero-length one before the first character (RSTART is set to 1), so the body of the while loop runs zero times.
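This can be reproduced with a small sketch. The function name question_loop is hypothetical, and only the loop condition is taken from the question; the body is never reached for this input, so its exact form does not matter here:

```shell
#!/bin/sh
# Hypothetical helper that uses the loop condition from the question.
question_loop() {
    awk -v ERE="$1" '{
        # A zero-length match at position 1 makes RLENGTH > 0 false,
        # so the loop exits before printing anything.
        while (match($0, ERE) && RLENGTH > 0) {
            print substr($0, RSTART, RLENGTH)
            $0 = substr($0, RSTART + 1)   # never reached for this input
        }
    }'
}

printf '1A2A3\n' | question_loop 'A*'   # prints nothing
```

The two A matches that grep -Eo reports are missed because the loop gives up at the first empty match instead of stepping over one character.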
CodePudding user response:
If you can consider a gnu-awk solution, then using RS and RT may give behavior identical to grep -Eo.
# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20
Using grep -Eo:
grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
Using gnu-awk with RS and RT and the same regex:
awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
More examples:
grep -Eo '\<[[:digit:]]+' file
11
39
20
awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20
CodePudding user response:
Thanks to the comments and the various answers, I think I now have working code (tested on AIX/Solaris/FreeBSD/macOS/Linux).
#!/bin/sh
regextract() {
    [ "$#" -ge 1 ] || return 1
    [ "$#" -eq 1 ] && set -- "$1" -
    awk -v FS='^$' '
        BEGIN {
            ere = ARGV[1]
            delete ARGV[1]
        }
        {
            while ( NF && match($0,ere) ) {
                if (RLENGTH)
                    print substr($0,RSTART,RLENGTH)
                $0 = substr($0,RSTART + (RLENGTH ? RLENGTH : 1))
            }
        }
    ' "$@"
}
notes:
- I pass the ERE string as a file argument so that awk doesn't pre-process it; C-style escape sequences will still be supported though (see @dan's answer).
- Because assigning $0 resets the values of all fields, I chose FS = '^$' to limit this overhead.
- NF is always true unless $0 is empty.
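The first note can be checked with a small sketch. The helper names via_v and via_argv are made up for illustration, and a POSIX awk is assumed: a -v assignment goes through escape processing, while a string read from ARGV arrives unchanged.

```shell
#!/bin/sh
# Hypothetical helpers: hand the same string to awk in two different ways.
via_v()    { awk -v s="$1" 'BEGIN { print s }'; }
via_argv() { awk 'BEGIN { print ARGV[1] }' "$1"; }

via_v    '\\t'   # escape processing applied: prints \t
via_argv '\\t'   # passed through untouched: prints \\t
```

With -v, the doubled backslash is already collapsed before the string is used as a regex, so the caller would have to add one more level of escaping; going through ARGV avoids that.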
simple examples:
echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX
printf '%s\n' '\a[a]' | regextract '\[a]'
[a]
printf '%s\n' '.\t.' | regextract '\\t'
\t
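One caveat that may be worth testing, and that the thread does not discuss: because the loop truncates $0 after each match, an ERE anchored with ^ can match again at the start of every remainder. The sketch below repeats the function from this answer so that it is self-contained; the claim about grep is an expectation, not something verified here.

```shell
#!/bin/sh
# Same function as in this answer, repeated so the snippet runs on its own.
regextract() {
    [ "$#" -ge 1 ] || return 1
    [ "$#" -eq 1 ] && set -- "$1" -
    awk -v FS='^$' '
        BEGIN {
            ere = ARGV[1]
            delete ARGV[1]
        }
        {
            while ( NF && match($0,ere) ) {
                if (RLENGTH)
                    print substr($0,RSTART,RLENGTH)
                # After this assignment, ^ matches at the new start of $0.
                $0 = substr($0,RSTART + (RLENGTH ? RLENGTH : 1))
            }
        }
    ' "$@"
}

echo aaa | regextract '^a'   # prints a three times
```

I would expect grep -Eo '^a' to report a single a for the same input, so if anchored patterns matter, compare against your own grep before relying on the emulation.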