How to grep/perl/awk overlapping regex

Time:10-06

I'm trying to pipe a string into a grep/perl regex to pull out overlapping matches. So far, the results only contain sequential, non-overlapping matches, without any "lookback":

Attempt using egrep (both on GNU and BSD):

$ echo "bob mary mike bill kim jim john" | egrep -io "[a-z]+ +[a-z]+"
bob mary
mike bill
kim jim

Attempt using perl style grep (-P):

$ echo "bob mary mike bill kim jim john" | grep -oP "()[a-z]+ +[a-z]+"
bob mary
mike bill
kim jim

Attempt using awk, which shows only the first match:

$ echo "bob mary mike bill kim jim john" | awk 'match($0, /[a-z]+ +[a-z]+/) {print substr($0, RSTART, RLENGTH)}'
bob mary

The overlapping results I'd like to see from a simple working bash pipe command are:

bob mary
mary mike
mike bill
bill kim
kim jim
jim john

Any ideas?

CodePudding user response:

Lookahead is your friend here:

echo "bob mary mike bill kim jim john" | 
    perl -wnE'say "$1 $2" while /(\w+)\s+(?=(\w+))/g'

CodePudding user response:

You can also use awk

awk '{for(i=1;i<NF;i++) print $i,$(i+1)}' <<< 'bob mary mike bill kim jim john'

This solution iterates over all whitespace-separated fields except the last and prints the current field ($i), the output field separator (a space here), and the next field ($(i+1)).
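
Since the question asked for a pipe rather than a here-string, the same loop can also read from standard input. A small sketch, wrapped in a shell function whose name `adjacent_pairs` is a hypothetical choice and not part of the original answer:

```shell
# adjacent_pairs is a hypothetical helper name, not from the original answer.
# It reads lines from stdin and prints every pair of neighbouring fields.
adjacent_pairs() {
  awk '{ for (i = 1; i < NF; i++) print $i, $(i + 1) }'
}

echo "bob mary mike bill kim jim john" | adjacent_pairs
```

Because awk processes one record at a time, pairs are formed within each input line only; the last word of one line is not paired with the first word of the next.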

Or, another perl solution:

perl -lane 'while (/(?=\b(\p{L}+\s+\p{L}+))/g) {print $1}' <<< 'bob mary mike bill kim jim john'

Details:

  • (?= - start of a positive lookahead
    • \b - a word boundary
    • (\p{L}+\s+\p{L}+) - capturing group 1: one or more letters, one or more whitespaces, one or more letters
  • ) - end of the lookahead.

Here, only Group 1 values are printed ({print $1}).
