Why isn't this gawk gensub() behaving like regex101?

Time:06-15

I have a gawk script that includes this line:

$0 = gensub(/{\ \ (.+?)\ \ }/, "{\\\\textcolor{added}{\\1}", "g", $0);

On the following input line

- {  first phrase  } swiftly followed {  by a second one  }.

it produces:

- \textcolor{added}{first phrase  } swiftly followed {  by a second one}}

not what I'm expecting:

- \textcolor{added}{first phrase} swiftly followed \textcolor{added}{by a second one}}

When I run the same regex in regex101.com or in the Mac Expressions app, it works as expected. What am I missing?

CodePudding user response:

Notice which pairs of spaces were removed:

# from this: -                  {  first phrase  } swiftly followed {  by a second one  }.
#   to this: - \textcolor{added}{  first phrase  } swiftly followed {  by a second one  }}
                                 ^^                                                   ^^

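This confirms barmar's comment about awk being greedy with matches: awk regexes have no lazy quantifiers, so .+? does not stop at the first "  }" the way it does on regex101, and the match runs from the first "{  " to the last "  }" on the line.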

A couple small changes to the current code:

# current:
# $0 = gensub(/{\ \ (.+?)\ \ }/  , "{\\\\textcolor{added}{\\1}", "g", $0)

# new
  $0 = gensub(/{\ \ ([^}]*)\ \ }/,  "\\\\textcolor{added}{\\1}", "g", $0)

Where:

  • replaced .+? with [^}]* to emulate a non-greedy match: the capture can no longer run past the first closing brace
  • removed the leading { from the replacement string, since it doesn't show up in OP's expected output
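
To see the greediness in isolation, here is a minimal sketch using match(), which every POSIX awk provides, so it doesn't depend on gensub(). The shell variable names (line, greedy, bounded) are made up for this example, and [{] / [}] are used instead of bare braces only to avoid any interval-expression ambiguity:

```shell
line='- {  first phrase  } swiftly followed {  by a second one  }.'

# Greedy: .+ runs as far as it can, so the match ends at the LAST "  }".
greedy=$(printf '%s\n' "$line" |
  awk '{ if (match($0, /[{]  .+  [}]/)) print substr($0, RSTART, RLENGTH) }')

# Negated class: [^}]* cannot cross a "}", so the match ends at the FIRST "  }".
bounded=$(printf '%s\n' "$line" |
  awk '{ if (match($0, /[{]  [^}]*  [}]/)) print substr($0, RSTART, RLENGTH) }')

printf 'greedy : %s\nbounded: %s\n' "$greedy" "$bounded"
```

The first pattern captures everything from the first "{  " through the last "  }", while the second stops at "{  first phrase  }".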

Taking for a test drive:

echo '- {  first phrase  } swiftly followed {  by a second one  }.' |
gawk '{$0 = gensub(/{\ \ ([^}]*)\ \ }/, "\\\\textcolor{added}{\\1}", "g", $0)} 1'

This generates:

- \textcolor{added}{first phrase} swiftly followed \textcolor{added}{by a second one}.

CodePudding user response:

If you're using a non-GNU awk (one without gensub()) but want to emulate a similar feature, something like this works:

- {  first phrase  } swiftly followed {  by a second one  }.

mawk 'gsub(______, ___ "&" )       gsub(_____, __ "&" )     \
      gsub(__ "[^ (__ ___) "] " ___, (____)__  "&" ___)      \
      gsub((__)(__) _____)  "|" (___)(___) "[ ][ ]", _)      \
      gsub(__,   "\173  ")          gsub(___, "  \175")    1 ' FS='^$' \
              __='\6\31'  _____='[{][ ][ ]' ____='\134textcolor{added}{' \
             ___='\1\36' ______='[ ][ ][}]'

- \textcolor{added}{first phrase} swiftly followed \textcolor{added}{by a second one}.

Yes, it's very verbose, which is the unfortunate downside:

  • I had to play it safe by double-checking for an isolated { or } without a matching pair, and ensuring their original state is properly restored at the cleanup stage,

  • on top of wiping all remnants of the temporary separator bytes from the [[:cntrl:]] range, which were inserted in lieu of a costly array split.
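
For comparison, a more readable (though less defensive) sketch of the same idea for awks without gensub(): loop with match() and rebuild the line with substr(). It assumes the delimiters are exactly "{  " and "  }" (three characters each) and that the wrapped text never contains a }. The names result, out, rest and inner are made up for this example:

```shell
result=$(echo '- {  first phrase  } swiftly followed {  by a second one  }.' |
awk '{
    out  = ""      # text rebuilt so far
    rest = $0      # text still to be scanned
    while (match(rest, /[{]  [^}]*  [}]/)) {
        # Drop the 3-char opener "{  " and the 3-char closer "  }".
        inner = substr(rest, RSTART + 3, RLENGTH - 6)
        out   = out substr(rest, 1, RSTART - 1) "\\textcolor{added}{" inner "}"
        rest  = substr(rest, RSTART + RLENGTH)
    }
    print out rest
}')

printf '%s\n' "$result"
```

Unlike the version above, this does not protect unmatched braces with temporary separator bytes; it simply leaves anything that doesn't match the pattern untouched.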

