Perl optional capture groups not working?

Time:10-13

I have the following sample.txt file:

2021-10-07 10:32:05,767 ERROR [LAWT2] blah.blah.blah - Message processing FAILED: <ExecutionReport blah="xxx" foo="yyy" SessionID="kkk" MoreStuff="zz"> Total time for which application threads were stopped: 0.0003858 seconds, Stopping threads took: 0.0000653 seconds
2021-10-07 10:31:32,902 ERROR [LAWT6] blah.blah.blah - Message processing FAILED: <NewOrderSingle SessionID="zkx" TargetSubID="ttt" Account="blah" MsgType="D" BookingTypeOverride="0" Symbol="6316" OtherField1="othervalue1" Otherfield2="othervalue2"/></D></NewOrderSingle>

I want to grab just two key fields: "SessionID" and "MsgType" and print like this:

SessionID="kkk"|
SessionID="zkx"|MsgType="D"

In other words: if the group match is not there, I want just to print blank.

I've tried the following approach but no luck:

$ perl -ne '/ (SessionID=".*?")? .*(MsgType=".*?")? / and print "$1|$2\n"' sample.txt
SessionID="kkk"|
SessionID="zkx"|

Can somebody enlighten me here? Thank you a lot.

CodePudding user response:

This isn't as easy as it seems:

/ (SessionID=".*?")? .*(MsgType=".*?")? /
                     ~~

The underlined .* is greedy: it consumes the rest of the line, including MsgType when it's present. Making the following group optional with ? doesn't help, because the engine only backtracks as far as it must to find the final space. The overall match then succeeds with the second group empty, so MsgType is never given back.

But it's possible using lookaround assertions:

/ (SessionID="[^"]*")? (?:(?!.*?MsgType)|.*? (MsgType=".*?")).* /

i.e. either there's no MsgType following the SessionID, or it's there and we capture it.

I wouldn't recommend using quantifiers on capture groups. Also, it looks like the log contains XML; what about extracting it and using a parser?

CodePudding user response:

You can use

perl -ne '/\h(SessionID="[^"]*")?(?:\h+.*(MsgType="[^"]*"))?\h/ and print "$1|$2\n"'

See the regex demo. Details:

  • \h - a horizontal whitespace
  • (SessionID="[^"]*")? - Group 1: an optional SessionID=", any zero or more chars other than ", and then a "
  • (?:\h+.*(MsgType="[^"]*"))? - an optional (but greedy) sequence of
    • \h+ - one or more horizontal whitespaces
    • .* - any zero or more chars other than line break chars, as many as possible
    • (MsgType="[^"]*") - Group 2: MsgType=", any zero or more chars other than ", and then a "
  • \h - a horizontal whitespace.

See the online demo:

s='2021-10-07 10:32:05,767 ERROR [LAWT2] blah.blah.blah - Message processing FAILED: <ExecutionReport blah="xxx" foo="yyy" SessionID="kkk" MoreStuff="zz"> Total time for which application threads were stopped: 0.0003858 seconds, Stopping threads took: 0.0000653 seconds
2021-10-07 10:31:32,902 ERROR [LAWT6] blah.blah.blah - Message processing FAILED: <NewOrderSingle SessionID="zkx" TargetSubID="ttt" Account="blah" MsgType="D" BookingTypeOverride="0" Symbol="6316" OtherField1="othervalue1" Otherfield2="othervalue2"/></D></NewOrderSingle>'
perl -ne '/\h(SessionID=".*?")?(?:\h+.*(MsgType=".*?"))?\h/ and print "$1|$2\n"' <<< "$s"

This prints:

SessionID="kkk"|
SessionID="zkx"|MsgType="D"

CodePudding user response:

Sorry, one point that I didn't mention in my question is that I was planning to extract multiple fields and print them in a specific order, so I ended up writing an awk script instead.

I'm putting it here in case someone else wants to use it (I'm dealing with thousands of rows in a log file, so a script is a good option).

#!/usr/bin/awk -f
function get_field(the_array, the_field, the_line){
  for (key in the_array) {
      if (the_array[key] ~ the_field){
          if (the_line == "")
              the_line = the_array[key]
          else
              the_line = the_line "|" the_array[key]
          break
      }
  }
  return the_line
}
BEGIN{
    the_line = ""
}
{
    the_line = ""
    delete the_keys
    for(f=1;f<=NF;f++){
        if (($f ~ "^(ClOrdID|Symbol|MsgType|SessionID|OrdStatus)=") && (the_keys[$f] == "")){
            if (the_line == "")
                the_line = $f
            else
                the_line = $f"|"the_line
            the_keys[$f]++
        }
    }
    arr[the_line]++
}
END{
    for(i in arr) {
        if (index(i, "|")){  # literal "|" test; as a regex, "|" matches every string
            the_line = ""
            split(i,aa,"|")
            # Print the fields in the correct order
            the_line = get_field(aa,"SessionID",the_line)
            the_line = get_field(aa,"ClOrdID",the_line)
            the_line = get_field(aa,"MsgType",the_line)
            the_line = get_field(aa,"OrdStatus",the_line)
            the_line = get_field(aa,"Symbol",the_line)
            print the_line
        } else {
            print(i)
        }
    }
}

Using it:

$ awk -f aa.awk sample.txt
SessionID="kkk"
SessionID="zkx"|MsgType="D"|Symbol="6316"