Regex pattern to fetch data inside square brackets

My Regex pattern:

^(?<timestamp>[a-zA-Z]{3} [0-9]{1,2} [0-9]{1,2}\:[0-9]{1,2}\:[0-9]{1,2}\.[0-9]{1,6}) (?<levelname>[A-Z]) ?(ERR:)? ?(?<source>\[.*\])? (?<message>.*)

Test_String_1 = "Oct 25 14:24:29.700799 I [System] Connected"

Output as per my regex pattern:

Match groups:
timestamp   Oct 25 14:24:29.700799
levelname   I
source      [System]
message     Connected

Test_String_2 = "Oct 25 14:24:30.315344 E ERR: [[Signal]] Valid Shared Mem!"

Output as per my regex pattern:

Match groups:
timestamp   Oct 25 14:24:30.315344
levelname   E
source      [[Signal]]
message     Valid Shared Mem!

However, I am expecting the results below for Test_String_1 and Test_String_2:

Test_String_1:

Match groups:
timestamp   Oct 25 14:24:29.700799
levelname   I
source      System    
message     Connected

Test_String_2:

Match groups:
timestamp   Oct 25 14:24:30.315344
levelname   E
source      Signal
message     Valid Shared Mem!

What changes should I make to my regex pattern to get the expected result? I'm using https://rubular.com/ for regex testing.

[Edit]: Test_String_3 = "Oct 25 14:24:29.653900 D Connection refused"

Expected output:

Match groups:
timestamp   Oct 25 14:24:29.653900
levelname   D
source  
message     Connection refused

CodePudding user response：

I think this is what you want:

^(?<timestamp>[a-zA-Z]{3} [0-9]{1,2} [0-9]{1,2}\:[0-9]{1,2}\:[0-9]{1,2}\.[0-9]{1,6}) (?<levelname>[A-Z]) ?(ERR:)? ?(\[*(?<source>\w*)\]*) (?<message>.*)

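The brackets should wrap the named source group rather than sit inside it: \[* and \]* consume the brackets, and only the word between them is captured as source.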
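A quick way to check this pattern against all three test strings is a short Python script. This is only a sketch: Python's re module does not accept Ruby's (?<name>...) syntax, so the groups are rewritten as (?P<name>...) and the unneeded backslashes before the colons are dropped. The parse helper and the LINES list are names made up for the illustration.

```python
import re

# The answer's pattern, with Ruby-style (?<name>...) rewritten as
# Python's (?P<name>...). Everything else is unchanged.
PATTERN = re.compile(
    r"^(?P<timestamp>[a-zA-Z]{3} [0-9]{1,2} [0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}\.[0-9]{1,6})"
    r" (?P<levelname>[A-Z]) ?(ERR:)? ?(\[*(?P<source>\w*)\]*) (?P<message>.*)"
)

LINES = [
    "Oct 25 14:24:29.700799 I [System] Connected",
    "Oct 25 14:24:30.315344 E ERR: [[Signal]] Valid Shared Mem!",
    "Oct 25 14:24:29.653900 D Connection refused",
]


def parse(line):
    """Return the named groups as a dict, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None


for line in LINES:
    fields = parse(line)
    print(fields["source"], "|", fields["message"])
# System | Connected
# Signal | Valid Shared Mem!
# Connection | refused
```

The first two lines come out as requested. For Test_String_3, however, the brackets are optional and \w* is greedy, so the first word of the message ("Connection") ends up in source. If lines without a source can occur, the whole bracketed part needs to be optional as a unit.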

CodePudding user response：

You can match the following regular expression.

^(?P<timestamp>[JFMASOND][a-z]{2} [0123]\d [012]\d(?::[0-5]\d){2}\.\d{6}\b) (?P<levelname>[A-Z]) +(?:[A-Z]+: +)?(?:\[+(?P<source>[A-Za-z]+)\]+)? *(?P<message>.+)

Notice that I've made the capture group source optional.

Depending on requirements, some adjustments may be needed. I assumed, for example, that the source capture group contains a single word made up of letters, and that any non-space text between levelname and source (or message) consists of one or more capital letters followed by a colon, as in the second example ('ERR:'). I've also made assumptions about how strictly the timestamp format must be specified and about which capture groups should be optional. These are of course just guesses, as the specification was not spelled out in the question.
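To see what these assumptions mean in practice, here is a minimal Python sketch. It assumes the one-line pattern uses + wherever the breakdown says ">= 1"; parse_line is a hypothetical helper name, and the line with a one-digit day is an invented input, not one of the question's examples.

```python
import re

# One-line pattern, assuming "+" wherever the breakdown says ">= 1".
LOG_RE = re.compile(
    r"^(?P<timestamp>[JFMASOND][a-z]{2} [0123]\d [012]\d(?::[0-5]\d){2}\.\d{6}\b)"
    r" (?P<levelname>[A-Z]) +(?:[A-Z]+: +)?(?:\[+(?P<source>[A-Za-z]+)\]+)?"
    r" *(?P<message>.+)"
)


def parse_line(line):
    """Return the named groups as a dict, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None


# Optional source: the group does not participate, so Python reports None.
fields = parse_line("Oct 25 14:24:29.653900 D Connection refused")
print(fields["source"], "|", fields["message"])
# None | Connection refused

# Strict timestamp: a one-digit day is rejected outright.
print(parse_line("Oct 5 14:24:29.653900 D Connection refused"))
# None
```

If one-digit days or sources containing digits or spaces can occur in the log, the day and source parts of the pattern would have to be relaxed accordingly.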

The regular expression can be broken down as follows. Note that I have put individual spaces in character classes ([ ]) merely to make them visible to the reader. I've tested this with Python, for which named capture groups are written (?P<name>...); it should work in Ruby as well once the groups are written (?<name>...), as in the question.

^                     # match beginning of string

(?P<timestamp>        # begin 'timestamp' capture group
  [JFMASOND]          # match a cap letter in the char class
  [a-z]{2}            # match two lowercase letters
  [ ]                 # match a space
  [0123]\d            # match a digit in the char class then any digit
  [ ]                 # match a space
  [012]\d             # match a digit in the char class then any digit
  (?:                 # begin a non-capture group
    :                 # match a colon
    [0-5]\d           # match a digit in the char class then any digit 
  ){2}                # end non-capture group and execute it twice
  \.                  # match a period
  \d{6}               # match 6 digits
  \b                  # match a word boundary
)                     # end timestamp capture group

[ ]                   # match a space

(?P<levelname>        # begin 'levelname' capture group
  [A-Z]               # match a capital letter
)                     # end 'levelname' capture group

[ ]+                  # match >= 1 spaces

(?:[A-Z]+:[ ]+)?      # optionally match >= 1 capital letters,
                      # a colon, then >= 1 spaces

(?:                   # begin non-capture group
  \[+                 # match one or more left brackets
  (?P<source>         # begin capture group 'source'
    [A-Za-z]+         # match >= 1 chars in char class
  )                   # end capture group 'source'
  \]+                 # match one or more right brackets
)?                    # end non-capture group and make optional

[ ]*                  # match >= 0 spaces

(?P<message>.+)       # match rest of line (>= 1 chars) and save
                      # to capture group 'message'
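The breakdown above can be used almost verbatim in Python by compiling it with re.VERBOSE, which ignores unescaped whitespace and treats # as the start of a comment. That is why the literal spaces are written as [ ]. The following is a sketch under the same assumptions as the one-line pattern; LOG_RE and LINES are names chosen for the illustration.

```python
import re

LOG_RE = re.compile(
    r"""
    ^
    (?P<timestamp>
      [JFMASOND][a-z]{2}      # month abbreviation
      [ ][0123]\d             # day of month
      [ ][012]\d              # hour
      (?::[0-5]\d){2}         # minutes and seconds
      \.\d{6}\b               # microseconds
    )
    [ ]
    (?P<levelname>[A-Z])
    [ ]+
    (?:[A-Z]+:[ ]+)?          # optional tag such as ERR:
    (?:\[+(?P<source>[A-Za-z]+)\]+)?   # optional bracketed source
    [ ]*
    (?P<message>.+)
    """,
    re.VERBOSE,
)

LINES = [
    "Oct 25 14:24:29.700799 I [System] Connected",
    "Oct 25 14:24:30.315344 E ERR: [[Signal]] Valid Shared Mem!",
    "Oct 25 14:24:29.653900 D Connection refused",
]

for line in LINES:
    m = LOG_RE.match(line)
    print(m.group("levelname"), m.group("source"), "|", m.group("message"))
# I System | Connected
# E Signal | Valid Shared Mem!
# D None | Connection refused
```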