Need regex to parse colon separated key-value pairs with multiline

Time:12-29

I have a text like this, which can be different every time I get it. It can contain the same keys you see here or different keys. Sometimes some keys are not used at all:

FVPP21LPWU_1810301359                                                         Page 1
FVPP21 LPWU 334230
VSHUAK1
DD ADVISORY
DTG: 20081218/1233Z
PSN: N5810 W11923
AREA: ALASKA PENINSULA
SUMMIT ELEV: 8225 FT (2507 M)
ADVISORY NR: 2018/013
COLOR CODE: ORANGE
DETAILS: EMISSIONS CONTINUE
OBS VA DTG: 30/2331Z
OBS VA CLD: LOW LEVEL EMISSIONS CONTINUE. COLOR CODE: NA
FCST GA CLD  6HU: 31/0531Z NO EXP.
FCST GA CLD  12HU: 31/1131Z NO TY EXP.
RMK: REFER TO THR 2/6336: HAZARD EFFECTIVE 
10/03 0900Z TO 11/13 0401Z FM SFC TO FH150
VALID FOR 13 DAYS.
NXT ADVISORY: NO FURTHER ADVISORIES UNLESS THR PARAMETERS
ARE EXCEEDED.
DH NOV 2008 AAWU

I need to parse key and value.

A key can be a single word, several words, or a combination of words, numbers and spaces. A value can be a single-line or multiline string. It can contain words already used as keys (for example "COLOR CODE: NA") or words/numbers separated by a colon; those substrings must not be parsed as key-value pairs.

The best I can do is this regex:

^([A-Z\s0-9\ ]{1,}\:\s)([A-Z0-9\s\(\)\/\-.]{1,})\n

but some keys are not parsed, while the text before DTG: is matched even though it should not be parsed.

Here the example: https://regex101.com/r/8TSoIk/1

CodePudding user response:

You might use:

^([A-Z 0-9]+): (.*(?:\n(?![A-Z 0-9]+:).*)*)
  • ^ Start of the line (with the multiline flag enabled)
  • ([A-Z 0-9]+) Capture group 1, match 1 or more of the listed characters
  • :  Match a colon followed by a space
  • ( Capture group 2
    • .* Match the rest of the line
    • (?: Non capture group
      • \n(?![A-Z 0-9]+:).* Match a newline and the rest of the line if it does not start with a key-like pattern
    • )* Close the non capture group and optionally repeat it
  • ) Close group 2
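
As a rough illustration, the pattern could be applied in Python as follows. This is only a sketch under assumptions: the character classes are assumed to carry a `+` quantifier, `re.MULTILINE` is assumed so that `^` matches at the start of every line, and `parse_advisory` and `SAMPLE` (a shortened copy of the text from the question) are hypothetical names, not part of the original answer.

```python
import re

# Key: uppercase letters, digits and spaces at the start of a line, then ": ".
# Value: the rest of that line plus any following lines that do not
# themselves start with a key-like pattern.
KEY_VALUE = re.compile(
    r"^([A-Z 0-9]+): (.*(?:\n(?![A-Z 0-9]+:).*)*)",
    re.MULTILINE,  # assumption: needed so ^ matches at every line start
)


def parse_advisory(text):
    """Hypothetical helper: map each key to its (possibly multiline) value."""
    # strip() drops a trailing newline that the last value may pick up.
    return {m.group(1): m.group(2).strip() for m in KEY_VALUE.finditer(text)}


# Shortened sample based on the text in the question.
SAMPLE = """\
VSHUAK1
DD ADVISORY
DTG: 20081218/1233Z
OBS VA CLD: LOW LEVEL EMISSIONS CONTINUE. COLOR CODE: NA
RMK: REFER TO THR 2/6336: HAZARD EFFECTIVE
10/03 0900Z TO 11/13 0401Z FM SFC TO FH150
VALID FOR 13 DAYS.
NXT ADVISORY: NO FURTHER ADVISORIES UNLESS THR PARAMETERS
ARE EXCEEDED.
"""

result = parse_advisory(SAMPLE)
for key, value in result.items():
    print(repr(key), "->", repr(value))
```

Two caveats with this sketch: a dict keeps only the last value if a key repeats, and any trailing line that does not look like a key (such as "DH NOV 2008 AAWU" in the full text) is appended to the last value.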


Note that \s can also match a newline.
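
A small Python check of that point, assuming a simplified key pattern and a three-line excerpt of the question's text (both chosen here for illustration): with `\s` in the character class the key can run across line breaks and swallow the header lines, while a literal space keeps it on one line.

```python
import re

text = "VSHUAK1\nDD ADVISORY\nDTG: 20081218/1233Z\n"

# \s matches newlines, so the key can span several lines.
with_s = re.search(r"^([A-Z\s0-9]+): ", text, re.MULTILINE)
# A literal space cannot cross a line break.
with_space = re.search(r"^([A-Z 0-9]+): ", text, re.MULTILINE)

print(repr(with_s.group(1)))      # 'VSHUAK1\nDD ADVISORY\nDTG'
print(repr(with_space.group(1)))  # 'DTG'
```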

Or a bit broader match:

^([^\n:]+): (.*(?:\n(?![^\n:]+:).*)*)
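
To show what "broader" could mean in practice, here is a hedged Python comparison of the two patterns. The `+` quantifiers and the `re.MULTILINE` flag are assumptions, and the input text with the mixed-case key "Color-Code" is invented for illustration.

```python
import re

# Narrow: keys limited to uppercase letters, digits and spaces.
NARROW = re.compile(r"^([A-Z 0-9]+): (.*(?:\n(?![A-Z 0-9]+:).*)*)", re.MULTILINE)
# Broad: a key is any run of characters other than a newline or a colon.
BROAD = re.compile(r"^([^\n:]+): (.*(?:\n(?![^\n:]+:).*)*)", re.MULTILINE)

# Invented input: "Color-Code" contains lowercase letters and a hyphen.
text = "Color-Code: ORANGE\nRMK: VALID FOR\n13 DAYS."

narrow_keys = [m.group(1) for m in NARROW.finditer(text)]
broad_keys = [m.group(1) for m in BROAD.finditer(text)]

print(narrow_keys)  # ['RMK']
print(broad_keys)   # ['Color-Code', 'RMK']
```

The trade-off is that the broad lookahead ends a value at any following line that contains a colon, so a continuation line such as "SEE 12:30 REPORT" would be cut off from the value before it.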
