I have a text like this, which can be different every time I get it. It can contain the same keys you see here or different keys. Sometimes some keys are not used at all:
FVPP21LPWU_1810301359 Page 1
FVPP21 LPWU 334230
VSHUAK1
DD ADVISORY
DTG: 20081218/1233Z
PSN: N5810 W11923
AREA: ALASKA PENINSULA
SUMMIT ELEV: 8225 FT (2507 M)
ADVISORY NR: 2018/013
COLOR CODE: ORANGE
DETAILS: EMISSIONS CONTINUE
OBS VA DTG: 30/2331Z
OBS VA CLD: LOW LEVEL EMISSIONS CONTINUE. COLOR CODE: NA
FCST GA CLD 6HU: 31/0531Z NO EXP.
FCST GA CLD 12HU: 31/1131Z NO TY EXP.
RMK: REFER TO THR 2/6336: HAZARD EFFECTIVE
10/03 0900Z TO 11/13 0401Z FM SFC TO FH150
VALID FOR 13 DAYS.
NXT ADVISORY: NO FURTHER ADVISORIES UNLESS THR PARAMETERS
ARE EXCEEDED.
DH NOV 2008 AAWU
I need to parse the keys and values.
A key can be a single word, several words, or a combination of words, numbers and spaces. A value can be a single-line or multiline string, and it can contain words that are also used as keys (for example "COLOR CODE: NA") or words/numbers followed by a colon; those substrings must not be parsed as key-value pairs.
The best I can do is this regex:
^([A-Z\s0-9\ ]{1,}\:\s)([A-Z0-9\s\(\)\/\-.]{1,})\n
but some keys are not parsed, and the text before DTG: gets matched even though it should not be.
Here is the example: https://regex101.com/r/8TSoIk/1
CodePudding user response:
You might use:
^([A-Z 0-9 ]+): (.*(?:\n(?![A-Z 0-9 ]+:).*)*)
^
Start of the string, or of each line when the multiline flag is set
([A-Z 0-9 ]+):
Capture group 1: match one or more of the listed characters, followed by a colon and a space
(
Capture group 2
.*
Match the rest of the line
(?:
Non capture group
\n(?![A-Z 0-9 ]+:).*
Match a newline and the rest of the line if it does not start with a key-like pattern
)*
Close the non capture group and optionally repeat it
)
Close group 2
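As a rough illustration, the pattern can be tried in Python on a shortened excerpt of the advisory from the question. This is only a sketch: it assumes the character classes are meant to carry a + quantifier, that the multiline flag is set, and that collecting the pairs into a dict (the name pairs is just for this example) is acceptable because keys do not repeat.

```python
import re

# The answer's pattern, assuming "+" quantifiers on the character classes.
# re.MULTILINE lets "^" match at the start of every line.
PATTERN = re.compile(
    r"^([A-Z 0-9 ]+): (.*(?:\n(?![A-Z 0-9 ]+:).*)*)",
    re.MULTILINE,
)

# Shortened excerpt of the advisory text from the question.
text = "\n".join([
    "FVPP21 LPWU 334230",
    "DD ADVISORY",
    "DTG: 20081218/1233Z",
    "OBS VA CLD: LOW LEVEL EMISSIONS CONTINUE. COLOR CODE: NA",
    "RMK: REFER TO THR 2/6336: HAZARD EFFECTIVE",
    "10/03 0900Z TO 11/13 0401Z FM SFC TO FH150",
    "VALID FOR 13 DAYS.",
    "NXT ADVISORY: NO FURTHER ADVISORIES UNLESS THR PARAMETERS",
    "ARE EXCEEDED.",
])

# Group 1 is the key, group 2 the (possibly multiline) value.
pairs = {m.group(1): m.group(2) for m in PATTERN.finditer(text)}

for key, value in pairs.items():
    print(f"{key!r} -> {value!r}")
```

The header lines before DTG are skipped because they contain no colon, "COLOR CODE: NA" stays inside the OBS VA CLD value because it is not at the start of a line, and the RMK value keeps its two continuation lines because neither starts with a key-like prefix.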
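Note that \s can also match a newline, so a key class that contains \s (as in the regex from the question) can run across several lines. The pattern above uses a literal space in the character class instead.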
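A small Python check can make the effect of \s visible. It is a sketch on a two-line excerpt, and the simplified key patterns below are stand-ins written for this example rather than the full regexes.

```python
import re

# Two lines from the advisory: a header line followed by the first real key.
sample = "DD ADVISORY\nDTG: 20081218/1233Z"

# Key class containing \s: the newline is allowed inside the key,
# so the match starts on the header line and runs into "DTG".
with_s = re.search(r"^([A-Z\s0-9]+): ", sample, re.MULTILINE)

# Key class containing a literal space only: the key must fit on one line.
with_space = re.search(r"^([A-Z 0-9]+): ", sample, re.MULTILINE)

print(repr(with_s.group(1)))
print(repr(with_space.group(1)))
```

With \s the captured key is the header line plus DTG, which is the unwanted match on the text before DTG described in the question. With a literal space the key is just DTG.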
Or a bit broader match:
^([^\n:]+): (.*(?:\n(?![^\n:]+:).*)*)
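To show what "broader" means in practice, here is a hedged Python comparison of the two patterns, again assuming + quantifiers on the character classes. The input is made up for this example (the key "Summit Elev (ft)" and the line "REF 2/6336: HAZARD EFFECTIVE" are hypothetical, not taken from the advisory).

```python
import re

# Both patterns, assuming "+" quantifiers on the character classes.
STRICT = re.compile(r"^([A-Z 0-9 ]+): (.*(?:\n(?![A-Z 0-9 ]+:).*)*)", re.MULTILINE)
BROAD = re.compile(r"^([^\n:]+): (.*(?:\n(?![^\n:]+:).*)*)", re.MULTILINE)

# Hypothetical input: a key with lowercase letters and parentheses,
# and a continuation line that happens to contain a colon.
text = "\n".join([
    "Summit Elev (ft): 8225",
    "RMK: SEE NOTE",
    "REF 2/6336: HAZARD EFFECTIVE",
    "VALID FOR 13 DAYS.",
])

# findall returns one (key, value) tuple per match.
strict = STRICT.findall(text)
broad = BROAD.findall(text)

print(strict)
print(broad)
```

The broader pattern accepts any key that has no colon in it, so it picks up "Summit Elev (ft)", which the stricter pattern skips. The trade-off is that any line with a colon followed by a space now starts a new pair, so the REF line is split off from the RMK value instead of being kept as a continuation.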