I need to separate (group) the components from lines of text that are essentially composed of a key and value.
The first whole word is the key, followed by an arbitrary amount of whitespace. Everything after that whitespace is the value, including any internal whitespace, but ideally not trailing whitespace.
The key can also have leading whitespace.
Additionally, the value may be absent (null), in which case the string should still match, with a single capture group for the key.
Example
Key: "key"
Value: "I am a value"
Test Cases (underscores represent spaces):
key
key___
___key___
key___I_am_a_value
__key___I_am_a_value
__key______I_am_a_value_______
In all of these situations I'd like to end up with two capture groups, containing the key and the value exactly as they are shown between the quotes above, with the second group being null when no value is present.
To clarify, in this situation I'm using whitespace to refer to spaces and tabs, but not line breaks.
This seems pretty close, except that it still includes trailing whitespace in the value and I'm not sure how to drop that:
(?<key>\w+)(?:[ \t]*(?<value>.*))
As a final example to highlight this issue, with the above and this test string (again '_' = ' '):
____people_________john_jim_jen_josh____
I'm getting
key: "people"
value: "john jim jen josh "
when I want:
key: "people"
value: "john jim jen josh"
CodePudding user response:
The problem here is the .* which simply matches until the end of the line. Use something like \S+ for a run of non-whitespace, or (?:[ \t]*[^ \t]+)* to match in batches of whitespace followed by non-whitespace. Because every batch has to end in a non-whitespace character, trailing whitespace is left out of the group. I think you would need to exclude the trailing whitespace explicitly, like this:
(?<key>\w+)[ \t]+(?<value>(?:[ \t]*[^ \t]+)*)
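A quick way to check this pattern is a minimal Python sketch like the one below. Python spells named groups as (?P<name>...), so the pattern is rewritten in that syntax; the helper name split_line and the sample strings are only illustrative assumptions, not part of the answer. One thing the check seems to show: because [ \t]+ after the key is mandatory, a bare key with no whitespace after it does not match at all.

```python
import re

# The answer's pattern, respelled with Python's (?P<name>...) group syntax.
PATTERN = re.compile(r"(?P<key>\w+)[ \t]+(?P<value>(?:[ \t]*[^ \t]+)*)")


def split_line(line):
    """Return (key, value), or None when the line does not match.

    Hypothetical helper: an empty value is reported as None.
    """
    m = PATTERN.search(line)
    if m is None:
        return None
    return m.group("key"), m.group("value") or None


if __name__ == "__main__":
    samples = [
        "    people         john jim jen josh    ",
        "key   ",
        "key",  # no whitespace after the key
    ]
    for line in samples:
        print(repr(line), "->", split_line(line))
```

If a bare key has to match as well, the whitespace after the key would need to be optional.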
CodePudding user response:
You could use
(?<key>\w+)(?!\S)[^\S\r\n]*(?<value>(?:\S+(?:[^\S\r\n]+\S+)*)*)
Explanation
(?<key>\w+)
    Group key, match 1 or more word chars
(?!\S)
    Negative lookahead, assert a whitespace boundary to the right
[^\S\r\n]*
    Match optional spaces without newlines
(?<value>
    Group value
(?:
    Non capture group
\S+
    Match 1 or more non whitespace chars
(?:[^\S\r\n]+\S+)*
    Optionally repeat 1 or more spaces without newlines followed by 1 or more non whitespace chars
)*
    Close the non capture group and optionally repeat
)
    Close group value
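To try this pattern against the test cases from the question, here is a minimal Python sketch. It assumes Python's (?P<name>...) spelling for named groups, and the helper name parse_line is hypothetical. Note that the value group always takes part in the match and captures an empty string when there is no value, so the sketch maps the empty string to None to get the "null" the question asks for.

```python
import re

# Same pattern as above, using Python's (?P<name>...) named-group syntax.
PATTERN = re.compile(
    r"(?P<key>\w+)(?!\S)[^\S\r\n]*"
    r"(?P<value>(?:\S+(?:[^\S\r\n]+\S+)*)*)"
)


def parse_line(line):
    """Return (key, value) for one line, or None if no key is found.

    Hypothetical helper: an empty value group is reported as None.
    """
    m = PATTERN.search(line)
    if m is None:
        return None
    return m.group("key"), m.group("value") or None


# Test cases from the question; underscores stand for spaces.
CASES = [
    "key",
    "key___",
    "___key___",
    "key___I_am_a_value",
    "__key___I_am_a_value",
    "__key______I_am_a_value_______",
    "____people_________john_jim_jen_josh____",
]

if __name__ == "__main__":
    for case in CASES:
        # Convert the underscore notation back to real spaces before matching.
        print(case, "->", parse_line(case.replace("_", " ")))
```

The sketch uses search rather than a full-line match so that leading whitespace before the key is simply skipped.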