Currently I have a string that I want to parse and pick up certain values.
The current regex findall pattern that I have is:
re.findall(r'(?P<key>\w+)\s+(?P<value>\w+)', string)
With this regex findall pattern I can pick up the key and values of the following:
--key1=value1 --key2=value2
But if the value is a string with spaces, it doesn't pick it up. Examples that don't work:
--key1=this is value 1 --key2=value2
--key1=only kvp
--key1=this/doesnt/work/
How can I adjust the regex pattern to pick up the string after the = sign?
CodePudding user response:
I started by changing your regex to --(?P<key>\w+)=(?P<value>\w+). This way, it uses "=" instead of whitespace as the separator between key and value. It also requires "--" to precede the key, which seems to be a rule in your data.
Now let's tackle the main problem, which is to capture as a value everything after the "=" sign up to, but not including, the next key.
This can be done in three steps:
1. Change the regex for the value from \w+ to .+ because you want to capture all characters, so you cannot limit yourself to just \w. A dot (.) matches any character except a newline. Of course, this change causes a new problem: the value will now contain everything that follows the key, even if it is "value1 --key2=value2". This will be fixed in the remaining two steps.
2. Make the regex non-greedy. Change the regex for the value from .+ to .+? and it will capture as few characters as it can instead of as many. This still doesn't solve the problem, because the regex will now capture only one character of the value. We are a step closer, though.
3. Make the regex keep capturing the value until it reaches the next key or the end of the string. Add (?=$|\s--) at the end. (?=...) is a positive lookahead: what it contains must follow the current position, but it is not part of the match itself. $|\s-- is an alternation of either the end of the string or a whitespace character followed by two dashes.
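To make the effect of each step visible, here is a minimal sketch that runs the intermediate patterns against one of the failing inputs from the question. The names sample, base, greedy, lazy and final are only illustrative.

```python
import re

# One of the inputs from the question whose value contains spaces.
sample = "--key1=this is value 1 --key2=value2"

# Starting point: \w+ stops at the first non-word character.
base = re.findall(r'--(?P<key>\w+)=(?P<value>\w+)', sample)
print(base)    # [('key1', 'this'), ('key2', 'value2')]

# Step 1: .+ is greedy and swallows the rest of the string.
greedy = re.findall(r'--(?P<key>\w+)=(?P<value>.+)', sample)
print(greedy)  # [('key1', 'this is value 1 --key2=value2')]

# Step 2: .+? is lazy and settles for a single character.
lazy = re.findall(r'--(?P<key>\w+)=(?P<value>.+?)', sample)
print(lazy)    # [('key1', 't'), ('key2', 'v')]

# Step 3: the lookahead forces the lazy match to extend up to
# the next " --" or the end of the string.
final = re.findall(r'--(?P<key>\w+)=(?P<value>.+?)(?=$|\s--)', sample)
print(final)   # [('key1', 'this is value 1'), ('key2', 'value2')]
```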
The finished regex is:
re.findall(r'--(?P<key>\w+)=(?P<value>.+?)(?=$|\s--)', string)
It should handle everything other than a value that contains whitespace followed by --.
For example:
import re
string = "--key1=value 1 has--really .:weird:. characters --key2=value2"
result = re.findall(r'--(?P<key>\w+)=(?P<value>.+?)(?=$|\s--)', string)
print(result)
gives:
[('key1', 'value 1 has--really .:weird:. characters'), ('key2', 'value2')]
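If you would rather have a dict than a list of tuples, the same pattern can be wrapped in a small helper. This is only a sketch: parse_options and PATTERN are hypothetical names, and it assumes each key appears at most once (a repeated key would overwrite the earlier value). It is run here against the three inputs from the question, plus one input that shows the limitation mentioned above.

```python
import re

# Same pattern as above, compiled once for reuse.
PATTERN = re.compile(r'--(?P<key>\w+)=(?P<value>.+?)(?=$|\s--)')

def parse_options(text):
    """Return a dict mapping each --key to the text after its '=' sign."""
    return {m.group('key'): m.group('value') for m in PATTERN.finditer(text)}

# The three inputs from the question that did not work originally.
print(parse_options("--key1=this is value 1 --key2=value2"))
# {'key1': 'this is value 1', 'key2': 'value2'}
print(parse_options("--key1=only kvp"))
# {'key1': 'only kvp'}
print(parse_options("--key1=this/doesnt/work/"))
# {'key1': 'this/doesnt/work/'}

# Limitation: whitespace followed by "--" inside a value ends it early.
print(parse_options("--key1=a --b c"))
# {'key1': 'a'}
```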