I have a badly managed log and need to extract its fields into a dictionary using Python.
# Pattern:
"kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=..."
# where
# - kw1=a
# - kw2=b, (b, b=b), bb
# - kw3=c
# - and so on
# extract into a dict:
out = {"kw1": "a", "kw2": "b, (b, b=b), bb", "kw3": "c", "kw4": ...}
Q1: Is there a regular expression that gets me the keys and values above?
Q2: I got an unexpected result. ', (.*?)=' should give me the shortest match between ',' and '=', right?
msg = 'a, a, b=b, c=c'
re.findall(', (.*?)=', msg)
>>> ['a, b', 'c']
# I was expecting ['b', 'c']
# Shouldn't ', (.*?)=' give me the shortest match between ',' and '=', which is 'b' rather than 'a, b'?
CodePudding user response:
Answer to Q1:
Here is my suggestion:
import re
s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=..."
pattern = r'(?=(kw.)=(.*?)(?:, kw.=|$))'
result = dict(re.findall(pattern, s))
print(result) # {'kw1': 'a', 'kw2': 'b, (b, b=b), bb', 'kw3': 'c', 'kw4': '...'}
To explain the regex:
- the (?=...) is a lookahead assertion to let you find overlapping matches
- the ? in (.*?) makes the quantifier * (asterisk) non-greedy
- the ?: makes the group (?:, kw.=|$) non-capturing
- the |$ at the end accounts for the last value in your string, which is followed by the end of the string rather than by another key
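To see what the lookahead contributes, here is a small sketch that runs the same pattern with and without the (?=...) wrapper. The variable names consuming and overlapping are only illustrative.

```python
import re

s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=..."

# Without the lookahead, each match consumes the ', kwN=' that ends it,
# so the key that follows is already used up and gets skipped.
consuming = re.findall(r'(kw.)=(.*?)(?:, kw.=|$)', s)
print(consuming)  # [('kw1', 'a'), ('kw3', 'c')]

# Wrapped in (?=...), every match is zero-width, so nothing is consumed
# and each key can start its own match.
overlapping = re.findall(r'(?=(kw.)=(.*?)(?:, kw.=|$))', s)
print(dict(overlapping))
```

Only every other key survives in the first call, which is why the suggestion above wraps the whole pattern in a lookahead.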
Answer to Q2:
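No, that expectation is wrong. The regex engine scans from left to right and starts the match at the first ', ' it finds. The non-greedy quantifier *? only makes the match end as early as possible from that starting point; it does not move the start to the right. Moreover, there is no search for overlapping matches, which could be done with (?=...). So your observed result is the expected one. I suggest this simple solution, which forbids commas inside the captured group: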
msg = 'a, a, b=b, c=c'
result = re.findall(', ([^,]*?)=', msg)
print(result) # ['b', 'c']
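If it helps, here is a minimal sketch that prints where each match of the original pattern starts and ends. It only uses re.finditer and re.findall; the names lazy and no_comma are just for illustration.

```python
import re

msg = 'a, a, b=b, c=c'

# The match begins at the first ', ' in the string; the lazy group then
# grows one character at a time until an '=' can follow it.
for m in re.finditer(r', (.*?)=', msg):
    print(m.span(), repr(m.group(0)), '->', repr(m.group(1)))
# (1, 8) ', a, b=' -> 'a, b'
# (9, 13) ', c=' -> 'c'

lazy = re.findall(r', (.*?)=', msg)
# Excluding commas from the group makes the attempt at the first ', ' fail,
# so the engine moves on to the next ', '.
no_comma = re.findall(r', ([^,]*?)=', msg)
print(lazy, no_comma)
```

The first match starts at index 1, the earliest ', ' in the string, which is why its captured group contains 'a, b'.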
CodePudding user response:
Q1: Is there a regex expression that helps me get above key and value?
To get the key/value pairs into a dictionary, you can use re.findall with a lookahead.
Say your string is
s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=dd, kw10=jndn"
The following regex gives you the keys and values as a list of tuples:
results = re.findall(r'(\bkw\d+)=(.*?)(?=, \s*\bkw\d+=|$)', s)
[('kw1', 'a'), ('kw2', 'b, (b, b=b), bb'), ('kw3', 'c'), ('kw4', 'dd'), ('kw10', 'jndn')]
You can convert it to a dictionary with
dict(results)
Output:
{
'kw1': 'a',
'kw2': 'b, (b, b=b), bb',
'kw3': 'c',
'kw4': 'dd',
'kw10': 'jndn'
}
Explanation:
- \b is a word boundary, so kw is matched only at the start of a word and not inside something like XYZkw
- kw\d+= matches the word kw followed by 1+ digits and =
- .*? is a lazy match: match as few chars as possible
- (?= opens a positive lookahead, asserting what is to the right
- , \s*\bkw\d+= matches a comma and a space, optional whitespace chars, then kw, 1+ digits and =
- | means or
- $ asserts the end of the string, for the last part
- ) closes the lookahead
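Matching one or more digits after kw matters once a key such as kw10 shows up. As a rough comparison, the sketch below runs a single-character key pattern (kw.) and the digit-based pattern on the same string; single_char and multi_digit are illustrative names only.

```python
import re

s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=dd, kw10=jndn"

# 'kw.' allows exactly one character after 'kw', so 'kw10' is never
# recognised as a key and its text ends up inside the value of 'kw4'.
single_char = dict(re.findall(r'(?=(kw.)=(.*?)(?:, kw.=|$))', s))

# 'kw\d+' allows one or more digits, so 'kw10' is matched as its own key.
multi_digit = dict(re.findall(r'(\bkw\d+)=(.*?)(?=, \s*\bkw\d+=|$)', s))

print(single_char['kw4'])   # dd, kw10=jndn
print(multi_digit['kw4'])   # dd
print(multi_digit['kw10'])  # jndn
```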