Extract Information with brackets using python

Time:08-20

I have a badly formatted log line and need to extract its contents into a dictionary using Python.

# Pattern:
"kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=..."

# where
# - kw1=a
# - kw2=b, (b, b=b), bb
# - kw3=c
# - and so on

# extract into a dict:
out = {"kw1": "a", "kw2": "b, (b, b=b), bb", "kw3": "c", "kw4": ...}

Q1: Is there a regular expression that extracts the keys and values above?

Q2: I got an unexpected result. Shouldn't ', (.*?)=' give me the shortest match between ',' and '='?

import re
msg = 'a, a, b=b, c=c'
re.findall(', (.*?)=', msg)
# actual result:   ['a, b', 'c']
# expected result: ['b', 'c']
# i.e. I expected the shortest text between ',' and '=', which is 'b' rather than 'a, b'

CodePudding user response:

Answer to Q1:

Here is my suggestion:

import re
s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=..."
pattern = r'(?=(kw.)=(.*?)(?:, kw.=|$))'
result = dict(re.findall(pattern, s))
print(result) # {'kw1': 'a', 'kw2': 'b, (b, b=b), bb', 'kw3': 'c', 'kw4': '...'}

To explain the regex:

  • the (?=...) is a lookahead assertion; because it consumes no characters, it lets you find overlapping matches
  • the ? in (.*?) makes the quantifier * (asterisk) non-greedy
  • the ?: makes the group (?:, kw.=|$) non-capturing
  • the |$ at the end handles the last value in your string, which is followed by the end of the string instead of another key
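
To see why the outer lookahead matters, here is a small sketch that runs the same pattern with and without it on the sample string. The variable names consuming and overlapping are only illustrative.

```python
import re

s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=..."

# Without the outer lookahead, each match consumes the trailing ', kw.=',
# so the key that follows can no longer start a match of its own.
consuming = re.findall(r'(kw.)=(.*?)(?:, kw.=|$)', s)

# Wrapped in (?=...), every match is zero-width and nothing is consumed.
overlapping = re.findall(r'(?=(kw.)=(.*?)(?:, kw.=|$))', s)

print(consuming)    # [('kw1', 'a'), ('kw3', 'c')]
print(overlapping)  # [('kw1', 'a'), ('kw2', 'b, (b, b=b), bb'), ('kw3', 'c'), ('kw4', '...')]
```

Every second key is lost in the first call because its 'kwN=' text was already used up as the terminator of the previous match.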

Answer to Q2:

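No, that is not quite how it works. The regex engine scans from left to right and reports the leftmost match; the lazy quantifier *? only makes the match as short as possible from that starting point. The first ', ' sits right after the first 'a', and from there the shortest text that reaches an '=' is 'a, b'. In addition, re.findall does not search for overlapping matches, which would require a lookahead such as (?=...). So the result you observed is the expected one. A simple solution is to stop the group from crossing a comma: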

msg = 'a, a, b=b, c=c'
result = re.findall(', ([^,]*?)=', msg)
print(result) # ['b', 'c']
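
As a quick check of the explanation above, this sketch uses re.finditer to print where each match starts. The names lazy and no_comma are just for illustration.

```python
import re

msg = 'a, a, b=b, c=c'

# Pattern from the question: the first ', ' is at index 1, so the first
# match starts there and the lazy group grows until it reaches an '='.
lazy = [(m.start(), m.group(1)) for m in re.finditer(r', (.*?)=', msg)]

# With [^,] the group cannot cross a comma, so the attempt at index 1
# fails and the engine retries at the next ', ' (index 4).
no_comma = [(m.start(), m.group(1)) for m in re.finditer(r', ([^,]*?)=', msg)]

print(lazy)      # [(1, 'a, b'), (9, 'c')]
print(no_comma)  # [(4, 'b'), (9, 'c')]
```

The start positions show that laziness shortens a match only from its right end; it never moves the starting point to the right.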

CodePudding user response:

Q1: Is there a regex expression that helps me get above key and value?

To get the key/value pairs as a dictionary, you can collect them with re.findall and pass the result to dict.

Say your string is

s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=dd, kw10=jndn"

The following regex gives you the keys and values as a list of tuples

import re
results = re.findall(r'(\bkw\d+)=(.*?)(?=,\s*\bkw\d+=|$)', s)

[('kw1', 'a'), ('kw2', 'b, (b, b=b), bb'), ('kw3', 'c'), ('kw4', 'dd'), ('kw10', 'jndn')]

You can convert it to a dictionary as

dict(results)

Output :

{
    'kw1': 'a', 
    'kw2': 'b, (b, b=b), bb', 
    'kw3': 'c', 
    'kw4': 'dd', 
    'kw10': 'jndn'
}

Explanation:

  • \b is a word boundary, so the pattern matches kw but not something like XYZkw

  • \bkw\d+= Match the word kw followed by one or more digits and =

  • .*? (Lazy match) Match as few chars as possible

  • (?= Positive lookahead, assert to the right

    • ,\s*\bkw\d+= Match a comma and optional whitespace chars, then kw, one or more digits and =
    • | Or
    • $ Assert the end of the string for the last part
  • ) Close the lookahead
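
The pieces explained above can be wrapped in a small helper. This is only a sketch: parse_pairs and its key_pattern parameter are hypothetical names, and it assumes every real key matches key_pattern while text inside a value (such as b=b) does not.

```python
import re

def parse_pairs(line, key_pattern=r"kw\d+"):
    """Split 'key=value, key=value' text into a dict.

    key_pattern is an assumption about what a key looks like; it must not
    contain capturing groups, or findall would return longer tuples.
    """
    pair = re.compile(
        rf"\b({key_pattern})=(.*?)(?=,\s*\b(?:{key_pattern})=|$)"
    )
    return dict(pair.findall(line))

s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=dd, kw10=jndn"
parsed = parse_pairs(s)
print(parsed["kw2"])   # b, (b, b=b), bb
print(parsed["kw10"])  # jndn
```

Keeping the key shape in one parameter means the same lookahead logic can be reused if the log uses a different key prefix.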
