Python regex query to parse very simple dictionary

Time:11-12

I am new to the regex module and am working through a simple case: extracting the keys and values from a simple dictionary.

The dictionary cannot contain nested dicts or lists, but it may contain simple tuples.

MWE

import re

# note: the dictionaries are simple and do NOT contain lists or nested dicts; these two examples suffice for the regex matching.
d = "{'a':10,'b':True,'c':(5,'a')}" # ['a', 10, 'b', True, 'c', (5,'a') ]
d = "{'c':(5,'a'), 'd': 'TX'}" # ['c', (5,'a'),  'd', 'TX']

regexp = r"(.*):(.*)" # I am not sure how to repeat this pattern separated by ,

out = re.match(regexp,d).groups()
out

CodePudding user response:

You should not use regex for this job. When the input string is valid Python syntax, you can use ast.literal_eval.

Like this:

import ast
# ...
out = ast.literal_eval(d)

Now you have a Python dictionary object. You can, for instance, get the key/value pairs as a dict_items view:

print(out.items())
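To get the flat key/value list shown in the question's comments, the parsed dictionary can be walked once. The following is only a minimal sketch: the helper name flatten_dict_literal is made up for illustration and is not part of ast.

```python
import ast

def flatten_dict_literal(text):
    """Parse a dict literal and return [key1, value1, key2, value2, ...].

    Hypothetical helper for illustration; only ast.literal_eval is a real API.
    """
    parsed = ast.literal_eval(text)  # evaluates Python literals only, no arbitrary code
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict literal")
    flat = []
    for key, value in parsed.items():  # dicts keep insertion order (Python 3.7+)
        flat.extend([key, value])
    return flat

print(flatten_dict_literal("{'a':10,'b':True,'c':(5,'a')}"))
# ['a', 10, 'b', True, 'c', (5, 'a')]
print(flatten_dict_literal("{'c':(5,'a'), 'd': 'TX'}"))
# ['c', (5, 'a'), 'd', 'TX']
```

Note that the values come back as real Python objects (int, bool, tuple), not as substrings of the input.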

Regex

Regex is not the right tool here: there will always be some boundary case that gets parsed wrongly. But to get repeated matches, it is better to use findall. Here is a simple example regex:

regexp = r"([^{\s][^:]*):([^:}]*)(?:[,}])"
out = re.findall(regexp, d)

This will give a list of (key, value) pairs, with each key and value still a string.
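As a rough illustration, here is that pattern applied to both strings from the question. The wrapper find_pairs and the final URL-style input are assumptions added for this sketch; the strip() calls only remove whitespace that the pattern leaves around a value.

```python
import re

regexp = r"([^{\s][^:]*):([^:}]*)(?:[,}])"

def find_pairs(text):
    # Hypothetical wrapper: run findall and trim whitespace around each piece.
    return [(k.strip(), v.strip()) for k, v in re.findall(regexp, text)]

print(find_pairs("{'a':10,'b':True,'c':(5,'a')}"))
# [("'a'", '10'), ("'b'", 'True'), ("'c'", "(5,'a')")]
print(find_pairs("{'c':(5,'a'), 'd': 'TX'}"))
# [("'c'", "(5,'a')"), ("'d'", "'TX'")]

# A boundary case (made-up input): a colon inside a string value.
# The result is not the intended single ("'url'", "'http://x'") pair.
print(find_pairs("{'url': 'http://x'}"))
```

The last call shows the kind of input where the pattern splits in the wrong place, which is why ast.literal_eval is the safer choice.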

CodePudding user response:

Regex would be hard to use here (perhaps impossible, though I am not versed enough to say so confidently) because of the commas nested in your tuples. Just for the sake of it, I wrote regex-free code that scans your string for separators and ignores anything inside parentheses:

d = "{'c':(5,'a',1), 'd': 'TX', 1:(1,2,3)}" 

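# Strip the braces so only the key/value text remains.
d = d.replace("{", "").replace("}", "")

# Record the position of every ':' and ',' that is outside parentheses.
indices = []
inside = False  # True while scanning inside a tuple
for i, ch in enumerate(d):
    if inside:
        if ch == ")":
            inside = False
        continue
    if ch == "(":
        inside = True
        continue
    if ch in {":", ","}:
        indices.append(i)
indices.append(len(d))  # sentinel so the last part is included

# Slice the string between consecutive separator positions.
parts = []
start = 0
for i in indices:
    parts.append(d[start:i].strip())
    start = i + 1  # skip the separator itself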

parts
#  ["'c'", "(5,'a',1)", "'d'", "'TX'", '1', '(1,2,3)']
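The parts are still strings. One possible follow-up, sketched here under the assumption that every part is a valid Python literal, is to convert each one with ast.literal_eval and pair keys with values by position. The names values and pairs are introduced only for this example.

```python
import ast

# The list produced by the scanning code above.
parts = ["'c'", "(5,'a',1)", "'d'", "'TX'", '1', '(1,2,3)']

# Convert each substring into the Python object it spells out.
values = [ast.literal_eval(p) for p in parts]

# Even positions hold keys, odd positions hold the matching values.
pairs = list(zip(values[0::2], values[1::2]))
print(pairs)
# [('c', (5, 'a', 1)), ('d', 'TX'), (1, (1, 2, 3))]
```

Calling dict(pairs) would rebuild the dictionary, at which point the result is the same as calling ast.literal_eval on the original string.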