I am new to Python and trying to learn regex by example. In this example I am trying to extract the dictionary parts from a multiline text. How can I extract the parts between each pair of braces in the following example?
MWE: How can I get a pandas dataframe from this data?
import re
import pandas as pd
s = """
[
{
specialty: "Anatomic/Clinical Pathology",
one: " 12,643 ",
two: " 8,711 ",
three: " 385 ",
four: " 520 ",
five: " 3,027 ",
},
{
specialty: "Nephrology",
one: " 11,407 ",
two: " 9,964 ",
three: " 140 ",
four: " 316 ",
five: " 987 ",
},
{
specialty: "Vascular Surgery",
one: " 3,943 ",
two: " 3,586 ",
three: " 48 ",
four: " 13 ",
five: " 296 ",
},
]
"""
m = re.match('({.*})', s, flags=re.S)
data = m.groups()
df = pd.DataFrame(data)
CodePudding user response:
I suggest adding double quotes around the keys, then casting the string to a list of dictionaries, and then simply reading the structure into a pandas dataframe using pd.DataFrame.from_dict:
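import pandas as pd
from ast import literal_eval
import re
s = "YOUR STRING HERE"
# Wrap each bare key at the start of a line in double quotes
fixed_s = re.sub(r"^(\s*)(\w+):", r'\1"\2":', s, flags=re.M)
# Parse the text as a list of dicts and load it into a dataframe
df = pd.DataFrame.from_dict(literal_eval(fixed_s))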
The ^(\s*)(\w+): regex matches zero or more whitespace characters at the start of any line (the flags=re.M option makes ^ match at the start of every line, not only at the start of the string), capturing them into Group 1. It then matches one or more word characters, capturing them into Group 2, and finally matches a : character. The replacement, \1"\2":, puts Group 1 back, followed by Group 2 wrapped in double quotes and then the colon.
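To see the substitution on its own, here is a minimal sketch that runs it on a two-key sample in the same shape as the question's data. The names sample, key_pattern and quoted are only illustrative and are not part of the original answer.

```python
import re

# Small sample in the same shape as the question's data (unquoted keys).
sample = '''{
specialty: "Nephrology",
one: " 11,407 ",
}'''

# ^      start of a line (because of re.M)
# (\s*)  Group 1: optional leading whitespace
# (\w+)  Group 2: the key name
# :      the colon that follows the key
key_pattern = re.compile(r"^(\s*)(\w+):", flags=re.M)

# Put Group 1 back, then Group 2 wrapped in double quotes, then the colon.
quoted = key_pattern.sub(r'\1"\2":', sample)
print(quoted)
# {
# "specialty": "Nephrology",
# "one": " 11,407 ",
# }
```

Only text at the start of a line is touched, so colons or words inside the quoted values are left alone.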
The result is cast to a list of dictionaries using ast.literal_eval.
Then, the list is used to initialize the dataframe.
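Putting the three steps together, the following is a sketch on a shortened copy of the question's data (two records, three keys each). It assumes every key sits at the start of its own line, as in the question. The records variable and the strip() call are additions for illustration, and pd.DataFrame(records) is used here as the plain constructor form for a list of dictionaries.

```python
import re
from ast import literal_eval

import pandas as pd

# Shortened copy of the question's data (two records instead of three).
s = """
[
{
specialty: "Nephrology",
one: " 11,407 ",
two: " 9,964 ",
},
{
specialty: "Vascular Surgery",
one: " 3,943 ",
two: " 3,586 ",
},
]
"""

# Step 1: quote the bare keys so each block becomes a valid dict literal.
fixed_s = re.sub(r"^(\s*)(\w+):", r'\1"\2":', s, flags=re.M)

# Step 2: parse the text; strip() drops the surrounding blank lines first.
# Trailing commas are allowed in Python list and dict literals.
records = literal_eval(fixed_s.strip())

# Step 3: build the dataframe, one row per dictionary.
df = pd.DataFrame(records)
print(df)
```

The values are still strings with surrounding spaces and thousands separators, exactly as they appear in the input; converting them to numbers would be a separate step.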