I am new to Python and trying to learn regex by example. Here I am trying to extract the dictionary-like parts from a multiline text. How can I extract the parts between each pair of braces in the following example?

MWE: How can I get a pandas DataFrame from this data?

import re
import pandas as pd

s = """
[
          {
            specialty: "Anatomic/Clinical Pathology",
            one: " 12,643 ",
            two: " 8,711 ",
            three: " 385 ",
            four: " 520 ",
            five: " 3,027 ",
          },
          {
            specialty: "Nephrology",
            one: " 11,407 ",
            two: " 9,964 ",
            three: " 140 ",
            four: " 316 ",
            five: " 987 ",
          },
          {
            specialty: "Vascular Surgery",
            one: " 3,943 ",
            two: " 3,586 ",
            three: " 48 ",
            four: " 13 ",
            five: " 296 ",
          },
        ]
"""

m = re.match('({.*})', s, flags=re.S)
data = m.groups()
df = pd.DataFrame(data)

Answer:

I suggest adding double quotes around the keys, then parsing the string into a list of dictionaries with ast.literal_eval, and then reading that structure into a pandas DataFrame with pd.DataFrame.from_dict:

import pandas as pd
from ast import literal_eval
import re

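s = "YOUR STRING HERE"
# Quote the bare keys so that each block becomes a valid Python dict literal
fixed_s = re.sub(r"^(\s*)(\w+):", r'\1"\2":', s, flags=re.M)
# Parse the text into a list of dicts and build the DataFrame from it
df = pd.DataFrame.from_dict(literal_eval(fixed_s))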

The ^(\s*)(\w+): regex matches zero or more whitespace characters at the start of a line (flags=re.M makes ^ match at the start of every line, not only at the start of the string) and captures them into Group 1, then matches one or more word characters and captures them into Group 2, and finally matches a colon. The replacement \1"\2": writes back Group 1, then Group 2 wrapped in double quotes, then the colon.
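
To see the substitution on its own, here is a minimal sketch that applies it to a shortened sample. The sample string and the variable names are illustrative and not part of the question's data:

```python
import re

# Shortened sample in the same shape as the question's data (illustrative).
sample = '''{
    specialty: "Nephrology",
    one: " 11,407 ",
}'''

# With re.M, ^ matches at every line start. \1 restores the indentation,
# \2 is the bare key, which gets wrapped in double quotes.
quoted = re.sub(r"^(\s*)(\w+):", r'\1"\2":', sample, flags=re.M)
print(quoted)
```

Only the keys at the start of a line are touched, so colons or word characters inside the quoted values are left alone.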

The result is parsed into a list of dictionaries using ast.literal_eval, which accepts the trailing commas in the input.
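
The parsing step can be tried in isolation. This is a small sketch on an illustrative string whose keys are already quoted; the call to strip() is an extra precaution I am assuming here, to remove the surrounding blank lines before parsing:

```python
from ast import literal_eval

# Text whose keys are already quoted (illustrative sample).
text = '''
[
    {"specialty": "Nephrology", "one": " 11,407 ",},
    {"specialty": "Vascular Surgery", "one": " 3,943 ",},
]
'''

# literal_eval only evaluates Python literals (lists, dicts, strings, numbers...),
# so it does not execute arbitrary code the way eval() would.
records = literal_eval(text.strip())
print(type(records).__name__, len(records))
print(records[0]["specialty"])
```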

Then, the list is used to initialize the dataframe.
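
Putting the steps together, a possible end-to-end sketch on a trimmed copy of the question's data looks like this. The final cleanup loop is my own assumption about what you may want next (the values arrive as padded strings with thousands separators), and num_cols is a hypothetical helper name:

```python
import re
from ast import literal_eval

import pandas as pd

# Trimmed copy of the question's data (two records, three keys each).
s = """
[
  {
    specialty: "Nephrology",
    one: " 11,407 ",
    two: " 9,964 ",
  },
  {
    specialty: "Vascular Surgery",
    one: " 3,943 ",
    two: " 3,586 ",
  },
]
"""

# Step 1: quote the bare keys. Step 2: parse. Step 3: build the DataFrame.
fixed_s = re.sub(r"^(\s*)(\w+):", r'\1"\2":', s, flags=re.M)
df = pd.DataFrame(literal_eval(fixed_s.strip()))

# Assumed cleanup: strip the padding, drop the commas, convert to integers.
num_cols = ["one", "two"]
for col in num_cols:
    df[col] = df[col].str.strip().str.replace(",", "", regex=False).astype(int)

print(df)
```

Without the cleanup loop the numeric columns stay as strings, which is what the original answer produces.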