Change a column containing list of dict to columns in a DataFrame-CodePudding

I have the following DataFrame which contains a column that is a list of dict items:

    d = pd.DataFrame([
        ['Green', [{'Desc:': 'STERLING GREEN SO'}, {'Sec:': '01'}, {'Lot:': 'L0038'}, {'Block:': 'B0008'}]],
        ['Apply', [{'Desc:': 'STERLING GREEN SO'}, {'Sec:': '01'}, {'Lot:': 'L0038'}, {'Block:': 'B0008'}]],
        ['Range', [{'Desc:': 'STERLING GREEN SO'}, {'Sec:': '01'}, {'Lot:': 'L0038'}, {'Block:': 'B0008'}]],
        ['Peop',  [{'Desc:': 'STERLING GREEN SO'}, {'Sec:': '01'}, {'Lot:': 'L0038'}, {'Block:': 'B0008'}]]
        ], columns=['Name', 'Legal Description'])

and I want to transform it to a simple DataFrame like so:

    d = pd.DataFrame([
        ['Green', 'STERLING GREEN SO', '01', 'L0038', 'B0008'],
        ['Apply', 'STERLING GREEN SO', '01', 'L0038', 'B0008'],
        ['Range', 'STERLING GREEN SO', '01', 'L0038', 'B0008'],
        ['Peop',  'STERLING GREEN SO', '01', 'L0038', 'B0008']
        ], columns=['Name', 'Legal Description', 'Desc', 'Sec', 'Lot', 'Block'])

CodePudding user response：

IMO, the ideal solution would be to act upstream and get a properly formatted dictionary or dataframe.

The issue with your list of single-keyed dictionaries is that you have to merge them. You can use a dictionary comprehension for that and convert to Series:

d2 = d['Legal Description'].apply(lambda c:
                                  pd.Series({next(iter(x.keys())).strip(':'):
                                             next(iter(x.values())) for x in c})
                                  )

Then join to the original dataframe:

d.drop(columns='Legal Description').join(d2)

output:

    Name               Desc Sec    Lot  Block
0  Green  STERLING GREEN SO  01  L0038  B0008
1  Apply  STERLING GREEN SO  01  L0038  B0008
2  Range  STERLING GREEN SO  01  L0038  B0008
3   Peop  STERLING GREEN SO  01  L0038  B0008

CodePudding user response：

If possible, you should wrangle your data before creating the DataFrame. It's faster than reshaping the DataFrame after being created. For instance, something like

data = [
    ['Green', [{'Desc:': 'STERLING GREEN SO'}, {'Sec:': '01'}, {'Lot:': 'L0038'}, {'Block:': 'B0008'}]],
    ['Apply', [{'Desc:': 'STERLING GREEN SO'}, {'Sec:': '01'}, {'Lot:': 'L0038'}, {'Block:': 'B0008'}]],
    ['Range', [{'Desc:': 'STERLING GREEN SO'}, {'Sec:': '01'}, {'Lot:': 'L0038'}, {'Block:': 'B0008'}]],
    ['Peop',  [{'Desc:': 'STERLING GREEN SO'}, {'Sec:': '01'}, {'Lot:': 'L0038'}, {'Block:': 'B0008'}]]
]

records = []

for name, legal_desc in data:
    rec = {'Name': name}
    rec.update(x for d in legal_desc for x in d.items())
    records.append(rec)
    
d = pd.DataFrame(records)

Output:

>>> d

    Name              Desc: Sec:   Lot: Block:
0  Green  STERLING GREEN SO   01  L0038  B0008
1  Apply  STERLING GREEN SO   01  L0038  B0008
2  Range  STERLING GREEN SO   01  L0038  B0008
3   Peop  STERLING GREEN SO   01  L0038  B0008

>>> records 

[{'Name': 'Green', 'Desc:': 'STERLING GREEN SO', 'Sec:': '01', 'Lot:': 'L0038', 'Block:': 'B0008'}, {'Name': 'Apply', 'Desc:': 'STERLING GREEN SO', 'Sec:': '01', 'Lot:': 'L0038', 'Block:': 'B0008'}, {'Name': 'Range', 'Desc:': 'STERLING GREEN SO', 'Sec:': '01', 'Lot:': 'L0038', 'Block:': 'B0008'}, {'Name': 'Peop', 'Desc:': 'STERLING GREEN SO', 'Sec:': '01', 'Lot:': 'L0038', 'Block:': 'B0008'}]

CodePudding user response：

You can also use:

df.set_index('Name', inplace=True)
df =  df['Legal Description'].explode().apply(pd.Series).groupby(level=0).sum().reset_index()

OUTPUT

    Name              Desc: Sec:   Lot: Block:
0  Apply  STERLING GREEN SO   01  L0038  B0008
1  Green  STERLING GREEN SO   01  L0038  B0008
2   Peop  STERLING GREEN SO   01  L0038  B0008
3  Range  STERLING GREEN SO   01  L0038  B0008