I have a list of "dictionary of dictionaries" that looks like this:
lis = [{'Health and Welfare Plan Change Notification': {'evidence_capture': 'null',
'test_result_justification': 'null',
'latest_test_result_date': 'null',
'last_updated_by': 'null',
'test_execution_status': 'Not Started',
'test_result': 'null'}},
{'Health and Welfare Plan Computations': {'evidence_capture': 'null',
'test_result_justification': 'null',
'latest_test_result_date': 'null',
'last_updated_by': 'null',
'test_execution_status': 'Not Started',
'test_result': 'null'}},
{'Health and Welfare Plan Data Agreements': {'evidence_capture': 'null',
'test_result_justification': 'Due to the Policy',
'latest_test_result_date': '2019-10-02',
'last_updated_by': 'null',
'test_execution_status': 'In Progress',
'test_result': 'null'}},
{'Health and Welfare Plan Data Elements': {'evidence_capture': 'null',
'test_result_justification': 'xxx',
'latest_test_result_date': '2019-10-02',
'last_updated_by': 'null',
'test_execution_status': 'In Progress',
'test_result': 'null'}},
{'Health and Welfare Plan Data Quality Monitoring': {'evidence_capture': 'null',
'test_result_justification': 'xxx',
'latest_test_result_date': '2019-08-09',
'last_updated_by': 'null',
'test_execution_status': 'Completed',
'test_result': 'xxx'}},
{'Health and Welfare Plan HPU Source Reliability': {'evidence_capture': 'null',
'test_result_justification': 'xxx.',
'latest_test_result_date': '2019-10-02',
'last_updated_by': 'null',
'test_execution_status': 'In Progress',
'test_result': 'null'}},
{'Health and Welfare Plan Lineage': {'evidence_capture': 'null',
'test_result_justification': 'null',
'latest_test_result_date': 'null',
'last_updated_by': 'null',
'test_execution_status': 'Not Started',
'test_result': 'null'}},
{'Health and Welfare Plan Metadata': {'evidence_capture': 'null',
'test_result_justification': 'Valid',
'latest_test_result_date': '2020-07-02',
'last_updated_by': 'null',
'test_execution_status': 'Completed',
'test_result': 'xxx'}},
{'Health and Welfare Plan Usage Reconciliation': {'evidence_capture': 'null',
'test_result_justification': 'Test out of scope',
'latest_test_result_date': '2019-10-02',
'last_updated_by': 'null',
'test_execution_status': 'In Progress',
'test_result': 'null'}}]
I would like to convert the list into a dataframe that looks like this:
evidence_capture last_updated_by latest_test_result_date test_execution_status test_result test_result_justification test_category
Change Notification null null null Not Started null null Health and Welfare Plan
Computations null null null Not Started null null Health and Welfare Plan
Data Agreements null null 2019-10-02 In Progress null Due to the Policy Health and Welfare Plan
Data Elements null null 2019-10-02 In Progress null xxx Health and Welfare Plan
Data Quality Monitoring null null 2019-08-09 Completed xxx xxx Health and Welfare Plan
HPU Source Reliability null null 2019-10-02 In Progress null xxx. Health and Welfare Plan
Lineage null null null Not Started null null Health and Welfare Plan
Metadata null null 2020-07-02 Completed xxx Valid Health and Welfare Plan
Usage Reconciliation null null 2019-10-02 In Progress null Test out of scope Health and Welfare Plan
My code to build the dataframe is using a for-loop to concat the records column by column. After that to process the column names, and then transpose it. The final output would have the repeated string "Health and Welfare Plan" removed from each row index, but appended as a new column.
df3 = pd.DataFrame(lis[0])
for i in range(1, len(lis)):
df3 = pd.concat([df3, pd.DataFrame(lis[i])], axis=1)
df3.columns = [col.split(' ')[1] for col in df3.columns]
df3 = df3.T
df3['test_category'] = 'Health and Welfare Plan'
print(df3)
The code is able to produce the final output, but using "expensive" functions of both for-loop and dataframe concat. So I was wondering if there is a better way to output the same results?
CodePudding user response:
You can build your dataframe out of the dictionary values and set the index manually by splitting by ' ':
df = pd.DataFrame([list(e.values())[0] for e in lis])
df.index = map(lambda x: x.split(' ')[-1].strip(), [list(e.keys())[0] for e in lis])
print(df)
Output:
evidence_capture test_result_justification ... test_execution_status test_result
Change Notification null null ... Not Started null
Computations null null ... Not Started null
Data Agreements null Due to the Policy ... In Progress null
Data Elements null xxx ... In Progress null
Data Quality Monitoring null xxx ... Completed xxx
HPU Source Reliability null xxx. ... In Progress null
Lineage null null ... Not Started null
Metadata null Valid ... Completed xxx
Usage Reconciliation null Test out of scope ... In Progress null
[9 rows x 6 columns]
Edit: it might be faster to do it all in one loop:
data = []
ind = []
for e in lis:
data.append(list(e.values())[0])
ind.append(list(e.keys())[0].split(' ')[1].strip())
df = pd.DataFrame(data, index=ind)
You can still add your new column at the end with df['test_category'] = 'Health and Welfare Plan'
CodePudding user response:
Let us do dict comp to flatten the list of dictionaries
pd.DataFrame({k.split(' ')[1]: v for d in lis for k, v in d.items()}).T
evidence_capture test_result_justification latest_test_result_date last_updated_by test_execution_status test_result
Change Notification null null null null Not Started null
Computations null null null null Not Started null
Data Agreements null Due to the Policy 2019-10-02 null In Progress null
Data Elements null xxx 2019-10-02 null In Progress null
Data Quality Monitoring null xxx 2019-08-09 null Completed xxx
HPU Source Reliability null xxx. 2019-10-02 null In Progress null
Lineage null null null null Not Started null
Metadata null Valid 2020-07-02 null Completed xxx
Usage Reconciliation null Test out of scope 2019-10-02 null In Progress null
CodePudding user response:
I have modified part of your code to get the same results. Using dictionary manipulation should be faster
import pandas as pd
dic = {}
for e in lis:
dic[list(e.keys())[0]] = list(e.values())[0]
df3 = pd.DataFrame(dic)
df3.columns = [col.split(' ')[-1].strip() for col in df3.columns]
df3 = df3.T
df3['test_category'] = 'Health and Welfare Plan'
print(df3)