Need help on the below nested dictionary and I want to convert this to a pandas Data Frame
My JSON have the following instance of CPU data and comes with random occurrence:
Instance1 [{'datapoints': [{'statistic': 'Minimum', 'timestamp': '2021-08-31 06:50:00.000000', 'value': 59.03}, {'statistic': 'Minimum', 'timestamp': '2021-08-18 02:50:00.000000', 'value': 59.37}, {'statistic': 'Minimum', 'timestamp': '2021-08-24 16:50:00.000000', 'value': 58.84},, 'metric': 'VolumeIdleTime', 'unit': 'Seconds'}]
Instance2 [{'datapoints': [{'statistic': 'Minimum', 'timestamp': '2021-08-31 06:50:00.000000', 'value': 60}, {'statistic': 'Minimum', 'timestamp': '2021-08-18 02:50:00.000000', 'value': 55.45}, {'statistic': 'Minimum', 'timestamp': '2021-08-24 16:50:00.000000', 'value': 54.16}, {'statistic': 'Minimum', 'timestamp': '2021-08-06 07:50:00.000000', 'value': 50.03}, {'statistic': 'Minimum', 'timestamp': '2021-08-04 22:50:00.000000', 'value': 60}, {'statistic': 'Minimum', 'timestamp': '2021-08-26 01:50:00.000000', 'value': 60.34}, 'metric': 'VolumeIdleTime', 'unit': 'Seconds'}]
Instance3 [{'datapoints': [{'statistic': 'Minimum', 'timestamp': '2021-08-31 06:50:00.000000', 'value': 60}, {'statistic': 'Minimum', 'timestamp': '2021-08-18 02:50:00.000000', 'value': 38.12}, {'statistic': 'Minimum', 'timestamp': '2021-08-24 16:50:00.000000', 'value': 42.31}, {'statistic': 'Minimum', 'timestamp': '2021-08-06 07:50:00.000000', 'value': 45.22}, {'statistic': 'Minimum', 'timestamp': '2021-08-04 22:50:00.000000', 'value': 40.51}, {'statistic': 'Minimum', 'timestamp': '2021-08-26 01:50:00.000000', 'value': 34.35}, {'statistic': 'Minimum', 'timestamp': '2021-08-11 12:50:00.000000', 'value': 46.33},'metric': 'VolumeIdleTime', 'unit': 'Seconds'}]
And many more instance details to follow ( Close to 8K instance information )
Question : How to advance the below coding so it can fetch from all rows of every instance :
data = [{'datapoints': [{'statistic': 'Minimum', 'timestamp': '2021-08-31 06:50:00.000000', 'value': 59.03},{'statistic': 'Minimum', 'timestamp': '2021-08-18 02:50:00.000000', 'value': 59.37}, {'statistic': 'Minimum', 'timestamp': '2021-08-24 16:50:00.000000', 'value': 58.84}],'metric': 'VolumeIdleTime', 'unit': 'Seconds'}]
df = pd.json_normalize(data, record_path="datapoints", meta=["metric", "unit"])
print(df)
CodePudding user response:
I don't really understand what the problem is, but if you method doesn't work for all rows, do this:
If your data looks like:
data = [{'datapoints': [{'statistic': 'Minimum', 'timestamp': '2021-08-31 06:50:00.000000', 'value': 59.03},{'statistic': 'Minimum', 'timestamp': '2021-08-18 02:50:00.000000', 'value': 59.37}, {'statistic': 'Minimum', 'timestamp': '2021-08-24 16:50:00.000000', 'value': 58.84}],'metric': 'VolumeIdleTime', 'unit': 'Seconds'}]
df = pd.DataFrame(data)
that is:
datapoints metric unit
0 [{'statistic': 'Minimum', 'timestamp': '2021-0... VolumeIdleTime Seconds
You can define the following function:
def flatten_nested_json_df(df):
df = df.reset_index()
s = (df.applymap(type) == list).all()
list_columns = s[s].index.tolist()
s = (df.applymap(type) == dict).all()
dict_columns = s[s].index.tolist()
while len(list_columns) > 0 or len(dict_columns) > 0:
new_columns = []
for col in dict_columns:
horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
horiz_exploded.index = df.index
df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
new_columns.extend(horiz_exploded.columns) # inplace
for col in list_columns:
#print(f"exploding: {col}")
df = df.drop(columns=[col]).join(df[col].explode().to_frame())
new_columns.append(col)
s = (df[new_columns].applymap(type) == list).all()
list_columns = s[s].index.tolist()
s = (df[new_columns].applymap(type) == dict).all()
dict_columns = s[s].index.tolist()
return df
and apply it:
flatten_nested_json_df(df)
which returns:
index metric unit datapoints.statistic \
0 0 VolumeIdleTime Seconds Minimum
0 0 VolumeIdleTime Seconds Minimum
0 0 VolumeIdleTime Seconds Minimum
datapoints.timestamp datapoints.value
0 2021-08-31 06:50:00.000000 59.03
0 2021-08-18 02:50:00.000000 59.37
0 2021-08-24 16:50:00.000000 58.84
CodePudding user response:
below code will loop through all your instance and concat them all to one big dataframe:
import pandas as pd
df = pd.json_normalize(instance1, record_path="datapoints", meta=["metric", "unit"])
for instance in your_remained_instances:
df = pd.concat([df, pd.json_normalize(instance, record_path="datapoints", meta=["metric", "unit"])])