Converting a nested dictionary structure to a Pandas dataframe?-CodePudding

I have a list of nested dictionary structures that looks like this:

{'1278.1': 
  {'Time Distribution': 'Exponential', 
    'Time Distribution Parameters': {'Equivalent Lambda': 950.486, 'Average Packet Lambda': 0.950486, 'Exponential Max Factor': 10.0}, 
   'Size Distribution': 'Binomial', x
    'Size Distribution Parameters': {'Average Packet Size': 1000.0, 'Packet Size 1': 300.0, 'Packet Size 2': 1700.0}}}

The first numerical value (here shown as '1278.1') is known as the max avg lambda value. I would like to create a data frame that is formatted with columns such as:

Max Avg Lamba Time Distribution Equivalent Lambda Average Packet Lambda ... Size Distribution ... Packet Size 2

How is this possible? Also, the data that I am working with does not always have the same Time Distribution Parameters or Size Distribution Parameters. For instance, there might sometimes be a Packet Size 3 at times, but not always. How can I create a dataframe where some of these values are empty when something like Packet Size 3 is not present?

CodePudding user response：

This might already have an answer here

Answer on the link above says you can directly input a dictionary to the pd.DataFrame function and it will spit out a data frame of the input dict.

The code below should correctly format the above dict and change it into a format that allows the DataFrame method to read it correctly.

import copy
import pandas as pd

d = {
  "1278.1": {"Time Distribution": "Exponential", 
      "Time Distribution Parameters": {"Equivalent Lambda": 950.486, "Average Packet Lambda": 0.950486, "Exponential Max Factor": 10.0
        }, 
     "Size Distribution": "Binomial",
      "Size Distribution Parameters": {"Average Packet Size": 1000.0, "Packet Size 1": 300.0, "Packet Size 2": 1700.0
        }
    }
}

# Convert to list to get keys(max avg lambdas)
max_avg_lambdas = list(d)

list_of_dicts = []

# If there are more than 1 keys iterate and create new dict
for max_avg_lambda in max_avg_lambdas:
  # Create new key/value pair of the max avg lambda inside of Time dist parameters
  d[max_avg_lambda]["Time Distribution Parameters"]["Max Avg Lambda"] = max_avg_lambda

  # Create a new dict with contents of max_avg_lambda key dict
  fixed_dict = copy.deepcopy(d[max_avg_lambda])

  # Append dict to a list of dicts
  list_of_dicts.append(fixed_dict)

for info_dict in list_of_dicts:
  df = pd.DataFrame(info_dict)

  with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)

print(fixed_dict)

Output dict

{
    "Time Distribution": "Exponential",
    "Time Distribution Parameters": {
        "Max Avg Lambda": "1278.1",
        "Equivalent Lambda": 950.486,
        "Average Packet Lambda": 0.950486,
        "Exponential Max Factor": 10.0
    },
    "Size Distribution": "Binomial",
    "Size Distribution Parameters": {
        "Average Packet Size": 1000.0,
        "Packet Size 1": 300.0,
        "Packet Size 2": 1700.0
    }
}

Out:

                       Time Distribution Time Distribution Parameters  \
Equivalent Lambda            Exponential                      950.486   
Average Packet Lambda        Exponential                     0.950486   
Exponential Max Factor       Exponential                         10.0   
Max Avg Lambda               Exponential                       1278.1   
Average Packet Size          Exponential                          NaN   
Packet Size 1                Exponential                          NaN   
Packet Size 2                Exponential                          NaN   

                       Size Distribution  Size Distribution Parameters  
Equivalent Lambda               Binomial                           NaN  
Average Packet Lambda           Binomial                           NaN  
Exponential Max Factor          Binomial                           NaN  
Max Avg Lambda                  Binomial                           NaN  
Average Packet Size             Binomial                        1000.0  
Packet Size 1                   Binomial                         300.0  
Packet Size 2                   Binomial                        1700.0

CodePudding user response：

pd.json_normalize() allows to flatten nested data into pandas columns. If Packet Size 3 is available in some rows but not in others, the missing values will be represented as np.nan. A possible workflow would be:

import pandas as pd

data = {'1278.1': {'Time Distribution': 'Exponential', 'Time Distribution Parameters': {'Equivalent Lambda': 950.486, 'Average Packet Lambda': 0.950486, 'Exponential Max Factor': 10.0}, 'Size Distribution': 'Binomial', 'Size Distribution Parameters': {'Average Packet Size': 1000.0, 'Packet Size 1': 300.0, 'Packet Size 2': 1700.0}}}

#read dataframe with Max Avg Lamba as index, then reset index to column
df = pd.DataFrame.from_dict(data,orient='index').reset_index().rename(columns={'index': 'Max Avg Lamba'})

#flatten Time Distribution Parameters and Size Distribution Parameters, join with dataframe
df = df.join(pd.json_normalize(df['Time Distribution Parameters']))
df = df.join(pd.json_normalize(df['Size Distribution Parameters']))

#remove redundant columns
df = df.drop(columns=['Time Distribution Parameters', 'Size Distribution Parameters'])

Output:

	Max Avg Lamba	Time Distribution	Size Distribution	Equivalent Lambda	Average Packet Lambda	Exponential Max Factor	Average Packet Size	Packet Size 1	Packet Size 2
0	1278.1	Exponential	Binomial	950.486	0.950486	10	1000	300	1700