Home > Software design >  Pandas long to wide with hierarchical column headers
Pandas long to wide with hierarchical column headers

Time:06-02

I've looked up and down but can't find an example that closely resembles what I'm trying to do. In short, I have multiple trials of an experiment with multiple variables. The data is collected and stored in a long dataframe object, like so (generated data example):

Input:

import pandas as pd
import itertools
from scipy import rand

trials = ['trial_1' , 'trial_2']
ppi = ['5_ppi', '10_ppi']
rd = ['7%', '12%']
iter = [rd, ppi, trials]
df = pd.DataFrame()

for k, params in enumerate(list(itertools.product(*iter))):
    ext_data = [rand(), rand()]
    load_data = [rand(), rand()]
    df1 = pd.DataFrame({'ext': ext_data, 'load': load_data, 'trial': params[2], 'ppi': params[1], 'rd': params[0]})
    df = pd.concat([df, df1], axis=0)

print(df)

Output:

df
        ext      load    trial     ppi   rd
0  0.287997  0.874457  trial_1   5_ppi   7%
1  0.783776  0.878291  trial_1   5_ppi   7%
0  0.015054  0.886801  trial_2   5_ppi   7%
1  0.243617  0.234560  trial_2   5_ppi   7%
0  0.291621  0.519084  trial_1  10_ppi   7%
1  0.627786  0.072551  trial_1  10_ppi   7%
0  0.349199  0.235718  trial_2  10_ppi   7%
1  0.284535  0.328547  trial_2  10_ppi   7%
0  0.725747  0.688157  trial_1   5_ppi  12%
1  0.656839  0.297645  trial_1   5_ppi  12%
0  0.534276  0.794199  trial_2   5_ppi  12%
1  0.680596  0.381575  trial_2   5_ppi  12%
0  0.494404  0.246841  trial_1  10_ppi  12%
1  0.148489  0.549250  trial_1  10_ppi  12%
0  0.791440  0.372119  trial_2  10_ppi  12%
1  0.078047  0.552541  trial_2  10_ppi  12%

[16 rows x 5 columns]

(in reality there are hundreds of data points for each trial, not just two like shown here)

I would like to convert this to a wide dataframe with hierarchical columns, organized like so (i can't use markdown to make values/headers span multiple columns, so please forgive the screenshot)

enter image description here

I've tried so many things that I've lost track. I think pivot_table got me the closest to what I want, but it winds up aggregating the mean of the values instead of listing them out, like so:

Input:

df.pivot_table(columns=['rd', 'ppi', 'trial'])

Output:

rd          12%                                      7%                             
ppi      10_ppi               5_ppi              10_ppi              5_ppi          
trial   trial_1   trial_2   trial_1   trial_2   trial_1  trial_2   trial_1   trial_2
ext    0.619217  0.661812  0.652555  0.167241  0.340024  0.42324  0.565166  0.436858
load   0.397430  0.102965  0.528162  0.550871  0.206560  0.28524  0.731204  0.303079

[2 rows x 8 columns]

It is also possible that each set of data has a different number of data points (i.e. Trial 1 might contain 2 data points, while Trial 2 might have 3). I'm at my wits end. What is the magic command to turn this long dataframe object into a wide dataframe object with multiple hierarchical columns?

CodePudding user response:

This may not be the most efficient answer, but I think it is achieving your desired outcome and should work for any number of observations.

def reformat(df, column_name):
  groups = df.groupby(['rd','ppi','trial'])[column_name].apply(list)
  temp_1 = groups.reset_index(name = 'listvalues')


  # Make names for each of your new columns
  col_names = [column_name   str(i) for i in np.arange(len(temp_1['listvalues'][0]))]
  # Split listvalues into df where every item is its own column
  listvalues = pd.DataFrame(temp_1["listvalues"].to_list(), columns=[col_names])

  # Merge listvalues with the temp_1 df and get rid of the extra listvalues column
  df = temp_1.join(listvalues)
  del df['listvalues']

  df = df.pivot_table(columns=['rd', 'ppi', 'trial'])
  return df

# Do this twice and append to get the desired dataframe
df2 = reformat(df, 'ext')
df3 = reformat(df, 'load')
finaldf = df2.append(df3)

Where finaldf looks like:

enter image description here

CodePudding user response:

try this:

idx_cols = [*df][-1:-4:-1]
res = df.set_index(idx_cols   [df.groupby(idx_cols).cumcount()])
res = res.stack().unstack([0,1,2,-1])
print(res)

enter image description here

  • Related