I have a dataframe where I would like to maintain all columns in my original dataset and create a new pivoted column based on existing dataset.
Data
stat1 stat2 id q122con q122av q122con q122av q222con q222av q222con q222av
50 1000 aa 40 10 900 100 50 0 1000 0
100 2000 bb 50 50 1500 500 75 25 1900 100
Desired
stat1 stat2 id date con av con av
50 1000 aa q122 40 10 900 100
50 1000 aa q222 50 0 1000 0
100 2000 bb q122 50 50 1500 500
100 2000 bb q222 75 25 1900 100
Doing
df.pivot(index="id", columns="date", values=["con", "av"])
However, I am not obtaining the full columns within my dataset. Any suggestion is appreciated.
CodePudding user response:
Some long one line by wide_to_long
pd.wide_to_long(df.set_index(['stat1','stat2','id']).stack().groupby(level=[0,1,2,3]).agg(list).apply(pd.Series).unstack().stack(level=0).reset_index(),
stubnames = ['q122','q222'], i = ['stat1','stat2','id','level_3'],j = 'date',suffix='\\w ').stack().unstack(level=[-3,-2])
Out[140]:
level_3 0 1
date av con av con
stat1 stat2 id
50 1000 aa q122 10 40 100 900
q222 0 50 0 1000
100 2000 bb q122 50 50 500 1500
q222 25 75 100 1900
CodePudding user response:
Lot of your issues here is dealing with duplicate column names:
import pandas as pd
# Duplicating input dataframe with clipboard and remove dot numbers assign for duplicate column headers
df = pd.read_clipboard()
df.columns = df.columns.str.split('.').str[0]
# Set index to move first three columns into index
df = df.set_index(['stat1','stat2','id'])
# Use groupby and cumcount to get order of duplicate column headers
cols = df.groupby(df.columns, axis=1).cumcount().rename('No').reset_index()
# Use str.extract to split "dates" from av and con with regex
cols = cols['index'].str.extract('(q\d{3})(.*)').join(cols).drop('index', axis=1)
# Create a new multiIndex column header
df.columns = pd.MultiIndex.from_frame(cols, names=['date','av','con'])
# Reshape dataframe by stacking the outer most column header to the dataframe index
# And moved those columns from the index back into the dataframe with reset_index
df_out = df.stack(0).reset_index()
# Flatten headers back to one level
df_out.columns = [f'{i}_{j}' if j else f'{i}' for i, j in df_out.columns]
# And print
print(df_out)
Output:
stat1 stat2 id date av av_1 con con_1
0 50 1000 aa q122 10 100 40 900
1 50 1000 aa q222 0 0 50 1000
2 100 2000 bb q122 50 500 50 1500
3 100 2000 bb q222 25 100 75 1900