I have a sample dataframe, df, where I would like to maintain all columns in my original dataset and create a new pivoted column based on existing dataset. (grouped by id) Original dataframe has about 80 columns
Data
s1 s2 id q122c q122a q122c2 q122a2 q222c q222a q222c2 q222a2 q322c q322a q322c2 q322a2
50 1000 aa 40 10 900 100 50 0 1000 0 60 -10 1100 -100
100 2000 bb 50 50 1500 500 75 25 1900 100 100 0 2000 0
Desired
s1 s2 id date consumed1 avail1 consumed2 avail2
50 1000 aa q122 40 10 900 100
50 1000 aa q222 50 0 1000 0
50 1000 aa q322 60 -10 1100 -100
100 2000 bb q122 50 50 1500 500
100 2000 bb q222 75 25 1900 100
100 2000 bb q322 100 0 2000 0
Doing
pd.wide_to_long(df.set_index(['stat1','stat2','id']).stack().groupby(level=[0,1,2,3]).agg(list).apply(pd.Series).unstack().stack(level=0).reset_index(),
stubnames = ['q122','q222'], i = ['stat1','stat2','id','level_3'],j = 'date',suffix='\\w ').stack().unstack(level=[-3,-2])
A SO member has provided a script that will take the data from wide to long, however, the output I am getting is not capturing all the columns needed.
Any suggestion is appreciated.
CodePudding user response:
There are a couple of ways to go about this:
One option is to convert the columns to a MultiIndex, then reshape with stack:
index = ['s1', 's2', 'id']
temp = df.set_index(index)
temp.columns = temp.columns.str.split(r"([a|c]\d?)", expand = True).droplevel(-1)
temp.columns.names = ['date', None]
(temp.stack(level='date')
.rename(columns={'c':'consumed1',
'a':'avail1',
'c2':'consumed2',
'a2':'avail2'})
.reset_index()
)
s1 s2 id date avail1 avail2 consumed1 consumed2
0 50 1000 aa q122 10 100 40 900
1 50 1000 aa q222 0 0 50 1000
2 50 1000 aa q322 -10 -100 60 1100
3 100 2000 bb q122 50 500 50 1500
4 100 2000 bb q222 25 100 75 1900
5 100 2000 bb q322 0 0 100 2000
with pd.wide_to_long
:
index = ['s1', 's2', 'id']
temp = df.set_index(index)
temp.columns = temp.columns.str.split(r"([a|c]\d?)").str[::-1].str[1:].str.join('_')
(pd.wide_to_long(temp.reset_index(),
stubnames = ['c', 'a', 'c2', 'a2'],
i = index,
j = 'date',
sep='_',
suffix='. ')
.rename(columns={'c':'consumed1',
'a':'avail1',
'c2':'consumed2',
'a2':'avail2'})
.reset_index()
)
s1 s2 id date consumed1 avail1 consumed2 avail2
0 50 1000 aa q122 40 10 900 100
1 50 1000 aa q222 50 0 1000 0
2 50 1000 aa q322 60 -10 1100 -100
3 100 2000 bb q122 50 50 1500 500
4 100 2000 bb q222 75 25 1900 100
5 100 2000 bb q322 100 0 2000 0
Another option is with pivot_longer
from pyjanitor
, that offers a simpler way for such complex reshaping :
#pip install pyjanitor
import pandas as pd
import janitor
(df.pivot_longer(slice('s1', 'id'),
names_to = ('date', '.value'),
names_pattern = r"([a-z]\d )(. )",
sort_by_appearance=True)
.rename(columns={'c':'consumed1',
'a':'avail1',
'c2':'consumed2',
'a2':'avail2'})
)
s1 s2 id date consumed1 avail1 consumed2 avail2
0 50 1000 aa q122 40 10 900 100
1 50 1000 aa q222 50 0 1000 0
2 50 1000 aa q322 60 -10 1100 -100
3 100 2000 bb q122 50 50 1500 500
4 100 2000 bb q222 75 25 1900 100
5 100 2000 bb q322 100 0 2000 0
The names_pattern
indicates how the columns will be reshaped. there are two values in names_to
: ('date', '.value'). There are also two regex groups in names_pattern
. The first regex group will be paired with date
to form a new column, while the second regex group will be paired with .value
to form new headers.