I have several columns in a pandas DataFrame that I want to reshape from wide form to long form. The reshape should double the number of rows and add a new column indicating which group each row originally came from.
I have the following dataframe df, where columns a1, b1, c1, and d1 all belong to one group:
name a1 b1 c1 d1 a2 b2 c2 d2
joe x y x y z e e f
lily x o x y z o e f
john o y x q z f e q
I want to transform it into the following final table, with a new column indicating which group the values came from:
name a1 b1 c1 d1 new_col
joe x y x y group1
lily x o x y group1
john o y x q group1
joe z e e f group2
lily z o e f group2
john z f e q group2
I've tried using melt functions but can't seem to figure out how to do it for multiple variable pairs. For instance, I can do it for 2 columns but not all 8:
import pandas as pd
pd.melt(df, id_vars = 'name', var_name = 'a_var', value_vars = ['a1', 'a2'])
which results in
name a_var value
joe a1 x
lily a1 x
john a1 o
joe a2 z
lily a2 z
john a2 z
CodePudding user response:
Use wide_to_long, then prefix the new column's values with "group":
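# reset_index() adds an 'index' column so that i=['index', 'name'] uniquely identifies each row
df = (pd.wide_to_long(df.reset_index(),
                      stubnames=['a','b','c','d'], i=['index','name'], j='new_col')
        .droplevel(0)
        .reset_index())
# the numeric suffix is parsed as an integer, so convert it to str before prefixing
df['new_col'] = 'group' + df['new_col'].astype(str)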
print (df)
name new_col a b c d
0 joe group1 x y x y
1 joe group2 z e e f
2 lily group1 x o x y
3 lily group2 z o e f
4 john group1 o y x q
5 john group2 z f e q
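For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the question's data and applies the same wide_to_long idea. It assumes the values in name are unique, which lets name serve as the id column on its own without the reset_index() step; the variable name long_df is only illustrative. Row order may vary between pandas versions, so sort afterwards if the order matters.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["joe", "lily", "john"],
    "a1": ["x", "x", "o"], "b1": ["y", "o", "y"],
    "c1": ["x", "x", "x"], "d1": ["y", "y", "q"],
    "a2": ["z", "z", "z"], "b2": ["e", "o", "f"],
    "c2": ["e", "e", "e"], "d2": ["f", "f", "q"],
})

# Assumption: 'name' is unique, so it can be the sole id column.
long_df = (
    pd.wide_to_long(df, stubnames=["a", "b", "c", "d"], i="name", j="new_col")
      .reset_index()
)

# The numeric suffix (1, 2) is parsed as an integer; convert before prefixing.
long_df["new_col"] = "group" + long_df["new_col"].astype(str)

print(long_df)
```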
EDIT: If the column names use a separator and a non-numeric suffix instead:
print (df)
name var1_c var2_c var3_c var4_c var1_t var2_t var3_t var4_t
0 joe x y x y z e e f
1 lily x o x y z o e f
2 john o y x q z f e q
df = (pd.wide_to_long(df.reset_index(),
                      stubnames=['var1','var2','var3','var4'],
                      i=['index','name'],
                      j='new_col',
                      suffix=r'\w+',   # the default suffix only matches digits
                      sep='_')
        .droplevel(0)
        .reset_index())
df['new_col'] = 'group' + df['new_col'].astype(str)
print (df)
name new_col var1 var2 var3 var4
0 joe groupc x y x y
1 joe groupt z e e f
2 lily groupc x o x y
3 lily groupt z o e f
4 john groupc o y x q
5 john groupt z f e q
Or:
df1 = df.set_index('name')
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack().rename_axis(['name','new_col']).reset_index()
df1['new_col'] = 'group' + df1['new_col'].astype(str)
print (df1)
name new_col var1 var2 var3 var4
0 joe groupc x y x y
1 joe groupt z e e f
2 lily groupc x o x y
3 lily groupt z o e f
4 john groupc o y x q
5 john groupt z f e q
CodePudding user response:
One option is pivot_longer from pyjanitor. For this use case, pass a regular expression with capture groups to names_pattern; each group is paired with the corresponding entry in names_to. Here we want to keep the first part of the column label as a column header, so we use .value as the placeholder for it:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index = 'name',
     names_to = ('.value', 'new_col'),
     names_pattern = r"(.)(.)")
 .assign(new_col = lambda df: 'group' + df.new_col)
)
name new_col a b c d
0 joe group1 x y x y
1 lily group1 x o x y
2 john group1 o y x q
3 joe group2 z e e f
4 lily group2 z o e f
5 john group2 z f e q