I received a dataset that include columns that previously transformed into one hot encoded. And I want to retrieve the old shape of it to do some preprocessing and filling NAs methods and of course read the stats model of the dataset.
The data columns I got:
team2_offensive_derived_var_0 | team2_offensive_derived_var_1 | team2_offensive_derived_var_2 | team2_offensive_derived_var_3 | team2_offensive_derived_var_4 | team2_offensive_derived_var_5 | team2_offensive_derived_var_6 | team2_offensive_derived_var_7 | team2_offensive_derived_var_8 | team2_offensive_derived_var_9 | team2_offensive_derived_var_10 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
I want to transform it's shape into
row_id | team2_offensive_derived |
---|---|
0 | var 10 |
1 | var 9 |
2 | var 8 |
3 | var 7 |
4 | var 6 |
5 | var 5 |
6 | var 4 |
7 | var 3 |
8 | var 2 |
9 | var 1 |
10 | var 0 |
I also got a columns like:
team2_other_ratio_var_42 | team2_other_ratio_var_43 | team2_other_ratio_var_44 | team2_other_ratio_var_45 | team2_other_ratio_var_46 | team2_other_ratio_var_47 | team2_other_ratio_var_48 | team2_other_ratio_var_49 | team2_other_ratio_var_50 | team2_other_ratio_var_51 | team2_other_ratio_var_52 |
---|---|---|---|---|---|---|---|---|---|---|
0.0 | 0.400 | 0.200 | 0.000 | 0.750 | 0.250 | 0.341121 | 0.375 | 0.354167 | 0.184211 | 0.000 |
But I'm confused how should I retrieve it to it's original shape? "Categorical" But I don't know how?
Thank you all for your help
CodePudding user response:
You can use a stack
:
cols = ['row_id', 'team2_offensive_derived']
out = df.replace(0, pd.NA).stack().rename_axis(cols).reset_index()[cols]
output:
row_id team2_offensive_derived
0 0 team2_offensive_derived_var_10
1 1 team2_offensive_derived_var_9
2 2 team2_offensive_derived_var_8
3 3 team2_offensive_derived_var_7
4 4 team2_offensive_derived_var_6
5 5 team2_offensive_derived_var_5
6 6 team2_offensive_derived_var_4
7 7 team2_offensive_derived_var_3
8 8 team2_offensive_derived_var_2
9 9 team2_offensive_derived_var_1
10 10 team2_offensive_derived_var_0
With different column names:
out = (df
.replace(0, pd.NA)
.rename(columns=lambda x: x.replace('team2_offensive_derived_', ''))
.stack()
.rename_axis(cols)
.reset_index()[cols]
)
output:
row_id team2_offensive_derived
0 0 var_10
1 1 var_9
2 2 var_8
3 3 var_7
4 4 var_6
5 5 var_5
6 6 var_4
7 7 var_3
8 8 var_2
9 9 var_1
10 10 var_0
CodePudding user response:
Use:
#split by last previous _
df.columns = df.columns.str.rsplit('_', 2, expand=True)
#replace 0 to NaNs, so reshape remove rows with 0
df = df.replace(0, np.nan).stack([1,2])
#join MultiIndex value
df.index = df.index.map(lambda x: f'{x[1]} {x[2]}')
#create DataFrame
df = df.index.to_frame(name='team2_offensive_derived',index=False)
print (df)
team2_offensive_derived
0 var 10
1 var 9
2 var 8
3 var 7
4 var 6
5 var 5
6 var 4
7 var 3
8 var 2
9 var 1
10 var 0