Home > database >  Need to retrieve original data shape from one hot encoding shape
Need to retrieve original data shape from one hot encoding shape

Time:06-24

I received a dataset that include columns that previously transformed into one hot encoded. And I want to retrieve the old shape of it to do some preprocessing and filling NAs methods and of course read the stats model of the dataset.

The data columns I got:

team2_offensive_derived_var_0 team2_offensive_derived_var_1 team2_offensive_derived_var_2 team2_offensive_derived_var_3 team2_offensive_derived_var_4 team2_offensive_derived_var_5 team2_offensive_derived_var_6 team2_offensive_derived_var_7 team2_offensive_derived_var_8 team2_offensive_derived_var_9 team2_offensive_derived_var_10
0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0

I want to transform it's shape into

row_id team2_offensive_derived
0 var 10
1 var 9
2 var 8
3 var 7
4 var 6
5 var 5
6 var 4
7 var 3
8 var 2
9 var 1
10 var 0

I also got a columns like:

team2_other_ratio_var_42 team2_other_ratio_var_43 team2_other_ratio_var_44 team2_other_ratio_var_45 team2_other_ratio_var_46 team2_other_ratio_var_47 team2_other_ratio_var_48 team2_other_ratio_var_49 team2_other_ratio_var_50 team2_other_ratio_var_51 team2_other_ratio_var_52
0.0 0.400 0.200 0.000 0.750 0.250 0.341121 0.375 0.354167 0.184211 0.000

But I'm confused how should I retrieve it to it's original shape? "Categorical" But I don't know how?

Thank you all for your help

CodePudding user response:

You can use a stack:

cols = ['row_id', 'team2_offensive_derived']
out = df.replace(0, pd.NA).stack().rename_axis(cols).reset_index()[cols]

output:

    row_id         team2_offensive_derived
0        0  team2_offensive_derived_var_10
1        1   team2_offensive_derived_var_9
2        2   team2_offensive_derived_var_8
3        3   team2_offensive_derived_var_7
4        4   team2_offensive_derived_var_6
5        5   team2_offensive_derived_var_5
6        6   team2_offensive_derived_var_4
7        7   team2_offensive_derived_var_3
8        8   team2_offensive_derived_var_2
9        9   team2_offensive_derived_var_1
10      10   team2_offensive_derived_var_0

With different column names:

out = (df
       .replace(0, pd.NA)
       .rename(columns=lambda x: x.replace('team2_offensive_derived_', ''))
       .stack()
       .rename_axis(cols)
       .reset_index()[cols]
      )

output:

    row_id team2_offensive_derived
0        0                  var_10
1        1                   var_9
2        2                   var_8
3        3                   var_7
4        4                   var_6
5        5                   var_5
6        6                   var_4
7        7                   var_3
8        8                   var_2
9        9                   var_1
10      10                   var_0

CodePudding user response:

Use:

#split by last previous _
df.columns = df.columns.str.rsplit('_', 2, expand=True)
#replace 0 to NaNs, so reshape remove rows with 0
df = df.replace(0, np.nan).stack([1,2])
#join MultiIndex value
df.index = df.index.map(lambda x: f'{x[1]} {x[2]}')
#create DataFrame
df = df.index.to_frame(name='team2_offensive_derived',index=False)

print (df)
   team2_offensive_derived
0                   var 10
1                    var 9
2                    var 8
3                    var 7
4                    var 6
5                    var 5
6                    var 4
7                    var 3
8                    var 2
9                    var 1
10                   var 0
  • Related