Home > Enterprise >  How to create a gender ratio table without one-hot encoding
How to create a gender ratio table without one-hot encoding

Time:07-28

I have two dataframes in Python, one with customer data and one with lists of purchased items. I want to create a table that shows the gender ratio for each item and median age.

Table 1 (which items that each customer owns) :

Customer_ID Item
1 [4, 99, 833]
2 [4, 44, 5 383, 11, 1000]
3 [3, 1, 5]
4 [40, 334, 884, 1]

Table 2 (Customer Data):

Customer_ID Gender Age
1 M 50
2 F 40
3 M 33
4 F 29

Desired Output:

Item Gender_Distribution_M [%] Gender_Distribution_F [%] Median_Age
1 x% y% z
..... . . .

One approach that I can think of is to do One-Hot encoding for all the items, but is there any easier way?

CodePudding user response:

IIUC, this should work for you:

df1 = pd.DataFrame({'Customer_ID': range(1, 5), 'Item': [[4, 99, 833], [4, 44, 5, 383, 11, 1000], [3, 1, 5], [40, 334, 884, 1]]})
df2 = pd.DataFrame({'Customer_ID': range(1, 5), 'Gender': ['M', 'F', 'M', 'F'], 'Age': [50, 40, 33, 29]})
df3 = pd.merge(df1.explode('Item'), df2, how='left', on='Customer_ID')

df = pd.DataFrame()
df['Gender_Distribution_M [%]'] = df3.groupby('Item').Gender.value_counts().unstack().M.div(df3.groupby('Item').Gender.value_counts().unstack().M.add(df3.groupby('Item').Gender.value_counts().unstack().F, fill_value=0), fill_value=0).mul(100)
df['Gender_Distribution_F [%]'] = df3.groupby('Item').Gender.value_counts().unstack().F.div(df3.groupby('Item').Gender.value_counts().unstack().M.add(df3.groupby('Item').Gender.value_counts().unstack().F, fill_value=0), fill_value=0).mul(100)
df['Median_Age'] = df3.groupby('Item').Age.median()

print(df)

Output:

      Gender_Distribution_M [%]  Gender_Distribution_F [%]  Median_Age
Item                                                                  
1                          50.0                       50.0        31.0
3                         100.0                        0.0        33.0
4                          50.0                       50.0        45.0
5                          50.0                       50.0        36.5
11                          0.0                      100.0        40.0
40                          0.0                      100.0        29.0
44                          0.0                      100.0        40.0
99                        100.0                        0.0        50.0
334                         0.0                      100.0        29.0
383                         0.0                      100.0        40.0
833                       100.0                        0.0        50.0
884                         0.0                      100.0        29.0
1000                        0.0                      100.0        40.0
  • Related