I have two dataframes in Python, one with customer data and one with lists of purchased items. I want to create a table that shows the gender ratio for each item and median age.
Table 1 (which items that each customer owns) :
Customer_ID | Item |
---|---|
1 | [4, 99, 833] |
2 | [4, 44, 5 383, 11, 1000] |
3 | [3, 1, 5] |
4 | [40, 334, 884, 1] |
Table 2 (Customer Data):
Customer_ID | Gender | Age |
---|---|---|
1 | M |
50 |
2 | F |
40 |
3 | M |
33 |
4 | F |
29 |
Desired Output:
Item | Gender_Distribution_M [%] | Gender_Distribution_F [%] | Median_Age |
---|---|---|---|
1 | x% | y% | z |
..... | . | . | . |
One approach that I can think of is to do One-Hot encoding for all the items, but is there any easier way?
CodePudding user response:
IIUC, this should work for you:
df1 = pd.DataFrame({'Customer_ID': range(1, 5), 'Item': [[4, 99, 833], [4, 44, 5, 383, 11, 1000], [3, 1, 5], [40, 334, 884, 1]]})
df2 = pd.DataFrame({'Customer_ID': range(1, 5), 'Gender': ['M', 'F', 'M', 'F'], 'Age': [50, 40, 33, 29]})
df3 = pd.merge(df1.explode('Item'), df2, how='left', on='Customer_ID')
df = pd.DataFrame()
df['Gender_Distribution_M [%]'] = df3.groupby('Item').Gender.value_counts().unstack().M.div(df3.groupby('Item').Gender.value_counts().unstack().M.add(df3.groupby('Item').Gender.value_counts().unstack().F, fill_value=0), fill_value=0).mul(100)
df['Gender_Distribution_F [%]'] = df3.groupby('Item').Gender.value_counts().unstack().F.div(df3.groupby('Item').Gender.value_counts().unstack().M.add(df3.groupby('Item').Gender.value_counts().unstack().F, fill_value=0), fill_value=0).mul(100)
df['Median_Age'] = df3.groupby('Item').Age.median()
print(df)
Output:
Gender_Distribution_M [%] Gender_Distribution_F [%] Median_Age
Item
1 50.0 50.0 31.0
3 100.0 0.0 33.0
4 50.0 50.0 45.0
5 50.0 50.0 36.5
11 0.0 100.0 40.0
40 0.0 100.0 29.0
44 0.0 100.0 40.0
99 100.0 0.0 50.0
334 0.0 100.0 29.0
383 0.0 100.0 40.0
833 100.0 0.0 50.0
884 0.0 100.0 29.0
1000 0.0 100.0 40.0