I have this dataframe and I want to transform it into another dataframe with a column which combines observations from several columns in the first dataframe and aggregates values from the column "points". Here's the dataframe and below is the desired result:
player_data = pd.DataFrame({"customer_id": ["100001", "100002", "100005", "100006", "100007", "100011", "100012",
"100013", "100022", "100023", "100025", "100028", "100029", "100030"],
"country": ["Austria", "Germany", "Germany", "Sweden", "Sweden", "Austria", "Sweden",
"Austria", "Germany", "Germany", "Austria", "Austria", "Germany", "Austria"],
"category": ["basic", "pro", "basic", "advanced", "pro", "intermidiate", "pro",
"basic", "intermidiate", "intermidiate", "advanced", "basic", "intermidiate", "basic"],
"gender": ["male", "male", "female", "female", "female", "male", "female",
"female", "male", "male", "female", "male", "male", "male"],
"age_group": ["20", "30", "20", "30", "40", "20", "40",
"20", "30", "30", "40", "20", "30", "20"],
"points": [200, 480, 180, 330, 440, 240, 520, 180, 320, 300, 320, 200, 280, 180]})
The new dataframe is supposed to look like this:
Thank you all!
CodePudding user response:
Would this be what you are looking for?
df_new = df.groupby(['country', 'category', 'gender', 'age_group'])['points'].agg('sum').reset_index()
df_new.pivot_table(values = 'points', index = ['country', 'category', 'gender'], columns = 'age_group', fill_value = 0).reset_index().sort_values(['country', 'category', 'gender'])
However, this will not have any columns that have only 0s for example Australia | Advanced | M will not be in here since there wasn't any mention in it for the original df. If you wanted to dynamically add them you might need to rethink the structure of your df.
CodePudding user response:
Try this:
midx = pd.MultiIndex.from_product([player_data['country'].unique(),
player_data['category'].unique(),
player_data['gender'].unique()])
player_data.groupby(['country', 'category', 'gender', 'age_group'])['points']\
.sum()\
.unstack(fill_value=0)\
.reindex(midx, fill_value=0)
Output:
age_group 20 30 40
Austria basic male 580 0 0
female 180 0 0
pro male 0 0 0
female 0 0 0
advanced male 0 0 0
female 0 0 320
intermidiate male 240 0 0
female 0 0 0
Germany basic male 0 0 0
female 180 0 0
pro male 0 480 0
female 0 0 0
advanced male 0 0 0
female 0 0 0
intermidiate male 0 900 0
female 0 0 0
Sweden basic male 0 0 0
female 0 0 0
pro male 0 0 0
female 0 0 960
advanced male 0 0 0
female 0 330 0
intermidiate male 0 0 0
female 0 0 0
CodePudding user response:
This works. Although the loops are a pretty janky way of sorting the zeros.
df = player_data.groupby(["country", "category", "gender", "age_group"]).points.sum().reset_index()
df = df.pivot_table(values='points', index=['country', 'category', 'gender'], columns='age_group', fill_value=0)
for country in player_data.country.unique():
for category in player_data.category.unique():
for gender in player_data.gender.unique():
if (country, category, gender) not in df.index:
df.loc[(country, category, gender)] = np.zeros(len(player_data.age_group.unique()), dtype=int)
df = df.sort_values(['country', 'category', 'gender']).reset_index()
Output:
age_group country category gender 20 30 40
0 Austria advanced female 0 0 320
1 Austria advanced male 0 0 0
2 Austria basic female 180 0 0
3 Austria basic male 580 0 0
4 Austria intermidiate female 0 0 0
5 Austria intermidiate male 240 0 0
6 Austria pro female 0 0 0
7 Austria pro male 0 0 0
8 Germany advanced female 0 0 0
9 Germany advanced male 0 0 0
10 Germany basic female 180 0 0
11 Germany basic male 0 0 0
12 Germany intermidiate female 0 0 0
13 Germany intermidiate male 0 900 0
14 Germany pro female 0 0 0
15 Germany pro male 0 480 0
16 Sweden advanced female 0 3...