Setup
Suppose I have the following dataframe:
fruit color region subregion
0 banana yellow CANADA TORONTO
1 pear red CANADA MONTREAL
2 banana red CANADA TORONTO
3 banana yellow CANADA TORONTO
4 banana yellow CANADA MONTREAL
5 apple red US GEORGIA
6 banana red US IOWA
7 banana yellow CANADA MONTREAL
8 apple red CANADA MONTREAL
9 banana yellow CANADA MONTREAL
I want to pivot the dataframe to count the number of times each row pair (fruit, color)
appears in each region and subregion.
fruit color US CANADA TORONTO MONTREAL GEORGIA IOWA
apple red 1 1 0 0 1 0
apple yellow 0 0 0 0 0 0
pear red 0 1 0 1 0 0
banana yellow 0 5 2 3 0 0
banana red 1 1 2 0 0 1
Attempts
If we ignore the subregion
by dropping the column entirely, we can get pretty close by doing:
df.groupby(['fruit', 'color', 'region']).size().unstack(fill_value=0).reset_index()
which gets us:
region fruit color US CANADA
0 apple red 1 1
1 apple yellow 0 0
2 pear red 0 1
3 banana yellow 0 5
4 banana red 1 1
It may not be the most elegant solution (and there is something going in with the indexing here that I don't quite understand) but then you could repeat this process and then join the two tables.
My question is: is there a more elegant, pandas-native way of doing the operation I'm describing?
CodePudding user response:
You can actually use pd.get_dummies
for this.
new_df = pd.get_dummies(df.set_index(['fruit', 'color']), prefix='', prefix_sep='').groupby(level=[0, 1]).sum().reset_index()
Output:
>>> new_df
fruit color CANADA US GEORGIA IOWA MONTREAL TORONTO
0 apple red 1 1 1 0 1 0
1 banana red 1 1 0 1 0 1
2 banana yellow 5 0 0 0 3 2
3 pear red 1 0 0 0 1 0
Note that for each extra column you want to groupby (e.g. fruit
, color
, etc.), you'll need to add another item to the level
parameter of the groupby
call. So a more dynamic version would be:
cols = ['fruit', 'color']
new_df = pd.get_dummies(df.set_index(cols), prefix='', prefix_sep='').groupby(level=list(range(len(cols)))).sum().reset_index()
CodePudding user response:
A combination of melt
and pivot_table
is an option:
cols = ['fruit', 'color']
(df.melt(id_vars=cols)
.pivot_table(index=cols,
columns='value',
aggfunc='size',
fill_value=0)
.rename_axis(columns=None)
.reset_index()
)
fruit color CANADA GEORGIA IOWA MONTREAL TORONTO US
0 apple red 1 1 0 1 0 1
1 banana red 1 0 1 0 1 1
2 banana yellow 5 0 0 3 2 0
3 pear red 1 0 0 1 0 0
CodePudding user response:
Let's use melt to combine region and subregion into a single column, then groupby and unstack:
(
df.melt(["fruit", "color"])
.groupby(["fruit", "color", "value"])
.size()
.unstack(fill_value=0)
.reindex(
pd.MultiIndex.from_product([df["fruit"].unique(), df["color"].unique()],
names=['fruit', 'color']),
fill_value=0,
)
.reset_index()
)
Output:
value fruit color CANADA GEORGIA IOWA MONTREAL TORONTO US
0 banana yellow 5 0 0 3 2 0
1 banana red 1 0 1 0 1 1
2 pear yellow 0 0 0 0 0 0
3 pear red 1 0 0 1 0 0
4 apple yellow 0 0 0 0 0 0
5 apple red 1 1 0 1 0 1