Home > front end >  Create a pivot table in pandas while adding up the number of occurrences in a column
Create a pivot table in pandas while adding up the number of occurrences in a column

Time:12-19

Setup

Suppose I have the following dataframe:

    fruit   color   region   subregion
0   banana  yellow  CANADA   TORONTO
1   pear    red     CANADA   MONTREAL
2   banana  red     CANADA   TORONTO
3   banana  yellow  CANADA   TORONTO
4   banana  yellow  CANADA   MONTREAL
5   apple   red     US       GEORGIA
6   banana  red     US       IOWA
7   banana  yellow  CANADA   MONTREAL
8   apple   red     CANADA   MONTREAL
9   banana  yellow  CANADA   MONTREAL

I want to pivot the dataframe to count the number of times each row pair (fruit, color) appears in each region and subregion.

fruit   color   US   CANADA   TORONTO   MONTREAL   GEORGIA   IOWA
apple   red     1    1        0          0         1         0
apple   yellow  0    0        0          0         0         0
pear    red     0    1        0          1         0         0
banana  yellow  0    5        2          3         0         0
banana  red     1    1        2          0         0         1

Attempts

If we ignore the subregion by dropping the column entirely, we can get pretty close by doing:

df.groupby(['fruit', 'color', 'region']).size().unstack(fill_value=0).reset_index()

which gets us:

region  fruit   color   US   CANADA   
0       apple   red     1    1        
1       apple   yellow  0    0        
2       pear    red     0    1        
3       banana  yellow  0    5        
4       banana  red     1    1       

It may not be the most elegant solution (and there is something going in with the indexing here that I don't quite understand) but then you could repeat this process and then join the two tables.

My question is: is there a more elegant, pandas-native way of doing the operation I'm describing?

CodePudding user response:

You can actually use pd.get_dummies for this.

new_df = pd.get_dummies(df.set_index(['fruit', 'color']), prefix='', prefix_sep='').groupby(level=[0, 1]).sum().reset_index()

Output:

>>> new_df
    fruit   color  CANADA  US  GEORGIA  IOWA  MONTREAL  TORONTO
0   apple     red       1   1        1     0         1        0
1  banana     red       1   1        0     1         0        1
2  banana  yellow       5   0        0     0         3        2
3    pear     red       1   0        0     0         1        0

Note that for each extra column you want to groupby (e.g. fruit, color, etc.), you'll need to add another item to the level parameter of the groupby call. So a more dynamic version would be:

cols = ['fruit', 'color']
new_df = pd.get_dummies(df.set_index(cols), prefix='', prefix_sep='').groupby(level=list(range(len(cols)))).sum().reset_index()

CodePudding user response:

A combination of melt and pivot_table is an option:

cols = ['fruit', 'color']
(df.melt(id_vars=cols)
   .pivot_table(index=cols,
                columns='value', 
                aggfunc='size', 
                fill_value=0)
   .rename_axis(columns=None)
   .reset_index()
)

    fruit   color  CANADA  GEORGIA  IOWA  MONTREAL  TORONTO  US
0   apple     red       1        1     0         1        0   1
1  banana     red       1        0     1         0        1   1
2  banana  yellow       5        0     0         3        2   0
3    pear     red       1        0     0         1        0   0

CodePudding user response:

Let's use melt to combine region and subregion into a single column, then groupby and unstack:

(
    df.melt(["fruit", "color"])
    .groupby(["fruit", "color", "value"])
    .size()
    .unstack(fill_value=0)
    .reindex(
        pd.MultiIndex.from_product([df["fruit"].unique(), df["color"].unique()], 
                                  names=['fruit', 'color']),
        fill_value=0,
    )
    .reset_index()
)

Output:

value   fruit   color  CANADA  GEORGIA  IOWA  MONTREAL  TORONTO  US
0      banana  yellow       5        0     0         3        2   0
1      banana     red       1        0     1         0        1   1
2        pear  yellow       0        0     0         0        0   0
3        pear     red       1        0     0         1        0   0
4       apple  yellow       0        0     0         0        0   0
5       apple     red       1        1     0         1        0   1
  • Related