Bins, Groupby and porcentage of subtotals in each group (Pandas)-CodePudding

I want to get the porcentage for each row from the subgroup total. First, I created bins for the dataframe, then I do the group by.

I have the following dataframe:

data = pd.DataFrame({'barrio':['Almagro','Palermo','Almagro','Almagro','Palermo','Palermo','Almagro','Almagro'],
                     'fuente':['A','A','B','B','B','A','B','A'], 
                     'valor':[1,6,5,3,5,10,8,4]})

bins = np.arange(0,data['valor'].max()   1, 2)
labels = ['-'.join(map(str,(x,y))) for x, y in zip(bins[:-1], bins[1:])]
data['bins'] = pd.cut(data['valor'], bins=bins,labels = labels, include_lowest=True)

data = data.groupby(['barrio','fuente','bins']).agg({'valor':'count'})
data = data.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))

The expected result would be:

                        valor
barrio   fuente bins    
Almagro     A   0-2     50.00
                2-4     50.00
                4-6     0.00
                6-8     0.00
                8-10    0.00
            B   0-2     0.00
                2-4     33.00
                4-6     33.00
                6-8     33.00
                8-10    0.00
Palermo     A   0-2     0.00
                2-4     0.00
                4-6     50.00
                6-8     0.00
                8-10    50.00
            B   0-2     0.00
                2-4     0.00
                4-6     100.00
                6-8     0.00
                8-10    0.00

But the result I get is:

            valor
barrio  fuente  bins    
Almagro     A   0-2     20.00
                2-4     20.00
                4-6     0.00
                6-8     0.00
                8-10    0.00
B               0-2     0.00
                2-4     20.00
                4-6     20.00
                6-8     20.00
                8-10    0.00
Palermo     A   0-2     0.00
                2-4     0.00
                4-6     33.33
                6-8     0.00
                8-10    33.33
B               0-2     0.00
                2-4     0.00
                4-6     33.33
                6-8     0.00
                8-10    0.00

How I can solve this problem?

thanks!!!

CodePudding user response：

You can use value_counts with normalize:

data.groupby(['barrio', 'fuente'])['bins'].value_counts(normalize=True)

Output:

barrio   fuente  bins
Almagro  A       0-2     0.500000
                 2-4     0.500000
         B       2-4     0.333333
                 4-6     0.333333
                 6-8     0.333333
Palermo  A       4-6     0.500000
                 8-10    0.500000
         B       4-6     1.000000
Name: bins, dtype: float64

Or you can use crosstab, which uses the above in the background:

pd.crosstab([data['barrio'],data['fuente']], 
            data['bins'], normalize='index'
           ).stack()

Output:

barrio   fuente  bins
Almagro  A       0-2     0.500000
                 2-4     0.500000
                 4-6     0.000000
                 6-8     0.000000
                 8-10    0.000000
         B       0-2     0.000000
                 2-4     0.333333
                 4-6     0.333333
                 6-8     0.333333
                 8-10    0.000000
Palermo  A       0-2     0.000000
                 2-4     0.000000
                 4-6     0.500000
                 6-8     0.000000
                 8-10    0.500000
         B       0-2     0.000000
                 2-4     0.000000
                 4-6     1.000000
                 6-8     0.000000
                 8-10    0.000000
dtype: float64