Pandas df group by count elements


My dataframe looks like this:

import pandas as pd

# initialize list of lists
data = [[1998, 1998, 2002, 2003], [2001, 1999, 1993, 2003],
        [1998, 1999, 2003, 1994], [1998, 1997, 2003, 1993],
        [1999, 2001, 1996, 1999]]

df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

I would like to count, for each year, the number of occurrences as a percentage, so that the dataframe looks like this:

    1997    1998    1999
A   20%     80%     100%
B   30%     10%     0%
C   70%     10%     0%

I tried to use pandas groupby.
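
For context, a plain per-column count is easy to get; it is the normalization to percentages that needs the approaches below. A minimal sketch (the apply call is an assumed stand-in for the attempted groupby):

# raw occurrence counts per column -- not yet percentages
counts = df.apply(pd.Series.value_counts).fillna(0).astype(int)
print(counts.T)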

CodePudding user response:

The logic is not fully clear (the expected output does not seem to correspond to the provided input), but here are some approaches:

using crosstab

Percent per year

# reshape to long format: one row per (original column, year) pair
df2 = df.melt(value_name='year')

# cross-tabulate and normalize within each year (column-wise)
df2 = pd.crosstab(df2['variable'], df2['year'], normalize='columns').mul(100)

# or, normalizing manually:
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum()).mul(100)

Output:

year      1993   1994   1996   1997  1998  1999  2001   2002  2003
variable                                                          
A          0.0    0.0    0.0    0.0  75.0  25.0  50.0    0.0   0.0
B          0.0    0.0    0.0  100.0  25.0  50.0  50.0    0.0   0.0
C         50.0    0.0  100.0    0.0   0.0   0.0   0.0  100.0  50.0
D         50.0  100.0    0.0    0.0   0.0  25.0   0.0    0.0  50.0
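
If percent strings like those in the question are wanted instead of floats, the numeric table can be formatted afterwards. A small sketch (the one-decimal rounding is a choice, not part of the answer above):

# render the numeric percentages as strings such as '75.0%'
df2.round(1).astype(str) + '%'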

Percent per variable

df2 = df.melt(value_name='year')
pd.crosstab(df2['variable'], df2['year'], normalize='index').mul(100)

# or, normalizing manually:
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum(axis=1), axis=0).mul(100)

Output:

year      1993  1994  1996  1997  1998  1999  2001  2002  2003
variable                                                      
A          0.0   0.0   0.0   0.0  60.0  20.0  20.0   0.0   0.0
B          0.0   0.0   0.0  20.0  20.0  40.0  20.0   0.0   0.0
C         20.0   0.0  20.0   0.0   0.0   0.0   0.0  20.0  40.0
D         20.0  20.0   0.0   0.0   0.0  20.0   0.0   0.0  40.0

using groupby

(df.stack()                                        # long Series indexed by (row, column)
 .groupby(level=1)                                 # group by the original column name
 .apply(lambda s: s.value_counts(normalize=True))  # fraction of each year per column
 .unstack(fill_value=0)
 .mul(100)
 )

Output:

   1993  1994  1996  1997  1998  1999  2001  2002  2003
A   0.0   0.0   0.0   0.0  60.0  20.0  20.0   0.0   0.0
B   0.0   0.0   0.0  20.0  20.0  40.0  20.0   0.0   0.0
C  20.0   0.0  20.0   0.0   0.0   0.0   0.0  20.0  40.0
D  20.0  20.0   0.0   0.0   0.0  20.0   0.0   0.0  40.0
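
As a side note, reasonably recent pandas versions also expose value_counts directly on a SeriesGroupBy, which avoids the lambda. A sketch under that assumption:

# same result without apply, using SeriesGroupBy.value_counts
(df.stack()
 .groupby(level=1)
 .value_counts(normalize=True)
 .unstack(fill_value=0)
 .mul(100)
 )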

CodePudding user response:

Another option could be the following:

# get value_counts for each column as fractions
df2 = pd.concat([df[col].value_counts(normalize=True) for col in df.columns], axis=1)

# fill missing years with 0 and sort the year index
df2 = df2.fillna(0).sort_index()

# convert fractions to whole-number percentages, then to strings with a % sign
df2 = df2.mul(100).astype('int').astype('str') + '%'

# select the requested years and columns
df2.loc[1997:1999, 'A':'C'].T

Output:

  1997 1998 1999
A   0%  60%  20%
B  20%  20%  40%
C   0%   0%   0%

CodePudding user response:

melt + groupby + unstack

# count per (variable, year), divide by the total count per year,
# then pivot the years into columns
(df.melt().groupby(['variable', 'value']).size()
 / df.melt().groupby('value').size()).unstack(1)

Out[1]: 
value     1993  1994  1996  1997  1998  1999  2001  2002  2003
variable                                                      
A          NaN   NaN   NaN   NaN  0.75  0.25   0.5   NaN   NaN
B          NaN   NaN   NaN   1.0  0.25  0.50   0.5   NaN   NaN
C          0.5   NaN   1.0   NaN   NaN   NaN   NaN   1.0   0.5
D          0.5   1.0   NaN   NaN   NaN  0.25   NaN   NaN   0.5
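
To put this on the same 0-100 scale as the other answers, the NaNs can be filled and the fractions scaled; an optional post-processing sketch:

out = (df.melt().groupby(['variable', 'value']).size()
       / df.melt().groupby('value').size()).unstack(1)
out.fillna(0).mul(100)  # percentages, with 0 instead of NaN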