Pandas df group by count elements


My dataframe looks like this:

import pandas as pd

# initialize list of lists
data = [[1998, 1998, 2002, 2003], [2001, 1999, 1993, 2003],
        [1998, 1999, 2003, 1994], [1998, 1997, 2003, 1993],
        [1999, 2001, 1996, 1999]]

df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

I would like to count, for each year, the number of occurrences as a percentage, so that the dataframe looks like this:

    1997    1998    1999
A   20%     80%     100%
B   30%     10%     0%
C   70%     10%     0%

I tried to use pandas groupby.
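
For context, a plain per-column count is easy to get; it is the normalization to percentages that needs the approaches below. A minimal sketch (the apply call is an assumed stand-in for the attempted groupby):

# raw occurrence counts per column -- not yet percentages
counts = df.apply(pd.Series.value_counts).fillna(0).astype(int)
print(counts.T)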

CodePudding user response:

The logic is not fully clear (the expected output does not seem to correspond to the provided input), but here are some approaches:

using crosstab

Percent per year

# reshape to long format: one row per (original column, year) pair
df2 = df.melt(value_name='year')

# cross-tabulate and normalize within each year (column-wise)
df2 = pd.crosstab(df2['variable'], df2['year'], normalize='columns').mul(100)

# or, normalizing manually:
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum()).mul(100)

Output:

year      1993   1994   1996   1997  1998  1999  2001   2002  2003
variable                                                          
A          0.0    0.0    0.0    0.0  75.0  25.0  50.0    0.0   0.0
B          0.0    0.0    0.0  100.0  25.0  50.0  50.0    0.0   0.0
C         50.0    0.0  100.0    0.0   0.0   0.0   0.0  100.0  50.0
D         50.0  100.0    0.0    0.0   0.0  25.0   0.0    0.0  50.0
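
If percent strings like those in the question are wanted instead of floats, the numeric table can be formatted afterwards. A small sketch (the one-decimal rounding is a choice, not part of the answer above):

# render the numeric percentages as strings such as '75.0%'
df2.round(1).astype(str) + '%'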

Percent per variable

df2 = df.melt(value_name='year')
pd.crosstab(df2['variable'], df2['year'], normalize='index').mul(100)

# or, normalizing manually:
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum(axis=1), axis=0).mul(100)

Output:

year      1993  1994  1996  1997  1998  1999  2001  2002  2003
variable                                                      
A          0.0   0.0   0.0   0.0  60.0  20.0  20.0   0.0   0.0
B          0.0   0.0   0.0  20.0  20.0  40.0  20.0   0.0   0.0
C         20.0   0.0  20.0   0.0   0.0   0.0   0.0  20.0  40.0
D         20.0  20.0   0.0   0.0   0.0  20.0   0.0   0.0  40.0

using groupby

(df.stack()                                        # long Series indexed by (row, column)
 .groupby(level=1)                                 # group by the original column name
 .apply(lambda s: s.value_counts(normalize=True))  # fraction of each year per column
 .unstack(fill_value=0)
 .mul(100)
 )

Output:

   1993  1994  1996  1997  1998  1999  2001  2002  2003
A   0.0   0.0   0.0   0.0  60.0  20.0  20.0   0.0   0.0
B   0.0   0.0   0.0  20.0  20.0  40.0  20.0   0.0   0.0
C  20.0   0.0  20.0   0.0   0.0   0.0   0.0  20.0  40.0
D  20.0  20.0   0.0   0.0   0.0  20.0   0.0   0.0  40.0
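
As a side note, reasonably recent pandas versions also expose value_counts directly on a SeriesGroupBy, which avoids the lambda. A sketch under that assumption:

# same result without apply, using SeriesGroupBy.value_counts
(df.stack()
 .groupby(level=1)
 .value_counts(normalize=True)
 .unstack(fill_value=0)
 .mul(100)
 )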

CodePudding user response:

Another option could be the following:

# get value_counts for each column as fractions
df2 = pd.concat([df[col].value_counts(normalize=True) for col in df.columns], axis=1)

# fill missing years with 0 and sort the year index
df2 = df2.fillna(0).sort_index()

# convert fractions to whole-number percentages, then to strings with a % sign
df2 = df2.mul(100).astype('int').astype('str') + '%'

# select the requested years and columns
df2.loc[1997:1999, 'A':'C'].T

Output:

  1997 1998 1999
A   0%  60%  20%
B  20%  20%  40%
C   0%   0%   0%

CodePudding user response:

melt + groupby + unstack

# count per (variable, year), divide by the total count per year,
# then pivot the years into columns
(df.melt().groupby(['variable', 'value']).size()
 / df.melt().groupby('value').size()).unstack(1)

Out[1]: 
value     1993  1994  1996  1997  1998  1999  2001  2002  2003
variable                                                      
A          NaN   NaN   NaN   NaN  0.75  0.25   0.5   NaN   NaN
B          NaN   NaN   NaN   1.0  0.25  0.50   0.5   NaN   NaN
C          0.5   NaN   1.0   NaN   NaN   NaN   NaN   1.0   0.5
D          0.5   1.0   NaN   NaN   NaN  0.25   NaN   NaN   0.5
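
To put this on the same 0-100 scale as the other answers, the NaNs can be filled and the fractions scaled; an optional post-processing sketch:

out = (df.melt().groupby(['variable', 'value']).size()
       / df.melt().groupby('value').size()).unstack(1)
out.fillna(0).mul(100)  # percentages, with 0 instead of NaN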