My dataframe looks like this.
# initialize list of lists
data = [[1998, 1998,2002,2003], [2001, 1999,1993,2003], [1998, 1999,2003,1994], [1998,1997,2003,1993], [1999,2001,1996, 1999]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
I would like to count for each date the number of occurrences in %. Such that the dataframe looks like this:
1997 1998 1999
A 20% 80% 100%
B 30% 10% 0%
C 70% 10% 0%
I tried to use Pandas group-by.
CodePudding user response:
The logic is not fully clear (since it looks that the provided output is not the real one corresponding to the provided input), but here are some approaches:
using crosstab
Percent per year
df2 = df.melt(value_name='year')
df2 = pd.crosstab(df2['variable'], df2['year'], normalize='columns').mul(100)
# or
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum()).mul(100)
Output:
year 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A 0.0 0.0 0.0 0.0 75.0 25.0 50.0 0.0 0.0
B 0.0 0.0 0.0 100.0 25.0 50.0 50.0 0.0 0.0
C 50.0 0.0 100.0 0.0 0.0 0.0 0.0 100.0 50.0
D 50.0 100.0 0.0 0.0 0.0 25.0 0.0 0.0 50.0
Percent per variable
df2 = df.melt(value_name='year')
pd.crosstab(df2['variable'], df2['year'], normalize='index').mul(100)
# or
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum(1), axis=0).mul(100)
Output:
year 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A 0.0 0.0 0.0 0.0 60.0 20.0 20.0 0.0 0.0
B 0.0 0.0 0.0 20.0 20.0 40.0 20.0 0.0 0.0
C 20.0 0.0 20.0 0.0 0.0 0.0 0.0 20.0 40.0
D 20.0 20.0 0.0 0.0 0.0 20.0 0.0 0.0 40.0
using groupby
(df.stack()
.groupby(level=1)
.apply(lambda s: s.value_counts(normalize=True))
.unstack(fill_value=0)
.mul(100)
)
Output:
1993 1994 1996 1997 1998 1999 2001 2002 2003
A 0.0 0.0 0.0 0.0 60.0 20.0 20.0 0.0 0.0
B 0.0 0.0 0.0 20.0 20.0 40.0 20.0 0.0 0.0
C 20.0 0.0 20.0 0.0 0.0 0.0 0.0 20.0 40.0
D 20.0 20.0 0.0 0.0 0.0 20.0 0.0 0.0 40.0
CodePudding user response:
Another option could be the following:
# getting value_counts for each column
df2 = pd.concat([df[col].value_counts(normalize=True) for col in df.columns], axis=1)
# filling null values with 0
df2.fillna(0, inplace=True)
# transforming to string and adding %
df2 = df2.astype('int').astype('str') '%'
# getting your output
df2.loc['1997':'1999', 'A':'C'].T
Output:
1997 1998 1999
A 20% 80% 100%
B 30% 10% 0%
C 70% 10% 0%
CodePudding user response:
melt groupby unstack
(df.melt().groupby(['variable', 'value']).size()
/ df.melt().groupby('value').size()).unstack(1)
Out[1]:
value 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A NaN NaN NaN NaN 0.75 0.25 0.5 NaN NaN
B NaN NaN NaN 1.0 0.25 0.50 0.5 NaN NaN
C 0.5 NaN 1.0 NaN NaN NaN NaN 1.0 0.5
D 0.5 1.0 NaN NaN NaN 0.25 NaN NaN 0.5