I assume this is a simple task for pandas
but I don't get it.
I have data liket this
Group Val
0 A 0
1 A 1
2 A <NA>
3 A 3
4 B 4
5 B <NA>
6 B 6
7 B <NA>
And I want to know the frequency of valid and invalid values in Val
per group Group
. This is the expected result.
A B Total
Valid 3 2 5
NA 1 2 3
Here is code to generate that sample data.
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame({
'Group': list('AAAABBBB'),
'Val': range(8)
})
# some values to NA
for idx in [2, 5, 7]:
df.iloc[idx, 1] = pd.NA
print(df)
What I tried is something with grouping
>>> df.groupby('Group').agg(lambda x: x.isna())
Val
Group
A [False, False, True, False]
B [False, True, False, True]
>>> df.groupby('Group').apply(lambda x: x.isna())
Group Val
0 False False
1 False False
2 False True
3 False False
4 False False
5 False True
6 False False
7 False True
CodePudding user response:
You are close with using groupby and isna
new = df.groupby(['Group', df['Val'].isna().replace({True: 'NA', False: 'Valid'})])['Group'].count().unstack(level=0)
new['Total'] = new.sum(axis=1)
print(new)
Group A B Total
Val
NA 1 2 3
Valid 3 2 5
CodePudding user response:
here is one way to do it
# cross tab to take the summarize
# convert Val to NA or Valid depending on the value
df2=(pd.crosstab(df['Val'].isna().map({False: 'NA', True: 'Valid'}),
df['Group'] )
.reset_index()
.rename_axis(columns=None))
df2['Total']=df2.sum(axis=1, numeric_only=True) # add Total column
out=df2.set_index('Val') # set index to match expected output
out
A B Total
Val
NA 3 2 5
Valid 1 2 3
if you need both row and column total, then it'll be even simpler with crosstab
df2=(pd.crosstab(df['Val'].isna().map({False: 'NA', True: 'Valid'}),
df['Group'],
margins=True, margins_name='Total')
Group A B Total
Val
NA 3 2 5
Valid 1 2 3
Total 4 4 8
CodePudding user response:
Another possible solution, based on pandas.pivot_table
and on the following ideas:
Add a new column,
status
, which containsNA
orValid
if the corresponding value is or is notNaN
, respectively.Create a pivot table, using
len
as aggregation function.Add the
Total
column, by summing by rows.
(df.assign(status=np.where(df['Val'].isna(), 'NA', 'Valid'))
.pivot_table(
index='status', columns='Group', values='Val',
aggfunc=lambda x: len(x))
.reset_index()
.rename_axis(None, axis=1)
.assign(Total = lambda x: x.sum(axis=1)))
Output:
status A B Total
0 NA 1 2 3
1 Valid 3 2 5