Home > front end >  Count NA and none-NA per group in pandas
Count NA and none-NA per group in pandas

Time:11-10

I assume this is a simple task for pandas but I don't get it.

I have data liket this

  Group   Val
0     A     0
1     A     1
2     A  <NA>
3     A     3
4     B     4
5     B  <NA>
6     B     6
7     B  <NA>

And I want to know the frequency of valid and invalid values in Val per group Group. This is the expected result.

       A   B   Total
Valid  3   2       5
NA     1   2       3

Here is code to generate that sample data.

#!/usr/bin/env python3
import pandas as pd

df = pd.DataFrame({
    'Group': list('AAAABBBB'),
    'Val': range(8)
})

# some values to NA
for idx in [2, 5, 7]:
    df.iloc[idx, 1] = pd.NA

print(df)

What I tried is something with grouping

>>> df.groupby('Group').agg(lambda x: x.isna())
                               Val
Group
A      [False, False, True, False]
B       [False, True, False, True]

>>> df.groupby('Group').apply(lambda x: x.isna())
   Group    Val
0  False  False
1  False  False
2  False   True
3  False  False
4  False  False
5  False   True
6  False  False
7  False   True

CodePudding user response:

You are close with using groupby and isna

new = df.groupby(['Group', df['Val'].isna().replace({True: 'NA', False: 'Valid'})])['Group'].count().unstack(level=0)
new['Total'] = new.sum(axis=1)
print(new)

Group  A  B  Total
Val               
NA     1  2      3
Valid  3  2      5

CodePudding user response:

here is one way to do it

# cross tab to take the summarize
# convert Val to NA or Valid depending on the value
df2=(pd.crosstab(df['Val'].isna().map({False: 'NA', True: 'Valid'}), 
                 df['Group'] )
     .reset_index() 
     .rename_axis(columns=None))

df2['Total']=df2.sum(axis=1, numeric_only=True) # add Total column
out=df2.set_index('Val') # set index to match expected output
out
    A   B   Total
Val             
NA  3   2   5
Valid   1   2   3

if you need both row and column total, then it'll be even simpler with crosstab

df2=(pd.crosstab(df['Val'].isna().map({False: 'NA', True: 'Valid'}), 
                 df['Group'],
                margins=True, margins_name='Total')
Group   A   B   Total
Val             
NA      3   2   5
Valid   1   2   3
Total   4   4   8

CodePudding user response:

Another possible solution, based on pandas.pivot_table and on the following ideas:

  1. Add a new column, status, which contains NA or Valid if the corresponding value is or is not NaN, respectively.

  2. Create a pivot table, using len as aggregation function.

  3. Add the Total column, by summing by rows.

(df.assign(status=np.where(df['Val'].isna(), 'NA', 'Valid'))
 .pivot_table(
     index='status', columns='Group', values='Val', 
     aggfunc=lambda x: len(x))
 .reset_index()
 .rename_axis(None, axis=1)
 .assign(Total = lambda x: x.sum(axis=1)))

Output:

  status  A  B  Total
0     NA  1  2      3
1  Valid  3  2      5
  • Related