Home > Mobile >  Filtering data-frame columns using regex, then using .groupby to calculate sum
Filtering data-frame columns using regex, then using .groupby to calculate sum

Time:12-09

I have a dataframe which I want to group, filter columns by regex, and then sum.

My code looks like this:

import pandas as pd

df = pd.DataFrame({'ID':[1,1,2,2,3,3], 
                   'Invasive' : [12,1,1,0,1,0], 
                   'invasive': [1,4,5,3,4,6],
                   'Wild':[4,7,1,0,0,0],
                   'wild':[0,0,9,8,3,2],
                   'Crop':[0,0,0,0,0,0],
                   'Crop_2':[2,3,2,2,1,2]})

df.groupby(['ID']).filter(regex='(Invasive)|(invasive)|(Wild)|(wild)').sum()

The error message I get is:

DataFrameGroupBy.filter() missing 1 required positional argument: 'func'

I get the same Err message if groupby comes after filter

Why does this happen? Where do I input the func argument?

EDIT:

My Expected output is one column that has summed across the filtered columns and is grouped by ID. E.g.:

   ID  Output
0   1      29
1   2      27
2   3      16

CodePudding user response:

What you want to do doesn't make sense, groupby.filter is to filter rows, not to be confused with DataFrame.filter.

You likely want to filter the columns, then to aggregate:

df.filter(regex='(?i)(Invasive|Wild)').groupby(df['ID']).sum()

NB. I replaced (Invasive)|(invasive)|(Wild)|(wild) by (?i)(Invasive|Wild), which means 'InvasiveORWild` independently of the case.

Output:

    Invasive  invasive  Wild  wild
ID                                
1         13         5    11     0
2          1         8     1    17
3          1        10     0     5

edit:

the output that you show needs a further summation per row:

out = (df.filter(regex='(?i)(Invasive|Wild)')
         .groupby(df['ID']).sum()
         .sum(axis=1)
         .reset_index(name='Output')
      )

# or with summation before:
out = (df.filter(regex='(?i)(Invasive|Wild)')
         .sum(axis=1)
         .groupby(df['ID']).sum()
         .reset_index(name='Output')
      )

Output:

   ID  Output
0   1      29
1   2      27
2   3      16
  • Related