Home > Software engineering >  Filter pandas DataFrame with condition
Filter pandas DataFrame with condition

Time:05-04

I am trying to filter a pandas DataFrame with a specific condition.

import pandas as pd

df = pd.DataFrame({
                   'name': ['A','A','C','C','E','E'],
                   'cat': [1, 1, 1, 0, 2, 3]
                  })

I'd like to filter by cat == 1. df[df['cat'] == 1]

However, if 1 doesn't exist as a cat for a name, then get the first cat. In this case, A and C will be filtered by 1, for E, it would be 2.

Expected output:

name cat
   A   1
   C   1
   E   2

Second part to this question:

I have additional columns that need to be retained in the DataFrame.

df = pd.DataFrame({
                       'name': ['A','A','C','C','E','E'],
                       'lbl': ['z','z','y','y','u','u'],
                       'cat': [1, 1, 1, 0, 2, 3],
                       'val1': [10, 20, 5, 10, 10, 0],
                       'yr': [90, 91, 90, 90, 90, 90]
                  })

The categorical variables will be part of the groupby, however, for numerical variables, I need to apply statistical functions / aggregations. mean for val1 and mode for yr.

Expected output:

name lbl cat val1  yr
A     z   1   15   90
C     y   1   5    90
E     u   2   10   90

If there is a tie in mode, as is for rows with name A, select the first value.

CodePudding user response:

First we group based on the name column, Then we apply a lambda function that check the values of the cat column for each group if there is a 1 there or it will return the first value of the group. Use:

df.groupby('name')['cat'].apply(lambda x: 1 if 1 in x.values else x.values[0]).reset_index()

Output:

enter image description here

CodePudding user response:

Another version:

out = df.sort_values(
    "cat", kind="stable", key=lambda x: x != 1
).drop_duplicates("name")

print(out)

Prints:

  name  cat
0    A    1
2    C    1
4    E    2

CodePudding user response:

You can groupby name column then apply a customized function to get desired rows from each group.

out = (df.groupby('name')
       .apply(lambda g: g[g['cat'].eq(1)] if g['cat'].eq(1).any() else g.head(1))
       .reset_index(drop=True))
print(out)

  name  cat
0    A    1
1    C    1
2    E    2
  • Related