Home > Software design >  Filter Pandas Dataframe Under Multiple Conditions
Filter Pandas Dataframe Under Multiple Conditions

Time:07-12

My current progress

I currently have a pandas Dataframe with 5 different instances

df = {

'Name': ['John', 'Mark', 'Kevin', 'Ron', 'Amira'

'ID': [110,111,112,113,114]

'Job title': ['xox','xoy','xoz','yow','uyt']

'Manager': ['River' 'Trevor', 'John', 'Lydia', 'Connor']

'M2': ['Shaun', 'Mary', 'Ronald', 'Cary', 'Miranda']

'M3': ['Clavis', 'Sharon', 'Randall', 'Mark', Doug']

'M4': ['Pat', 'Karen', 'Brad', 'Chad', 'Anita']

'M5': ['Ty', 'Jared', 'Bill', 'William', 'Bob']

'Location': ['US', 'US', 'JP', 'CN', 'JA']

}

list = ['River', 'Pat', 'Brad', 'William', 'Clogah']

I need to filter and drop all rows in the pandas dataframe that contain 0 values from my list and also those that contain more than one value from my list. In the case above the instances in row 1 and row 2 would be dropped because there's two of the names in the specific row within the list.

IN ROW 1 i.e. (1: 'John', 110, 'xox, 'River', 'Shaun', 'Clavis', 'Pat', 'Ty', 'US'): SEE BELOW -> IT WOULD BE DROPPED BECAUSE BOTH 'River' and 'Pat' are listed in the list

IN ROW 2 i.e. (2: 'Mark', 111, 'xoy, 'Trevor', 'Mary', 'Sharon', 'Karen', 'Jared', 'US'): SEE BELOW -> IT WOULD BE DROPPED BECAUSE BOTH 'Trevor' and 'Jared' are listed in the list

IN ROW 5 i.e. (5: 'Amira', 114, 'uyt', 'Connor', 'Miranda', 'Doug', 'Anita', 'Bob', 'JA'): SEE BELOW -> IT WOULD BE DROPPED BECAUSE the row does not contain any values from my list.

The two other instances would be kept.

Original Printed DF

0: 'Name', 'ID', 'Job title', 'Manager', 'M2', 'M3', 'M4', 'M5', 'Location'

1: 'John', 110, 'xox, 'River', 'Shaun', 'Clavis', 'Pat', 'Ty', 'US'

2: 'Mark', 111, 'xoy, 'Trevor', 'Mary', 'Sharon', 'Karen', 'Jared', 'US'

3: 'Kevin', 112, 'xoz, 'John', 'Ronald', 'Randall', 'Brad', 'Bill', 'JP

4: 'Ron', 113, 'yow', 'Lydia', 'Cary', 'Mark', 'Chad', 'William', 'CN'

5: 'Amira', 114, 'uyt', 'Connor', 'Miranda', 'Doug', 'Anita', 'Bob', 'JA'

Filtered Printed DF

3: 'Kevin', 112, 'xoz, 'John', 'Ronald', 'Randall', 'Brad', 'Bill', 'JP',

4: 'Ron', 113, 'yow', 'Lydia', 'Cary', 'Mark', 'Chad', 'William', 'CN',

The current process only filters out rows that don't contain a value equal to any value in my managers list. I want to keep rows with one manager from the list but not rows without mangers from the lis

CodePudding user response:

Not the prettiest way to achieve this, but this will work:

d = {
    "Name": ["John", "Mark", "Kevin", "Ron", "Amira"],
    "ID": [110, 111, 112, 113, 114],
    "Job title": ["xox", "xoy", "xoz", "yow", "uyt"],
    "M1": ["River", "Trevor", "John", "Lydia", "Connor"],
    "M2": ["Shaun", "Mary", "Ronald", "Cary", "Miranda"],
    "M3": ["Clavis", "Sharon", "Randall", "Mark", "Doug"],
    "M4": ["Pat", "Karen", "Brad", "Chad", "Anita"],
    "M5": ["Ty", "Jared", "Bill", "William", "Bob"],
    "Location": ["US", "US", "JP", "CN", "JA"],
}
df = pd.DataFrame(d)

# Isolate managers in their own DataFrame
managers = ["River", "Pat", "Trevor", "Jared", "Connor"]
df_managers = df[["M1", "M2", "M3", "M4", "M5"]]

# Assess any one employee has less than two managers and isolate those employees
less_than_two_managers = []
for i in range(df_managers.shape[0]):
    if len(set(df_managers.iloc[i]).intersection(set(managers))) < 2:
        less_than_two_managers.append(True)
    else:
        less_than_two_managers.append(False)

df["LT two managers"] = less_than_two_managers
df[df["LT two managers"] == True]

CodePudding user response:

here you go:

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mark', 'Kevin', 'Ron', 'Amira'],
                   'ID': [110, 111, 112, 113, 114],
                   'Job title': ['xox', 'xoy', 'xoz', 'yow', 'uyt'],
                   'Manager': ['River', 'Trevor', 'John', 'Lydia', 'Connor'],
                   'M2': ['Shaun', 'Mary', 'Ronald', 'Cary', 'Miranda'],
                   'M3': ['Clavis', 'Sharon', 'Randall', 'Mark', 'Doug'],
                   'M4': ['Pat', 'Karen', 'Brad', 'Chad', 'Anita'],
                   'M5': ['Ty', 'Jared', 'Bill', 'William', 'Bob'],
                   'Location': ['US', 'US', 'JP', 'CN', 'JA']}
                  )
managers = ['River', 'Pat', 'Trevor', 'Jared', 'Connor']

mask = df.applymap(lambda x: x in managers)
filtered_df = df[mask.values.sum(axis=1) < 2]

print(filtered_df)

to filter also the 0 (so only 1 manager will stay):

filtered_df = df[mask.values.sum(axis=1) == 1]

CodePudding user response:

Vectorial solution using a mask:

m = (df.filter(regex=r'^M')
       .apply(lambda s: s.isin(lst))
       .sum(1).eq(1)
    )

out = df.loc[m]

Output:

    Name   ID Job title Manager      M2       M3    M4       M5 Location
2  Kevin  112       xoz    John  Ronald  Randall  Brad     Bill       JP
3    Ron  113       yow   Lydia    Cary     Mark  Chad  William       CN
  • Related