Filter Pandas Dataframe Under Multiple Conditions-CodePudding

I currently have a pandas Dataframe with 5 different instances

df = {

'Name': ['John', 'Mark', 'Kevin', 'Ron', 'Amira'

'ID': [110,111,112,113,114]

'Job title': ['xox','xoy','xoz','yow','uyt']

'Manager': ['River' 'Trevor', 'John', 'Lydia', 'Connor']

'M2': ['Shaun', 'Mary', 'Ronald', 'Cary', 'Miranda']

'M3': ['Clavis', 'Sharon', 'Randall', 'Mark', Doug']

'M4': ['Pat', 'Karen', 'Brad', 'Chad', 'Anita']

'M5': ['Ty', 'Jared', 'Bill', 'William', 'Bob']

'Location': ['US', 'US', 'JP', 'CN', 'JA']

}

list = ['River', 'Pat', 'Brad', 'William', 'Clogah']

I need to filter and drop all rows in the pandas dataframe that contain 0 values from my list and also those that contain more than one value from my list. In the case above the instances in row 1 and row 2 would be dropped because there's two of the names in the specific row within the list.

IN ROW 1 i.e. (1: 'John', 110, 'xox, 'River', 'Shaun', 'Clavis', 'Pat', 'Ty', 'US'): SEE BELOW -> IT WOULD BE DROPPED BECAUSE BOTH 'River' and 'Pat' are listed in the list

IN ROW 2 i.e. (2: 'Mark', 111, 'xoy, 'Trevor', 'Mary', 'Sharon', 'Karen', 'Jared', 'US'): SEE BELOW -> IT WOULD BE DROPPED BECAUSE BOTH 'Trevor' and 'Jared' are listed in the list

IN ROW 5 i.e. (5: 'Amira', 114, 'uyt', 'Connor', 'Miranda', 'Doug', 'Anita', 'Bob', 'JA'): SEE BELOW -> IT WOULD BE DROPPED BECAUSE the row does not contain any values from my list.

The two other instances would be kept.

Original Printed DF

0: 'Name', 'ID', 'Job title', 'Manager', 'M2', 'M3', 'M4', 'M5', 'Location'

1: 'John', 110, 'xox, 'River', 'Shaun', 'Clavis', 'Pat', 'Ty', 'US'

2: 'Mark', 111, 'xoy, 'Trevor', 'Mary', 'Sharon', 'Karen', 'Jared', 'US'

3: 'Kevin', 112, 'xoz, 'John', 'Ronald', 'Randall', 'Brad', 'Bill', 'JP

4: 'Ron', 113, 'yow', 'Lydia', 'Cary', 'Mark', 'Chad', 'William', 'CN'

5: 'Amira', 114, 'uyt', 'Connor', 'Miranda', 'Doug', 'Anita', 'Bob', 'JA'

Filtered Printed DF

3: 'Kevin', 112, 'xoz, 'John', 'Ronald', 'Randall', 'Brad', 'Bill', 'JP',

4: 'Ron', 113, 'yow', 'Lydia', 'Cary', 'Mark', 'Chad', 'William', 'CN',

The current process only filters out rows that don't contain a value equal to any value in my managers list. I want to keep rows with one manager from the list but not rows without mangers from the lis

CodePudding user response：

Not the prettiest way to achieve this, but this will work:

d = {
    "Name": ["John", "Mark", "Kevin", "Ron", "Amira"],
    "ID": [110, 111, 112, 113, 114],
    "Job title": ["xox", "xoy", "xoz", "yow", "uyt"],
    "M1": ["River", "Trevor", "John", "Lydia", "Connor"],
    "M2": ["Shaun", "Mary", "Ronald", "Cary", "Miranda"],
    "M3": ["Clavis", "Sharon", "Randall", "Mark", "Doug"],
    "M4": ["Pat", "Karen", "Brad", "Chad", "Anita"],
    "M5": ["Ty", "Jared", "Bill", "William", "Bob"],
    "Location": ["US", "US", "JP", "CN", "JA"],
}
df = pd.DataFrame(d)

# Isolate managers in their own DataFrame
managers = ["River", "Pat", "Trevor", "Jared", "Connor"]
df_managers = df[["M1", "M2", "M3", "M4", "M5"]]

# Assess any one employee has less than two managers and isolate those employees
less_than_two_managers = []
for i in range(df_managers.shape[0]):
    if len(set(df_managers.iloc[i]).intersection(set(managers))) < 2:
        less_than_two_managers.append(True)
    else:
        less_than_two_managers.append(False)

df["LT two managers"] = less_than_two_managers
df[df["LT two managers"] == True]

CodePudding user response：

here you go:

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mark', 'Kevin', 'Ron', 'Amira'],
                   'ID': [110, 111, 112, 113, 114],
                   'Job title': ['xox', 'xoy', 'xoz', 'yow', 'uyt'],
                   'Manager': ['River', 'Trevor', 'John', 'Lydia', 'Connor'],
                   'M2': ['Shaun', 'Mary', 'Ronald', 'Cary', 'Miranda'],
                   'M3': ['Clavis', 'Sharon', 'Randall', 'Mark', 'Doug'],
                   'M4': ['Pat', 'Karen', 'Brad', 'Chad', 'Anita'],
                   'M5': ['Ty', 'Jared', 'Bill', 'William', 'Bob'],
                   'Location': ['US', 'US', 'JP', 'CN', 'JA']}
                  )
managers = ['River', 'Pat', 'Trevor', 'Jared', 'Connor']

mask = df.applymap(lambda x: x in managers)
filtered_df = df[mask.values.sum(axis=1) < 2]

print(filtered_df)

to filter also the 0 (so only 1 manager will stay):

filtered_df = df[mask.values.sum(axis=1) == 1]

CodePudding user response：

Vectorial solution using a mask:

m = (df.filter(regex=r'^M')
       .apply(lambda s: s.isin(lst))
       .sum(1).eq(1)
    )

out = df.loc[m]

Output:

    Name   ID Job title Manager      M2       M3    M4       M5 Location
2  Kevin  112       xoz    John  Ronald  Randall  Brad     Bill       JP
3    Ron  113       yow   Lydia    Cary     Mark  Chad  William       CN