Home > Software design >  filtering a dataframe using another dataframe
filtering a dataframe using another dataframe

Time:06-23

data = {'a':['a','b','c','d','e','f','g'],
        'b':['Y','N','Y','Y','Y','N','Y'],
        'c':['Qualified','Unqualified','Qualified','Unqualified','Qualified','Unqualified','Qualified']}
df = pd.DataFrame(data)

df_para = {'Y/N':['','y','n'],
        'Q/U':['unqualified','','unqualified']}
df_para = pd.DataFrame(df_para)

I would like to filter the df using df_para, My code is:

df_output = pd.DataFrame()

for para in df_para.iterrows():
    df_result = df
     # filter Q/U
    if '' not in df_para['Q/U']:
        mask_qu = df_result['c'].str.lower().isin(df_para['Q/U'])
        df_result = df_result.loc[(mask_qu)]
        
    # filter Y/N
    if '' not in df_para['Y/N']:
        mask_yn = df_result['b'].str.lower().isin(df_para['Y/N'])
        df_result = df_result.loc[(mask_yn)]

    df_output = df_output.append(df_result)

If I use my code, it returns all rows within df three times. However, the df_output should be like:

   a   b   c
1   b   N   Unqualified
3   d   Y   Unqualified
5   f   N   Unqualified
0   a   Y   Qualified
2   c   Y   Qualified
3   d   Y   Unqualified
4   e   Y   Qualified
6   g   Y   Qualified
1   b   N   Unqualified
5   f   N   Unqualified

How could I fix it?

CodePudding user response:

Reason is in operator test indices:

Using the Python in operator on a Series tests for membership in the index, not membership among the values.

If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like.

#pairs for filtering
cols = [('c','Q/U'), ('b','Y/N')]

#for each unique value in df_para filter rows in list
dfs = [df[df[a].str.lower().eq(x)] for a, b in cols for x in df_para[b].unique()]

#join subDataFrames
df_out = pd.concat(dfs)
print (df_out)
   a  b            c
1  b  N  Unqualified
3  d  Y  Unqualified
5  f  N  Unqualified
0  a  Y    Qualified
2  c  Y    Qualified
3  d  Y  Unqualified
4  e  Y    Qualified
6  g  Y    Qualified
1  b  N  Unqualified
5  f  N  Unqualified

CodePudding user response:

try this:

import pandas as pd
import numpy as np

data = {'a':['a','b','c','d','e','f','g'],
        'b':['Y','N','Y','Y','Y','N','Y'],
        'c':['Qualified','Unqualified','Qualified','Unqualified','Qualified','Qualified','Unqualified']}
df = pd.DataFrame(data)


df_result = df[df["c"] == "Unqualified"]
print(df_result)
print(type(df_result))
  • Related