Home > Back-end >  how to select rows based on two condition pandas
how to select rows based on two condition pandas

Time:10-23

I want to filter rows that are in df_1 that match the following two condition

  • Row from df_1 should not be in df_2

    OR

  • If row from df_1 is in df_2 then it must have a value Yes in df_2

Code I tried, that does not work

import pandas as pd

df_1 = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df_2 = pd.DataFrame({'a': [2, 4, 6, 8, 10], 'b': ['Yes', 'No', 'Yes', 'No', 'No']})

df = df_1[(~df_1.a.isin(df_2.a)) | (df_2.b=='Yes')]

Output

   a
0  1
2  3
4  5
6  7
8  9

Expected Output

   a
0  1
1  2
2  3
3  6
4  5
6  7
8  9

Explanation Rows 1, 3, 5, 7, 9 are not in df_2 so they are part of the output Rows 2 and 6 are in df_2 but the have column b as Yes so they are part of the output

CodePudding user response:

I think you can create two frames and concat it as below.

import pandas as pd

df_1 = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df_2 = pd.DataFrame({'a': [2, 4, 6, 8, 10], 'b': ['Yes', 'No', 'Yes', 'No', 'No']})

df = [df_1[(~df_1.a.isin(df_2.a))]['a'],df_2[df_2.b=='Yes']['a']]
result = pd.concat(df).sort_values()
print(result)

output

enter image description here

CodePudding user response:

The problem in your calculation stems from the fact that df_1 and df_2 are not the same size and logical operators between their columns does not make sense. If you map your values in df_1 to Yes, No or NaN using df_2 then you will have equal length columns that can be compared

df_1[(~df_1["a"].isin(df_2["a"])) | (df_1["a"].map(df_2.set_index("a")["b"]) == "Yes")]

CodePudding user response:

You can use:

df = df_1.merge(df_2, how='left', on='a')
print(df[df.b.isin(['Yes', np.nan])][['a']])

OUTPUT

   a
0  1
1  2
2  3
4  5
5  6
6  7
8  9

CodePudding user response:

This won't be the fastest solution but the steps should be easy enough for you to follow. When the two dataframes are merged using a union, NaN values are added in place of missing data. Afterwards you want to retain those rows so you only need to remove those which have the value No.

import pandas as pd

df_1 = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df_2 = pd.DataFrame({'a': [2, 4, 6, 8, 10], 'b': ['Yes', 'No', 'Yes', 'No', 'No']})

# Merge the dataframes using a union
df = df_1.merge(df_2, how='outer', on='a')
# Drop the rows where 'b' == 'No'
df.drop(df[df['b']=='No'].index, inplace=True)
# Drop column 'b'
df.drop('b', axis=1, inplace=True)
  • Related