Filter DF rows using another DF with same amount of columns

Time:04-14

I have two DataFrames, and I want to filter the first one using the second. My question is, how do I write this properly? That is, how do I filter out any rows in the first DataFrame that do not match at least one row in the second DataFrame?

df1
SNP_1 SNP_2 SNP_3 SNP_4
A A A A
A C C G
A C T C
A C C G
df2
SNP_1 SNP_2 SNP_3 SNP_4
A A A A
C C C C
T T T T
A G T G

In reality these DataFrames are different shapes

https://imgur.com/a/MIVMie9

I have tried the following and get "ValueError: Unable to coerce to DataFrame, shape must be (11, 11): given (9020, 11)":

df1.eq(df2.to_numpy())

I have also tried this and get an error: the list object has no attribute all(). But how would I pass a list, from a list of lists, while also iterating over each list's values and comparing them against the values in df2? Will the solution look more like this method or the one above?

df2.filter(regex='._', axis=1).to_numpy() == [i for i in df1].all().sum()
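For context on why the first attempt fails: df1.eq(df2.to_numpy()) compares element-wise, so both frames must have identical shapes. A common alternative (not from the question itself, and a sketch only, with the SNP_1…SNP_4 column names assumed from the sample tables) is a row-wise membership test, converting each row to a tuple:

```python
import pandas as pd

# Sample frames from the question
df1 = pd.DataFrame([list("AAAA"), list("ACCG"), list("ACTC"), list("ACCG")],
                   columns=["SNP_1", "SNP_2", "SNP_3", "SNP_4"])
df2 = pd.DataFrame([list("AAAA"), list("CCCC"), list("TTTT"), list("AGTG")],
                   columns=["SNP_1", "SNP_2", "SNP_3", "SNP_4"])

# Collect df2's rows as tuples, then test each df1 row for membership.
# This works regardless of the two frames' row counts.
df2_rows = set(map(tuple, df2.to_numpy()))
mask = df1.apply(tuple, axis=1).isin(df2_rows)
filtered = df1[mask]
```

Here only the first row of df1 ("A A A A") appears in df2, so only it survives the filter.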

CodePudding user response:

So as I understand it, you are looking to find rows in df1 where every column's value has a corresponding match in df2? If so, I think I have this figured out, but my method will not be optimal for large DataFrames since it employs .iterrows().

# find the unique rows from df2
df2_unique = df2.drop_duplicates()

# initialize a boolean list
result_bool_mask = []
for idx, row in df1.iterrows():

    # append this row onto the end of df2's unique values and see if it's duplicated
    # here I use the keys parameter to add a MultiIndex to the result so we can easily find our row
    result = pd.concat([df2_unique, row.to_frame().T], keys=['df2', 'df1']).duplicated().loc['df1']

    # append the result onto our list
    result_bool_mask.append(result.values[0])

# use the boolean mask to filter df1 down to just the rows that entirely match a row from df2:
df1.loc[result_bool_mask]
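As a self-contained sketch, the same approach applied to the question's sample frames (column names SNP_1…SNP_4 assumed):

```python
import pandas as pd

df1 = pd.DataFrame([list("AAAA"), list("ACCG"), list("ACTC"), list("ACCG")],
                   columns=["SNP_1", "SNP_2", "SNP_3", "SNP_4"])
df2 = pd.DataFrame([list("AAAA"), list("CCCC"), list("TTTT"), list("AGTG")],
                   columns=["SNP_1", "SNP_2", "SNP_3", "SNP_4"])

df2_unique = df2.drop_duplicates()
result_bool_mask = []
for idx, row in df1.iterrows():
    # Stack this row under df2's unique rows; duplicated() flags it
    # if and only if it already appears somewhere in df2.
    result = pd.concat([df2_unique, row.to_frame().T],
                       keys=['df2', 'df1']).duplicated().loc['df1']
    result_bool_mask.append(result.values[0])

filtered = df1.loc[result_bool_mask]
```

Only the "A A A A" row of df1 matches a row of df2, so filtered contains just that one row.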

Hope this helps! This is my first time answering a question so please be nice :)

CodePudding user response:

You can create an auxiliary dataframe by merging df1 (with an added index column) with df2 (with an added flag column). The flag column tells whether the merge keys matched.

df_ = df1.assign(Index=df1.index).merge(df2.assign(flag=1), on=['SNP_1', 'SNP_2', 'SNP_3', 'SNP_4'], how='left')
print(df_)

  SNP_1 SNP_2 SNP_3 SNP_4  Index  flag
0     A     A     A     A      0   1.0
1     A     C     C     G      1   NaN
2     A     C     T     C      2   NaN
3     A     C     C     G      3   NaN

Then use the auxiliary dataframe's Index column to tell which of the original dataframe's rows match.

filter_mask = df1.index.isin(df_[df_['flag'].eq(1)]['Index'].drop_duplicates())
print(filter_mask)

[ True False False False]
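Putting the whole approach together on the question's sample frames (a sketch, with the SNP_1…SNP_4 column names assumed), the mask then filters df1 directly:

```python
import pandas as pd

df1 = pd.DataFrame([list("AAAA"), list("ACCG"), list("ACTC"), list("ACCG")],
                   columns=["SNP_1", "SNP_2", "SNP_3", "SNP_4"])
df2 = pd.DataFrame([list("AAAA"), list("CCCC"), list("TTTT"), list("AGTG")],
                   columns=["SNP_1", "SNP_2", "SNP_3", "SNP_4"])

cols = ["SNP_1", "SNP_2", "SNP_3", "SNP_4"]

# Left-merge df1 (tagged with its index) against df2 (tagged with flag=1);
# flag stays NaN for df1 rows with no match in df2.
df_ = df1.assign(Index=df1.index).merge(df2.assign(flag=1), on=cols, how='left')
filter_mask = df1.index.isin(df_[df_['flag'].eq(1)]['Index'].drop_duplicates())
filtered = df1[filter_mask]
```

The drop_duplicates() on the Index column matters when df2 contains repeated rows, since the merge would otherwise duplicate matching df1 rows.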