I have two DataFrames, and I will filter the first one using the second one. My question is, how do I write this properly? Filter out any rows in the first DataFrame that do not match at least one row in the second DataFrame?
df1
SNP_1 | SNP_2 | SNP_3 | SNP_3 |
---|---|---|---|
A | A | A | A |
A | C | C | G |
A | C | T | C |
A | C | C | G |
df2
SNP_1 | SNP_2 | SNP_3 | SNP_3 |
---|---|---|---|
A | A | A | A |
C | C | C | C |
T | T | T | T |
A | G | T | G |
In reality these DataFrames are different shapes
I have tried and get an error. That error says "ValueError: Unable to coerce to DataFrame, shape must be (11, 11): given (9020, 11)"
df1.eq(df2.to_numpy())
Also have tried this and get an error. The error here is that the list object has no attribute all(). But how will I pass a list, from a list of lists, while also iterating over each lists values and comparing against the values in df2? Will it be more like this method or the one above?
df2.filter(regex='._', axis=1).to_numpy() == [i for i in df1].all().sum()
CodePudding user response:
So as I understand it you are looking to find rows in df1 where every column's value has a corresponding match in df2? If so, I think I have this figured out, but my method will not be optimal for large dataframes since I employ .itterrows().
#find unique rows from df2
df2_unique = df2.drop_duplicates()
#initalize a boolean list
result_bool_mask = []
for idx,row in df1.iterrows():
# append this row onto the end of df2's unique vlaues and see if it's duplicated
# here i use the keys paramater to add a multiindex to the result so we can easily find our result
result = pd.concat([df2_unique, row.to_frame().T],keys=['df2','df1']).duplicated().loc['df1']
#append the result onto our list
result_bool_mask.append(result.values[0])
# use the boolean mask to filter down df2 to just the rows that entirely match a row from df1:
df1.loc[result_bool_mask]
Hope this helps! This is my first time answering a question so please be nice :)
CodePudding user response:
You can create an auxiliary dataframe by merging df1 and index column with df2 and flag
column. The flag
column tells if the merge keys match.
df_ = df1.assign(Index=df1.index).merge(df2.assign(flag=1), left_on=['SNP_1', 'SNP_2', 'SNP_3', 'SNP_4'], right_on=['SNP_1', 'SNP_2', 'SNP_3', 'SNP_4'], how='left')
print(df_)
SNP_1 SNP_2 SNP_3 SNP_4 Index flag
0 A A A A 0 1.0
1 A C C G 1 NaN
2 A C T C 2 NaN
3 A C C G 3 NaN
Then use auxiliary dataframe Index
column to tell if the original dataframe rows match.
filter_mask = df1.index.isin(df_[df_['flag'].eq(1)]['Index'].drop_duplicates())
print(filter_mask)
[ True False False False]