I want to remove the rows from df1 which are also found in df2 in the same column.
df1 have additional columns compared to df2 which I have not included here to simplify the question
df1 = pd.date_range(start='1/1/2022', end='1/04/2022', freq='D')
df2 = pd.date_range(start='1/1/2022', end='1/08/2022', freq='D')
df1 = pd.DataFrame(df1, columns=['date'])
df2 = pd.DataFrame(df2, columns=['date'])
# this line does not remove the duplicates
df3 = df2.drop_duplicates(df1.columns[0:])
The final result should be
0 2022-01-05
1 2022-01-06
2 2022-01-07
3 2022-01-08
CodePudding user response:
Drop columns in df1 which are also found in df2
df1.drop(columns=df2.columns, errors='ignore', inplace=True)
or
df1 = df1.drop(columns=df2.columns, errors='ignore')
Drop rows in df1 and also in df2 in a specific column say date
Following your edit, if it is a single column like date, please try
df1[~df1['date'].isin(df2['date'])]
If it is a check on multiple columns, it can also be done. However, we will need more info. What happens if column1 in both df has same values in df1 and df2 and in the same row a column2 in both df has different values.?
CodePudding user response:
drop_duplicates
only works for rows. If you want to remove columns from one df that are in another df, use ~df2.columns.isin(df1.columns)
to return False for columns that are in df1
(False
= should not be kept), and True
for columns that are not in df1
(True
= should be kept). Then pass the resulting column mask to .loc
, at the second position.
This line will remove the columns from df2
that are in df1
:
df2 = df2.loc[:, ~df2.columns.isin(df1.columns)]