Home > Software design >  Drop duplicate rows based on lookup or another dataframe
Drop duplicate rows based on lookup or another dataframe

Time:04-20

I have 2 dataframes. DF1 is my main dataframe, DF2 is another data set. I need to drop rows from DF1 contain "Security IDs" found in DF2. This new DF will be called DF3. How would I do this?

Sample Dataframe:

df1 = pd.DataFrame({'Security ID': ['10003', '10009', '10010', '10034', '10665'], 
               'SomeNum': [12,13,13,13,13],
               'Color of dogs': ['red','orange','red','red','red'],
              'Date1': [20120506, 20120506, 20120506,20120506,20120620],
              'Date2': [20120528, 20120507, 20120615,20120629,20120621]})

df2 = pd.DataFrame({'Security ID': ['10003', '10009', '10010', '70011', '70012','70034','70034', '70034'], 
               'SomeNum': [12,13,13,13,13,13,14,14],
               'Num2': [121,122,122,124,125,126,127,128],
              'InterestData': [20120506, 20120506, 20120506,20120506,20120620,20120506,20120206,20120405],
              'UnreleatedStuff': [20120528, 20120507, 20120615,20120629,20120621,20120629,20120506,20120506]})

Required Resulting Dataframe:

df3 = pd.DataFrame({'Security ID': ['10034', '10665'], 
               'SomeNum': [13,13],
               'Color of dogs': ['red','red'],
              'Date1': [20120506,20120620],
              'Date2': [20120629,20120621]})

How would I go about this? I tried df3=[~df2.index.isin(df1.index)] after changing the index by df.set_index('Security ID')

CodePudding user response:

Just negate the condition with the boolean NOT operator ~:

df3 = df1[~df1['Security ID'].isin(df2['Security ID'])]

output:

  Security ID  SomeNum Color of dogs     Date1     Date2
3       10034       13           red  20120506  20120629
4       10665       13           red  20120620  20120621

Another nice trick to know, to use several columns to drop, is to perform a reverse merge:

cols = ['Security ID'] # columns to use
(df1
 .merge(df2[cols].drop_duplicates(), how='outer', indicator=True)
 .query('_merge == "left_only"').drop(columns='_merge')
)

CodePudding user response:

IIUC you want this:

df3 = df1[~df1['Security ID'].isin(df2['Security ID'])]

CodePudding user response:

I have a less cool and more sql like solution:

Left join on Security ID and using indicator=True tells you the appearances of the row.

df3 = df1.merge(df2[['Security ID']], on='Security ID', how='left', indicator=True)

10034 and 10665 does not appear in df2 so filter for left_only in the merge column

df3 = df3[df3['_merge'] == 'left_only']

And then you can just drop the _merge column after that

df3.drop('_merge', 1)
  • Related