Home > Blockchain >  Pandas loop over 2 dataframe and drop duplicates
Pandas loop over 2 dataframe and drop duplicates

Time:09-25

I have 2 csv files with some random numbers, as follow:

csv1.csv

0       906018
1       007559
2       910475
3       915104
4       600393
         ...  
5070    907525
5071    903079
5072    001910
5073    909735
5074    914861

length 5075

csv2.csv

0         5555
1         7859
2       501303
3       912414
4       913257
         ...  
7497    915031
7498    915030
7499    915033
7500    902060
7501    915038

length 7502

Some elements in csv1 are present in csv2 but I don't know exactly which one and I would like to extract those unique values. So my idea was to start merging together the 2 data frame, and than remove the duplicates.

so I wrote the following code:

import pandas as pd
import csv


unique_users = pd.read_csv('./csv1.csv')
unique_users['id']

identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber


df = pd.concat([identityNumber, unique_users])

Until here everything is perfect and the length of the df is the sum of the 2 length, but I realised the part where I got stuck.

the df concat it did its job and concat based on the index, so now I have tons of NaN.

and when I use the code:

final_result = df.drop_duplicates(keep=False)

The data frame does not drop any value because the df structure now look like this:

Identitynumber.    ID
5555               NaN

so I guess that drop duplicate is looking for the same exact values, but as they don't exist it just keeps it.

So what I would like to do, is loop over both data frame, and if a value in csv1 exists in csv2, I want them to be dropped.

Can anyone help with this please?

And please if you need more info just let me know.

CodePudding user response:

This will remove the duplicates between your two dataframes and keep all the records in one dataframe df.

df = pandas.concat([df1,df2]).drop_duplicates().reset_index(drop=True)

CodePudding user response:

You are getting NaN because when you concatenate, Pandas doesn't know what you want to do with the different column names of your two dataframes. One of your dataframes has an IdentityNumber column and the other has an ID column. Pandas can't figure out what you want, so it puts both columns into the resulting dataframe.

Try this:

pd.concat([df1["IDNumber"], df2["ID"]]).drop_duplicates().reset_index(drop=True)
  • Related