Home > OS >  Pandas DataFrame how to compare each cell of a colum with each cell of another column in another dat
Pandas DataFrame how to compare each cell of a colum with each cell of another column in another dat

Time:09-07

I have an enormous dataframe (D1) which is just a column of ID's (around 700,000) and I have another dataframe (D2) which consists of data related to people and one of its columns represents and ID. I need to compare each cell from D2's column of IDs with each value in D1's column. The rows with a matching ID, must be eliminated from D2.

Here is the part of the code I made to compare and drop but I think I'm not using the drop function correctly. Due to the dataframes being too big and my pc or google colab crashing due to lack of memory, I had to divide D2 in smaller dataframes and add them to a dictionary (dataframe_collection) which is then used again to join each of its mini dataframes into a bigger one and print a csv file. What is happening is that the smaller dataframes are always empty and I'm being able to identify which IDs are present in both dataframes, and which are not.

times = math.ceil(D2.shape[0]/500)
print(times)
dataframe_collection = {}

for i in range(times):
  dataframe_collection[i] = D2.iloc[i*500: (i*500) 500]
  dataframe_collection[i] = dataframe_collection[i].drop(dataframe_collection[i][dataframe_collection[i]["ID"] == D1].index)

Data set examples:

data1 = ['23100100011930000021', '23100100011930000031','23100100011930000061','23100100011930000081','23100100011930000091','23100100011930000101','23100100011930000181']
D1 = pd.DataFrame(data1, columns=['ID'])


data2 = [['tom', '23100100011930000021', 'USA'], ['nick', '23100100011930000031', 'DEU'], ['juli', '23100100011930000061', 'USA'], ['charlie', '23100100011930000722', 'MEX']]
D2 = pd.DataFrame(data2, columns=['Name', 'Age', 'Country'])

In this case, the last register ['charlie', '23100100011930000722', 'MEX'] should be left in the final dataframe and the other ones should be dropped

CodePudding user response:

I think this is what you are looking for:

D2 = D2.loc[~D2.Age.isin(D1.ID)]
  • Related