How to compare two data frames and get index of identical rows (Python)

Time:04-25

I have two DataFrames, DF1 and DF2, and need to find each identical row and get the index of that row as it appears in DF2. Each row in DF2 is unique, but DF1 contains duplicate rows.
Every row in DF1 always matches a row in DF2.

Essentially I need to create a key that shows the index of where each row in DF1 occurs in DF2.

This is my first time posting to Stack Overflow, and I am not sure how to actually show the data frames, but I hope this makes sense.

import pandas as pd

# Define dictionaries containing the data
df1 = {'A': [1, 2, 3, 4, 1], 'C': [9, 10, 11, 12, 9]}

df2 = {'A': [1, 2, 3, 4], 'C': [9, 10, 11, 12]}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)

I have written the following code, which works and produces the desired output.

Code:

key = []
for i in range(len(df1)):
  row1 = df1.loc[i, "A"], df1.loc[i, "C"]  # Isolate the A and C values from row i of df1

  for j in range(len(df2)):
    row2 = df2.loc[j, "A"], df2.loc[j, "C"]  # Isolate the A and C values from row j of df2

    if row1 == row2:
      key.append(j)  # If they are the same, add the df2 index to the list
      break

Output:

>>> key
[0, 1, 2, 3, 0]

The problem is that the file I am working on is very large, and this code is extremely slow. Is there a faster way to find the index of the identical rows?

User response:

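Move df1's index into a column with reset_index so that DataFrame.merge does not discard it, then merge on all columns the two DataFrames have in common. Sort by the index column and, with GroupBy.transform('first'), take the first index value within each group of identical rows. Note that these values are positions of first occurrence in df1, so they equal the df2 index only when df2 lists its rows in the order they first appear in df1, as in the sample data: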

a = (df1.reset_index()
        .merge(df2.drop_duplicates())
        .sort_values('index')
        .groupby(df1.columns.tolist())['index']
        .transform('first')
        .tolist())
print(a)
[0, 1, 2, 3, 0]
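
A possible alternative, not part of the original answer: if the goal is the index as it appears in df2 regardless of row order, one option is to reset the index on df2 rather than df1 and do a left merge. This is only a sketch. The column name df2_index and the reordered df2 are assumptions chosen for illustration, so that positions in df2 differ from first occurrences in df1. It also relies on the question's guarantees that df2 rows are unique and every df1 row has a match.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 1], 'C': [9, 10, 11, 12, 9]})
# Assumption for illustration: df2 holds the same rows in a different order
df2 = pd.DataFrame({'A': [4, 3, 2, 1], 'C': [12, 11, 10, 9]})

cols = df1.columns.tolist()

# Turn df2's index into an ordinary column (hypothetical name 'df2_index')
lookup = df2.rename_axis('df2_index').reset_index()

# how='left' keeps one output row per df1 row, in df1's original order,
# because each key combination occurs only once in df2
key = df1.merge(lookup, on=cols, how='left')['df2_index'].tolist()
print(key)  # [3, 2, 1, 0, 3]
```

Because the merge is vectorised, this avoids the nested Python loops from the question. If some df1 row had no match in df2, the df2_index column would contain NaN for that row, so the result would need checking first.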