I have two dataframes, DF1 and DF2, and need to find each identical row and get the index of the row as it appears in DF2.
Each row in DF2 is unique; DF1, however, will contain duplicate rows.
Each row in DF1 will always match a row in DF2.
Essentially I need to create a key that shows the index of where each row in DF1 occurs in DF2.
This is my first time posting to Stack Overflow, and I am not sure how best to show the data frames, but I hope this makes sense.
import pandas as pd

# Define dictionaries containing the data
df1 = {'A': [1, 2, 3, 4, 1], 'C': [9, 10, 11, 12, 9]}
df2 = {'A': [1, 2, 3, 4], 'C': [9, 10, 11, 12]}
# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
I have written the following code, which works and produces the desired output.
Code:
key = []
for i in range(len(df1)):
    row1 = df1.loc[i, "A"], df1.loc[i, "C"]  # isolate columns A and C from df1
    for j in range(len(df2)):
        row2 = df2.loc[j, "A"], df2.loc[j, "C"]  # isolate columns A and C from df2
        if row1 == row2:
            key.append(j)  # if the rows match, record the index of df2
            break
Output:
>>> key
[0, 1, 2, 3, 0]
The problem is that the file I am working on is very large, and this nested-loop approach is extremely slow. Is there a faster way to find the indices of the identical rows?
CodePudding user response:
Convert the index to a column with DataFrame.reset_index so that DataFrame.merge does not discard it; merge on the intersection of the column names of both DataFrames, then sort by the index column and generate the first value per group of all columns with GroupBy.transform('first'):
a = (df1.reset_index()
     .merge(df2.drop_duplicates())
     .sort_values('index')
     .groupby(df1.columns.tolist())['index']
     .transform('first')
     .tolist())
print(a)
[0, 1, 2, 3, 0]
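For reference, here is a shorter variant (a sketch, not part of the original answer) that reads the positions directly from df2's own index: reset_index on df2 keeps that index as an ordinary column through the merge, and how='left' preserves df1's row order. Because the positions come from df2 itself, this also holds when df2's rows are not in the same order as their first occurrences in df1.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 1], 'C': [9, 10, 11, 12, 9]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4], 'C': [9, 10, 11, 12]})

# Keep df2's index as a regular column so the merge does not drop it,
# then left-join df1 against it on the shared columns (A and C).
key = df1.merge(df2.reset_index(), how='left')['index'].tolist()
print(key)  # [0, 1, 2, 3, 0]
```

With no `on=` argument, merge joins on all columns common to both frames (here A and C), so the snippet assumes every shared column is part of the row identity, as in the question.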