How to merge 2 dataframes on multiple columns containing duplicates-CodePudding

so I have 2 dataframes df1 and df2.

df1

names = ['Bob', 'Joe', '', 'Bob', '0000', 'Alice', 'Joe', 'Alice', 'Bob', '']
df1 = pd.DataFrame({'names': names,'ages': ages})

df2

names_in_student_db = [' Bob', ' Joe ', '', ' Bob ', 'Chris', 'Alice', 'Joe ', 'Alice ', ' Bob ', 'Daniel']
df2 = pd.DataFrame({'student_names': names_in_student_db,'grades': grades})

Now, I want to merge these 2 dataframes but obviously, there are 2 problems:

names and names_in_student_db are not fully identical.
Both of them contain duplicates — this seems to be making merge functions to throw an error. Also, duplicates in one column are not the same (meaning let's say, 1st Bob and 3rd Bob in any of these columns are not the same person), but let's say the 2nd Bob in 1st column and 2nd Bob in the 2nd column are the same person.

So how do I write a general code (not tailored for these specific dataframes) to solve this? I'm looking for outer join btw.

My guess is I could create another column in each dataframe, let's call it 'order' column which basically would be basically integers from 0 to 9. And then if I could merge dataframes based on 2 columns (I mean matching 'order1' column with 'order2' and 'names' with 'student_names'). Is that possible? I think that still throws a duplicate-related error though.

CodePudding user response：

If they always match up based on index, you can just concat them together and then drop columns you no longer want.

pd.concat([df1, df2], axis=1)

CodePudding user response：

You could clean up the student names, and assign a sequence number to all repeated names (on both DFs), assuming the order is the same. In reality, you would rather use a last name and optionally some other identifier, just to make sure you are joining the right people with their grades :-)

z = df2.assign(names=df2['student_names'].str.strip())

out = (
    df1
    .assign(seq=df1.groupby('names').cumcount())
    .merge(
        z
        .assign(seq=z.groupby('names').cumcount()),
        on=['names', 'seq'],
        how='left',
    )
)

>>> out
   names  ages  seq student_names grades
0    Bob    33    0           Bob      A
1    Joe    45    0          Joe       F
2           21    0                    B
3    Bob    38    1          Bob       F
4   0000    44    0           NaN    NaN
5  Alice    10    0         Alice      C
6    Joe    10    1          Joe       C
7  Alice    46    1        Alice       A
8    Bob    15    2          Bob       B
9           48    1           NaN    NaN

PS: actual setup

The setup in the question was incomplete (missing ages and grades), so I improvised:

names = ['Bob', 'Joe', '', 'Bob', '0000', 'Alice', 'Joe', 'Alice', 'Bob', '']
ages = np.random.randint(10, 50, len(names))
df1 = pd.DataFrame({'names': names,'ages': ages})

names_in_student_db = [' Bob', ' Joe ', '', ' Bob ', 'Chris', 'Alice', 'Joe ', 'Alice ', ' Bob ', 'Daniel']
grades = np.random.choice(list('ABCDF'), len(names_in_student_db))
df2 = pd.DataFrame({'student_names': names_in_student_db,'grades': grades})

CodePudding user response：

Specify the column name like this

pd.merge(left=df1, right=df2, left_on='names', right_on='student_names', how='left')

depending on your expected result.