so I have 2 dataframes df1 and df2.
df1
names = ['Bob', 'Joe', '', 'Bob', '0000', 'Alice', 'Joe', 'Alice', 'Bob', '']
df1 = pd.DataFrame({'names': names,'ages': ages})
df2
names_in_student_db = [' Bob', ' Joe ', '', ' Bob ', 'Chris', 'Alice', 'Joe ', 'Alice ', ' Bob ', 'Daniel']
df2 = pd.DataFrame({'student_names': names_in_student_db,'grades': grades})
Now, I want to merge these 2 dataframes but obviously, there are 2 problems:
- names and names_in_student_db are not fully identical.
- Both of them contain duplicates — this seems to be making merge functions to throw an error. Also, duplicates in one column are not the same (meaning let's say, 1st Bob and 3rd Bob in any of these columns are not the same person), but let's say the 2nd Bob in 1st column and 2nd Bob in the 2nd column are the same person.
So how do I write a general code (not tailored for these specific dataframes) to solve this? I'm looking for outer join btw.
My guess is I could create another column in each dataframe, let's call it 'order' column which basically would be basically integers from 0 to 9. And then if I could merge dataframes based on 2 columns (I mean matching 'order1' column with 'order2' and 'names' with 'student_names'). Is that possible? I think that still throws a duplicate-related error though.
CodePudding user response:
If they always match up based on index, you can just concat them together and then drop columns you no longer want.
pd.concat([df1, df2], axis=1)
CodePudding user response:
You could clean up the student names, and assign a sequence number to all repeated names (on both DFs), assuming the order is the same. In reality, you would rather use a last name and optionally some other identifier, just to make sure you are joining the right people with their grades :-)
z = df2.assign(names=df2['student_names'].str.strip())
out = (
df1
.assign(seq=df1.groupby('names').cumcount())
.merge(
z
.assign(seq=z.groupby('names').cumcount()),
on=['names', 'seq'],
how='left',
)
)
>>> out
names ages seq student_names grades
0 Bob 33 0 Bob A
1 Joe 45 0 Joe F
2 21 0 B
3 Bob 38 1 Bob F
4 0000 44 0 NaN NaN
5 Alice 10 0 Alice C
6 Joe 10 1 Joe C
7 Alice 46 1 Alice A
8 Bob 15 2 Bob B
9 48 1 NaN NaN
PS: actual setup
The setup in the question was incomplete (missing ages
and grades
), so I improvised:
names = ['Bob', 'Joe', '', 'Bob', '0000', 'Alice', 'Joe', 'Alice', 'Bob', '']
ages = np.random.randint(10, 50, len(names))
df1 = pd.DataFrame({'names': names,'ages': ages})
names_in_student_db = [' Bob', ' Joe ', '', ' Bob ', 'Chris', 'Alice', 'Joe ', 'Alice ', ' Bob ', 'Daniel']
grades = np.random.choice(list('ABCDF'), len(names_in_student_db))
df2 = pd.DataFrame({'student_names': names_in_student_db,'grades': grades})
CodePudding user response:
Specify the column name like this
pd.merge(left=df1, right=df2, left_on='names', right_on='student_names', how='left')
depending on your expected result.