merge dfs that have similar column values-CodePudding

I have 2 large pandas dfs, where the first one contains various IDs a specific student. The other one has one or more of the IDs that match. I want to merge the dfs such that it will combine based on if df2 has a value found in df1.

DF1:

ID                 name
nfi23, wjm348      sally
hji21, arb128      joe
mbi13, ybm328      mary

DF2

ID                       grade        
nfi23                    twelfth       
arb128, hji21            third
mbi13                    fourth

desired:

ID                 name         grade
nfi23, wjm348      sally        twelfth
hji21, arb128      joe          third
mbi13, ybm328      mary         fourth

CodePudding user response：

# Break the ids apart.
df1.ID = df1.ID.str.split(', ')
df2.ID = df2.ID.str.split(', ')
df1 = df1.explode('ID')
df2 = df2.explode('ID')

# Merge on ID
out = (df1.merge(df2, 'left')
          .groupby('name') # Bring it all back together~
          .agg(set)
          .applymap(lambda x: ', '.join(i for i in x if isinstance(i, str)))
          .reset_index())
print(out)

Output:

    name             ID    grade
0    joe  hji21, arb128    third
1   mary  ybm328, mbi13   fourth
2  sally  nfi23, wjm348  twelfth

CodePudding user response：

Try with explode then we can do map

df2['ID'] = df2['ID'].str.split(',')
df2 = df2.explode('ID')
df1['grade'] = df1.ID.str.split(',').explode().map(df2.set_index('ID')['grade']).groupby(level=0).first()
df1
Out[774]: 
             ID   name    grade
0  nfi23,wjm348  sally  twelfth
1  hji21,arb128    joe    third
2  mbi13,ybm328   mary   fourth

CodePudding user response：

df = pd.DataFrame(columns=['ID', 'name', 'grade'])
for i, (ids, name) in df1.iterrows():
    for j, ids2 in df2.ID.iteritems():
        if len(set(ids) & set(ids2)) > 0:
            break
    df.loc[i] = ids, name, df2.loc[j, 'grade']
print(df)

prints

index	ID	name	grade
0	nfi23,wjm348	sally	twelfth
1	hji21,arb128	joe	third
2	mbi13,ybm328	mary	fourth

CodePudding user response：

You may use pd.Series.map after making a series out of split list in df1's ID and then take the first value.

df2['ID'] = df2['ID'].str.split(',')
df2 = df2.explode('ID')
df2['ID'] = df2['ID'].str.strip()
d = df2.set_index('ID')['grade']
df1['grade'] = df1['ID'].apply(lambda x: pd.Series(x.split(',')).map(d)[0])

Output:

               ID   name    grade
0  nfi123, wjm348  sally  twelfth
1   hji21, arb128    joe    third
2   mbi13, ybm328   mary   fourth