I have two dataframes and a rather tricky join to accomplish.
The first dataframe:
import pandas as pd

data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]
df1 = pd.DataFrame(data, columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])
df1
Output:
RuleSetID RuleSetName KeyWordGroupID
0 Standard1 [100, 101, 102]
1 Standard2 [100, 102]
2 Standard3 [103]
... ... ...
The second one:
data = [[100, 'verahren', ['word1', 'word2']],
[101, 'flaechen', ['word3']],
[102, 'nutzung', ['word4', 'word5']],
[103, 'ort', ['word6', 'word7']]]
df2 = pd.DataFrame(data, columns = ['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])
df2
Output:
KeyWordGroupID KeyWordGroupName KeyWords
100 verahren ['word1', 'word2']
101 flaechen ['word3']
102 nutzung ['word4', 'word5']
103 ort ['word6', 'word7']
... ... ...
The desired output:
RuleSetID RuleSetName KeyWordGroupID
0 Standard1 [['word1', 'word2'], ['word3'], ['word4', 'word5']]
1 Standard2 [['word1', 'word2'], ['word4', 'word5']]
2 Standard3 [['word6', 'word7']]
I tried converting the second dataframe into a dictionary using df2.to_dict('records')
and matching on the key values inside a pandas apply user-defined function, but that doesn't seem like a clean approach.
Does someone have an approach to solve this? Any ideas are appreciated.
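For reference, the attempt described above might look roughly like this (a sketch reconstructed from the description; the original code was not posted):

```python
import pandas as pd

data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]
df1 = pd.DataFrame(data, columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])
data = [[100, 'verahren', ['word1', 'word2']],
        [101, 'flaechen', ['word3']],
        [102, 'nutzung', ['word4', 'word5']],
        [103, 'ort', ['word6', 'word7']]]
df2 = pd.DataFrame(data, columns=['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])

# build {KeyWordGroupID: KeyWords} from the list of row dicts
records = df2.to_dict('records')
lookup = {row['KeyWordGroupID']: row['KeyWords'] for row in records}

# user-defined function applied per row: swap each ID for its keyword list
df1['KeyWordGroupID'] = df1['KeyWordGroupID'].apply(lambda ids: [lookup[i] for i in ids])
```

It works, but the detour through to_dict('records') plus a lambda is what the answers below clean up.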
CodePudding user response:
I think you have a couple of different options:
- You can create a dictionary and use map
- You can convert the lists to a string and use replace
Option 1
e = df1.explode('KeyWordGroupID') # explode your frame to one ID per row
# create a dictionary from KeyWords and map it to the KeyWordGroupID
e['KeyWords'] = e['KeyWordGroupID'].map(df2.set_index('KeyWordGroupID')['KeyWords'].to_dict())
# merge df1 with e
new_df = df1.merge(e.groupby('RuleSetID')['KeyWords'].agg(list), right_index=True, left_on='RuleSetID')
RuleSetID RuleSetName KeyWordGroupID \
0 0 Standard1 [100, 101, 102]
1 1 Standard2 [100, 102]
2 2 Standard3 [103]
KeyWords
0 [[word1, word2], [word3], [word4, word5]]
1 [[word1, word2], [word4, word5]]
2 [[word6, word7]]
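Option 2 was not spelled out above; here is a sketch of the string-based version, assuming the frames from the question. It renders each ID list as a string, substitutes every ID with the string form of its keyword list, and parses the result back. Note it relies on no ID being a substring of another, so the dictionary route is the safer choice:

```python
import ast

import pandas as pd

data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]
df1 = pd.DataFrame(data, columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])
data = [[100, 'verahren', ['word1', 'word2']],
        [101, 'flaechen', ['word3']],
        [102, 'nutzung', ['word4', 'word5']],
        [103, 'ort', ['word6', 'word7']]]
df2 = pd.DataFrame(data, columns=['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])

# render each ID list as a string, e.g. '[100, 101, 102]'
s = df1['KeyWordGroupID'].astype(str)

# replace every ID with the string form of its keyword list
for gid, kws in df2.set_index('KeyWordGroupID')['KeyWords'].items():
    s = s.str.replace(str(gid), str(kws), regex=False)

# parse the strings back into nested Python lists
df1['KeyWords'] = s.apply(ast.literal_eval)
```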
CodePudding user response:
@Corralien had a great answer with pandas to solve the problem. But here I want to present a cleaner way to do it with datar, a re-imagination of pandas APIs:
>>> from datar.all import f, unchop, left_join, group_by, summarise
>>>
>>> (
... df1
...     >> unchop(f.KeyWordGroupID)         # put one KeyWordGroupID per row
... >> left_join(df2, by=f.KeyWordGroupID) # Attach df2 by KeyWordGroupIDs
... >> group_by(f.RuleSetID, f.RuleSetName)
... >> summarise(KeyWords = f.KeyWords.agg(pd.Series)) # Concatenate the KeyWords
... )
[2022-03-28 13:52:38][datar][ INFO] `summarise()` has grouped output by ['RuleSetID'] (override with `_groups` argument)
RuleSetID RuleSetName KeyWords
<int64> <object> <object>
0 0 Standard1 [[word1, word2], [word3], [word4, word5]]
1 1 Standard2 [[word1, word2], [word4, word5]]
2 2 Standard3 [word6, word7]
[TibbleGrouped: RuleSetID (n=3)]
Same idea with pandas itself:
(
df1
.explode("KeyWordGroupID")
.merge(df2, how="left", on="KeyWordGroupID")
.groupby(["RuleSetID", "RuleSetName"])
.agg({"KeyWords": pd.Series})
.reset_index()
)
CodePudding user response:
The main idea is to convert df2
as a dict mapping Series
where the key
is the KeyWordGroupID
column and the value
is the KeyWords
column.
You can use explode
to flatten KeyWordGroupID
column of df1
then map
it to df2
then groupby
to reshape your first dataframe:
df1['KeyWordGroupID'] = (
df1['KeyWordGroupID'].explode().map(df2.set_index('KeyWordGroupID')['KeyWords'])
.groupby(level=0).apply(list)
)
print(df1)
# Output
RuleSetID RuleSetName KeyWordGroupID
0 0 Standard1 [[word1, word2], [word3], [word4, word5]]
1 1 Standard2 [[word1, word2], [word4, word5]]
2 2 Standard3 [[word6, word7]]