Merging two data frames with aggregated columns values as result-CodePudding

DataFrame 1

{'id': [1, 2, 3], 'dept': [101, 102, 103]}

id  dept    ....
1   101     ....  
2   102     ....
3   103     ....

DataFrame 2

{'id': [1, 1, 5], 'region1': ['CUD', 'DAS', 'ITF'], 'region2': ['IOP', 'POL', 'IJK']}

id  region1 region2 ...
1   CUD     IOP     ...
1   DAS     POL     ...
5   ITF     IJK     ...

The Result data frame should be as follows

id      dept    concatinated
1       101     [{region1: 'CUD', region2: 'IOP'},{region1: 'DAS', region2: 'POL', ...}]
2       102     []
3       103     []
null    null    [{region1: 'ITF'}, {region2: 'IJK'}, ...]

Note: Columns of Data frames 1 & 2 are dynamic expect id (can have N number of columns) Is there any way to achieve this result using pandas or NumPy!!! (Optimized solutions are appreciable)

CodePudding user response：

My solution seems a bit of complex I am not sure if there is a simple way of doing this.

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'id': [1, 2, 3 ,2 ,6], 'dept': [101, 102, 103 ,104,106]})
df2 = pd.DataFrame({'id': [1, 1, 5, 7], 'region1': ['CUD', 'DAS', 'ITF', "CUD"], 'region2': ['IOP', 'POL', 'IJK',"IOP"]})

df=df1.merge(df2,how="outer")
df["concatinated"] = df.apply(lambda x:{"region1":x.region1,"region2":x.region2},axis=1)
df=df.groupby(["id","dept"],dropna=False).apply(lambda x:[i for i in x.concatinated if pd.notna(i["region1"])]).reset_index()
df=df[(~df.id.duplicated()) | (df['id'].isnull())]
df.loc[~df.id.isin(df1.id),"id"] = np.nan
df=df.rename(columns={0:"concatinated"})
df

      id    dept    concatinated
0    1.0    101.0   [{'region1': 'CUD', 'region2': 'IOP'}, {'regio...
1    2.0    102.0   []
3    3.0    103.0   []
4    NaN    NaN     [{'region1': 'ITF', 'region2': 'IJK'}]
5    6.0    106.0   []
6    NaN    NaN     [{'region1': 'CUD', 'region2': 'IOP'}]

CodePudding user response：

df2['region_comb'] = df2.apply(lambda row: {col: row[col] for col in df2.columns}, axis=1, result_type='reduce')
df2 = df2.groupby('fid')['region_comb'].apply(list).reset_index(name='merged')
result_df = pd.merge(df2, df1, left_on='fid', right_on='fid', how='outer')

solutions works!!!