I have these dataframes:
import pandas as pd
import numpy as np
from functools import reduce
a = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'A_val': [0.1, np.nan, 0.3, np.nan, 0.5], 'B_val': [1.233, np.nan, 1.4, np.nan, 1.9]})
b = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'A_val': [np.nan, 0.2, np.nan, 0.4, np.nan], 'B_val': [np.nan, 1.56, np.nan, 1.1, np.nan]})
c = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'C_val': [121, np.nan, 334, np.nan, 555], 'D_val': [10.233, np.nan, 10.4, np.nan, 10.9]})
d = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'C_val': [np.nan, 322, np.nan, 454, np.nan], 'D_val': [np.nan, 10.56, np.nan, 10.1, np.nan]})
I am dropping the NaN values:
a.dropna(inplace=True)
b.dropna(inplace=True)
c.dropna(inplace=True)
d.dropna(inplace=True)
And then, I want to merge them and have this result:
id gr_code A_val B_val C_val D_val
1 121 0.1 1.233 121.0 10.233
2 121 0.2 1.56 322 10.56
3 134 0.3 1.400 334.0 10.400
4 155 0.4 1.10 454.0 10.10
5 156 0.5 1.900 555.0 10.900
but whatever I try, it introduces NaN values.
For example:
df = pd.concat([a, b, c, d], axis=1)
df = df.loc[:, ~df.columns.duplicated()]
gives:
id gr_code A_val B_val C_val D_val
1.0 121.0 0.1 1.233 121.0 10.233
3.0 134.0 0.3 1.400 334.0 10.400
5.0 156.0 0.5 1.900 555.0 10.900
NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN
If I try:
df_list = [a, b, c, d]
df = reduce(lambda left, right: pd.merge(left, right,
on=['id', 'gr_code'],
how='outer'), df_list)
it gives:
id gr_code A_val_x B_val_x A_val_y B_val_y C_val_x D_val_x C_val_y D_val_y
1 121 0.1 1.233 NaN NaN 121.0 10.233 NaN NaN
3 134 0.3 1.400 NaN NaN 334.0 10.400 NaN NaN
5 156 0.5 1.900 NaN NaN 555.0 10.900 NaN NaN
2 121 NaN NaN 0.2 1.56 NaN NaN 322.0 10.56
4 155 NaN NaN 0.4 1.10 NaN NaN 454.0 10.10
More dataframes:
e = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'E_val': [0.11, np.nan, 0.13, np.nan, 0.35], 'F_val': [11.233, np.nan, 11.4, np.nan, 11.9]})
f = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'E_val': [np.nan, 3222, np.nan, 4541, np.nan], 'F_val': [np.nan, 110.56, np.nan, 101.1, np.nan]})
CodePudding user response:
You can use concat
and then collapse the duplicated columns by taking the first non-NaN value per column name:
df_list = [a, b, c, d]
df = (pd.concat([frame.set_index(['id', 'gr_code']) for frame in df_list], axis=1)
      .groupby(level=0, axis=1).first()  # first non-NaN value per column name
      .reset_index())
Output:
id gr_code A_val B_val C_val D_val
0 1 121 0.1 1.233 121.0 10.233
1 2 121 0.2 1.560 322.0 10.560
2 3 134 0.3 1.400 334.0 10.400
3 4 155 0.4 1.100 454.0 10.100
4 5 156 0.5 1.900 555.0 10.900
CodePudding user response:
It looks like you can make this work with a merge of two concats:
pd.concat([a, b], axis=0).merge(pd.concat([c, d], axis=0), on=['id', 'gr_code'])
Output for your sample data:
id gr_code A_val B_val C_val D_val
0 1 121 0.1 1.233 121.0 10.233
1 3 134 0.3 1.400 334.0 10.400
2 5 156 0.5 1.900 555.0 10.900
3 2 121 0.2 1.560 322.0 10.560
4 4 155 0.4 1.100 454.0 10.100
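Assuming the extra frames e and f pair up the same way (after dropna, every id/gr_code key is present exactly once across each pair), this extends by chaining one more merge. A sketch:

```python
import pandas as pd
import numpy as np

# Same frames as in the question, with NaNs dropped.
keys = {'id': [1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156]}
a = pd.DataFrame({**keys, 'A_val': [0.1, np.nan, 0.3, np.nan, 0.5],
                  'B_val': [1.233, np.nan, 1.4, np.nan, 1.9]}).dropna()
b = pd.DataFrame({**keys, 'A_val': [np.nan, 0.2, np.nan, 0.4, np.nan],
                  'B_val': [np.nan, 1.56, np.nan, 1.1, np.nan]}).dropna()
c = pd.DataFrame({**keys, 'C_val': [121, np.nan, 334, np.nan, 555],
                  'D_val': [10.233, np.nan, 10.4, np.nan, 10.9]}).dropna()
d = pd.DataFrame({**keys, 'C_val': [np.nan, 322, np.nan, 454, np.nan],
                  'D_val': [np.nan, 10.56, np.nan, 10.1, np.nan]}).dropna()
e = pd.DataFrame({**keys, 'E_val': [0.11, np.nan, 0.13, np.nan, 0.35],
                  'F_val': [11.233, np.nan, 11.4, np.nan, 11.9]}).dropna()
f = pd.DataFrame({**keys, 'E_val': [np.nan, 3222, np.nan, 4541, np.nan],
                  'F_val': [np.nan, 110.56, np.nan, 101.1, np.nan]}).dropna()

# Stack each pair of complementary frames, then merge the stacks on the keys.
df = (pd.concat([a, b])
      .merge(pd.concat([c, d]), on=['id', 'gr_code'])
      .merge(pd.concat([e, f]), on=['id', 'gr_code']))
print(df)
```

Note this relies on each pair covering every key exactly once; a key missing from one pair would silently drop that row under the default inner merge.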