I have the following DataFrame that I got after left merging two dfs:
A | B | C_x | D_x | C_y | D_y |
---|---|---|---|---|---|
0 | 1136006640 | NaT | NaT | 2022-11-18 | 2022-11-24 |
1 | 1136030941 | NaT | NaT | 2022-11-30 | 2022-12-09 |
3 | 1136051543 | 2022-11-25 | 2022-12-16 | NaT | NaT |
4 | 1136072241 | NaT | NaT | 2022-12-20 | 2022-12-29 |
How can I join the C and D columns so that the NaT values are replaced by the dates? If the _x columns have data, the _y columns are always NaT, and vice versa. Can I fix this already when merging, or do I have to do it afterwards?
CodePudding user response:
The code below constructs new columns C_z, D_z. To this end, it copies dates from columns C_x, D_x. If there are no dates for a specific row in both columns, it copies dates from C_y, D_y instead.
import pandas as pd
import numpy as np
from io import StringIO
# setup sample data
data = """
A;B;C_x;D_x;C_y;D_y
0;1136006640;NaT;NaT;2022-11-18;2022-11-24
1;1136030941;NaT;NaT;2022-11-30;2022-12-09
3;1136051543;2022-11-25;2022-12-16;NaT;NaT
4;1136072241;NaT;NaT;2022-12-20;2022-12-29
"""
df = pd.read_csv(StringIO(data), sep=';', parse_dates=['C_x','D_x','C_y','D_y'])
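# combine dates: start from the _x columns, then fill rows where both _x values are missing from the _y columns
mask = df[['C_x','D_x']].isnull().all(axis=1)
df['C_z'] = df['C_x']
df['D_z'] = df['D_x']
df.loc[mask,'C_z'] = df['C_y']
df.loc[mask,'D_z'] = df['D_y']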
Result:
A B C_x D_x C_y D_y C_z D_z
0 0 1136006640 NaT NaT 2022-11-18 2022-11-24 2022-11-18 2022-11-24
1 1 1136030941 NaT NaT 2022-11-30 2022-12-09 2022-11-30 2022-12-09
2 3 1136051543 2022-11-25 2022-12-16 NaT NaT 2022-11-25 2022-12-16
3 4 1136072241 NaT NaT 2022-12-20 2022-12-29 2022-12-20 2022-12-29
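A shorter variant of the same idea, as a minimal sketch: since the question states that exactly one of the _x/_y columns holds a date per row, Series.fillna can fill the gaps in each _x column from its _y counterpart. The sample frame below is rebuilt by hand from the question's table, and the C_z, D_z names simply follow the answer above.

```python
import pandas as pd

# Sample frame rebuilt by hand from the table in the question
df = pd.DataFrame({
    "A": [0, 1, 3, 4],
    "B": [1136006640, 1136030941, 1136051543, 1136072241],
    "C_x": pd.to_datetime([None, None, "2022-11-25", None]),
    "D_x": pd.to_datetime([None, None, "2022-12-16", None]),
    "C_y": pd.to_datetime(["2022-11-18", "2022-11-30", None, "2022-12-20"]),
    "D_y": pd.to_datetime(["2022-11-24", "2022-12-09", None, "2022-12-29"]),
})

# For each base name, take the _x value and fall back to _y where _x is NaT
for col in ["C", "D"]:
    df[f"{col}_z"] = df[f"{col}_x"].fillna(df[f"{col}_y"])

print(df[["B", "C_z", "D_z"]])
```

If a row ever had dates in both _x and _y, this would keep the _x value, so it relies on the assumption from the question that this does not happen.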
CodePudding user response:
Use DataFrame.combine_first after removing the suffixes, so that both frames have the same column names:
df1 = df.filter(regex='_x$').rename(columns=lambda x: x.replace('_x',''))
df2 = df.filter(regex='_y$').rename(columns=lambda x: x.replace('_y',''))
df = df.join(df1.combine_first(df2))
print (df)
A B C_x D_x C_y D_y C \
0 0 1136006640 NaT NaT 2022-11-18 2022-11-24 2022-11-18
1 1 1136030941 NaT NaT 2022-11-30 2022-12-09 2022-11-30
2 3 1136051543 2022-11-25 2022-12-16 NaT NaT 2022-11-25
3 4 1136072241 NaT NaT 2022-12-20 2022-12-29 2022-12-20
D
0 2022-11-24
1 2022-12-09
2 2022-12-16
3 2022-12-29
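If only the combined C and D columns are wanted in the end, the suffixed columns can be dropped before joining. The following is a sketch under that assumption; the sample frame is rebuilt by hand from the question, and the names left, right and result are only illustrative.

```python
import pandas as pd

# Sample frame rebuilt by hand from the table in the question
df = pd.DataFrame({
    "A": [0, 1, 3, 4],
    "B": [1136006640, 1136030941, 1136051543, 1136072241],
    "C_x": pd.to_datetime([None, None, "2022-11-25", None]),
    "D_x": pd.to_datetime([None, None, "2022-12-16", None]),
    "C_y": pd.to_datetime(["2022-11-18", "2022-11-30", None, "2022-12-20"]),
    "D_y": pd.to_datetime(["2022-11-24", "2022-12-09", None, "2022-12-29"]),
})

# Strip the two-character suffix so both halves share the names C and D
left = df.filter(regex="_x$").rename(columns=lambda c: c[:-2])
right = df.filter(regex="_y$").rename(columns=lambda c: c[:-2])

# Values from `left` win; gaps are filled from `right`
combined = left.combine_first(right)

# Drop every *_x / *_y column, then attach the combined C and D
suffixed = df.filter(regex="_(x|y)$").columns
result = df.drop(columns=suffixed).join(combined)
print(result)
```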
If the input data look like this:
print (df1)
B C D
0 1136051543 2022-11-25 2022-12-16
print (df2)
B C D
0 1136006640 2022-11-18 2022-11-24
1 1136030941 2022-11-30 2022-12-09
2 1136072241 2022-12-20 2022-12-29
Then use concat instead of merge (if the values of df1.index do not exist in df2.index, merge creates missing values like in the question):
df = pd.concat([df1, df2]).sort_values('B', ignore_index=True)
print (df)
B C D
0 1136006640 2022-11-18 2022-11-24
1 1136030941 2022-11-30 2022-12-09
2 1136051543 2022-11-25 2022-12-16
3 1136072241 2022-12-20 2022-12-29
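On the "can I fix this already when merging" part: one possible approach is to stack the two lookup frames first and then do a single left merge, so no _x/_y columns are created at all. This is only a sketch. The frame called base (holding A and B) is hypothetical, since the thread does not show the original left-hand frame, and it assumes every B value appears in at most one of df1 and df2.

```python
import pandas as pd

# Hypothetical left-hand frame; the thread does not show the real one
base = pd.DataFrame({
    "A": [0, 1, 3, 4],
    "B": [1136006640, 1136030941, 1136051543, 1136072241],
})

# The two frames from the answer above, rebuilt by hand
df1 = pd.DataFrame({
    "B": [1136051543],
    "C": pd.to_datetime(["2022-11-25"]),
    "D": pd.to_datetime(["2022-12-16"]),
})
df2 = pd.DataFrame({
    "B": [1136006640, 1136030941, 1136072241],
    "C": pd.to_datetime(["2022-11-18", "2022-11-30", "2022-12-20"]),
    "D": pd.to_datetime(["2022-11-24", "2022-12-09", "2022-12-29"]),
})

# Stack the lookups, then merge once on the key column B
lookup = pd.concat([df1, df2], ignore_index=True)
result = base.merge(lookup, on="B", how="left")
print(result)
```

If a B value occurred in both df1 and df2, the merge would duplicate that row, so the lookup would need de-duplicating first.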
CodePudding user response:
Can you explain a bit more what you want to do, or even show what the final data set should look like?