Pandas Dataframe replace NaN values of column A with values from column B

Time:01-19

I have the following DataFrame, which I got after left-merging two DataFrames:

A           B        C_x        D_x        C_y        D_y
0  1136006640        NaT        NaT 2022-11-18 2022-11-24
1  1136030941        NaT        NaT 2022-11-30 2022-12-09
3  1136051543 2022-11-25 2022-12-16        NaT        NaT
4  1136072241        NaT        NaT 2022-12-20 2022-12-29

How can I join the C and D columns so that the NaT values are replaced by the dates? If the _x columns have data, the _y columns are always NaT, and vice versa. Can I fix this during the merge, or do I have to do it afterwards?

CodePudding user response:

The code below constructs new columns C_z and D_z. It copies the dates from columns C_x and D_x; for rows where both of these columns are empty, it copies the dates from C_y and D_y instead.

import pandas as pd
import numpy as np

from io import StringIO
    
# setup sample data
data = """  
A;B;C_x;D_x;C_y;D_y
0;1136006640;NaT;NaT;2022-11-18;2022-11-24
1;1136030941;NaT;NaT;2022-11-30;2022-12-09
3;1136051543;2022-11-25;2022-12-16;NaT;NaT
4;1136072241;NaT;NaT;2022-12-20;2022-12-29
"""
df = pd.read_csv(StringIO(data), sep=';', parse_dates=['C_x','D_x','C_y','D_y'])

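# combine dates: start from the _x columns
df['C_z'] = df['C_x']
df['D_z'] = df['D_x']
# where both _x dates are missing, fall back to the _y columns
mask = df[['C_x','D_x']].isnull().all(axis=1)
df.loc[mask,'C_z'] = df.loc[mask,'C_y']
df.loc[mask,'D_z'] = df.loc[mask,'D_y']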

Result:

   A           B        C_x        D_x        C_y        D_y        C_z        D_z
0  0  1136006640        NaT        NaT 2022-11-18 2022-11-24 2022-11-18 2022-11-24
1  1  1136030941        NaT        NaT 2022-11-30 2022-12-09 2022-11-30 2022-12-09
2  3  1136051543 2022-11-25 2022-12-16        NaT        NaT 2022-11-25 2022-12-16
3  4  1136072241        NaT        NaT 2022-12-20 2022-12-29 2022-12-20 2022-12-29
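
The same fallback can also be written with Series.fillna, which fills the missing entries of one column from another column aligned on the index. This is only a minimal sketch: the two-row sample frame and the target column names C and D are made up for illustration, and it assumes, as stated in the question, that each row has dates in either the _x or the _y pair.

```python
import pandas as pd

# illustrative sample: per row, either the _x or the _y pair holds the dates
df = pd.DataFrame({
    'B': [1136006640, 1136051543],
    'C_x': pd.to_datetime([None, '2022-11-25']),
    'D_x': pd.to_datetime([None, '2022-12-16']),
    'C_y': pd.to_datetime(['2022-11-18', None]),
    'D_y': pd.to_datetime(['2022-11-24', None]),
})

# take the _x value and fall back to the _y value where _x is NaT
for col in ['C', 'D']:
    df[col] = df[col + '_x'].fillna(df[col + '_y'])

# drop the suffixed columns once they are combined
df = df.drop(columns=['C_x', 'D_x', 'C_y', 'D_y'])
print(df)
```

Unlike the mask approach, fillna works column by column, so a row with a date in C_x but not in D_x would take C from _x and D from _y.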

CodePudding user response:

Use DataFrame.combine_first after removing the suffixes, so that both frames have the same column names:

df1 = df.filter(regex='_x$').rename(columns=lambda x: x.replace('_x',''))
df2 = df.filter(regex='_y$').rename(columns=lambda x: x.replace('_y',''))

df = df.join(df1.combine_first(df2))
print (df)
   A           B        C_x        D_x        C_y        D_y          C          D
0  0  1136006640        NaT        NaT 2022-11-18 2022-11-24 2022-11-18 2022-11-24
1  1  1136030941        NaT        NaT 2022-11-30 2022-12-09 2022-11-30 2022-12-09
2  3  1136051543 2022-11-25 2022-12-16        NaT        NaT 2022-11-25 2022-12-16
3  4  1136072241        NaT        NaT 2022-12-20 2022-12-29 2022-12-20 2022-12-29
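
If the suffixed columns are no longer needed afterwards, they can be dropped in the same step. The following is a rough, self-contained sketch of that variation, not part of the original answer: the sample frame and the names left, right, combined and out are made up for illustration, and the suffix is cut off by position so that an "_x" in the middle of a column name would stay untouched.

```python
import pandas as pd

# illustrative frame with the same column layout as in the question
df = pd.DataFrame({
    'B': [1136006640, 1136051543],
    'C_x': pd.to_datetime([None, '2022-11-25']),
    'D_x': pd.to_datetime([None, '2022-12-16']),
    'C_y': pd.to_datetime(['2022-11-18', None]),
    'D_y': pd.to_datetime(['2022-11-24', None]),
})

# split into the _x and _y halves and cut off the two-character suffix
left = df.filter(regex='_x$').rename(columns=lambda c: c[:-2])
right = df.filter(regex='_y$').rename(columns=lambda c: c[:-2])

# values from left win; gaps are filled from right, aligned on index and column name
combined = left.combine_first(right)

# keep the non-suffixed columns and attach the combined ones
out = df.drop(columns=df.filter(regex='_[xy]$').columns).join(combined)
print(out)
```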

If the input data look like this:

print (df1)
            B          C          D
0  1136051543 2022-11-25 2022-12-16

print (df2)
            B          C          D
0  1136006640 2022-11-18 2022-11-24
1  1136030941 2022-11-30 2022-12-09
2  1136072241 2022-12-20 2022-12-29

Then use concat instead of merge (this applies if the rows of df1 do not exist in df2, which is why merge creates missing values like those in the question):

df = pd.concat([df1, df2]).sort_values('B', ignore_index=True)
print (df)
            B          C          D
0  1136006640 2022-11-18 2022-11-24
1  1136030941 2022-11-30 2022-12-09
2  1136051543 2022-11-25 2022-12-16
3  1136072241 2022-12-20 2022-12-29
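
For reference, here is a self-contained version of the concat approach, with df1 and df2 typed in from the printed frames. The guard on column B is an assumption: it is one reading of the condition above, namely that no key occurs in both frames.

```python
import pandas as pd

# the two inputs, typed in from the printed frames
df1 = pd.DataFrame({
    'B': [1136051543],
    'C': pd.to_datetime(['2022-11-25']),
    'D': pd.to_datetime(['2022-12-16']),
})
df2 = pd.DataFrame({
    'B': [1136006640, 1136030941, 1136072241],
    'C': pd.to_datetime(['2022-11-18', '2022-11-30', '2022-12-20']),
    'D': pd.to_datetime(['2022-11-24', '2022-12-09', '2022-12-29']),
})

# assumed precondition: stacking is only safe if no key occurs in both frames
assert not df1['B'].isin(df2['B']).any()

# stack the rows, then order by key and renumber the index
df = pd.concat([df1, df2]).sort_values('B', ignore_index=True)
print(df)
```

If some keys did occur in both frames, concat would produce duplicate rows for them, and combining after the merge as shown earlier would be the safer route.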

CodePudding user response:

Can you explain a bit more what you want to do, or even show what the final data set should look like?
