I'm using Python 3.0 and pandas to clean some data.
I have the following table:
#  Item_ID               Date_1               Date_2
0     1857  2020-11-05 00:00:00  2020-12-05 00:00:00
1     1569  2020-12-09 00:00:00  2021-01-07 00:00:00
2     2569  2020-12-09 00:00:00                  NaN
3     6587  2020-12-09 00:00:00  2021-10-08 00:00:00
4     5236  2020-12-09 00:00:00                    -
Here is the code to create the dataframe, to make it easy to reproduce:
import pandas as pd
d = {'Item_ID': [1857, 1569, 2569, 6587, 5236], 'Date_1': ['2020-11-05 00:00:00', '2020-12-09 00:00:00', '2020-12-09 00:00:00', '2020-12-09 00:00:00', '2020-12-09 00:00:00'], 'Date_2': ['2020-12-05 00:00:00', '2021-01-07 00:00:00', 'NaN', '2021-10-08 00:00:00', '-']}
df = pd.DataFrame(data=d)
I would like to merge the columns 'Date_1' and 'Date_2' using an efficient method, because the dataframe can be very large. The result should look like this:
#  Item_ID               Date_3
0     1857  2020-12-05 00:00:00
1     1569  2021-01-07 00:00:00
2     2569  2020-12-09 00:00:00
3     6587  2021-10-08 00:00:00
4     5236  2020-12-09 00:00:00
The date in 'Date_1' should be replaced by the content of 'Date_2' only if 'Date_2' contains a valid date (regardless of whether it is earlier or later than the date in 'Date_1').
Can this be done with a merge?
CodePudding user response:
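You can build a mask with pandas.to_datetime to check which values in 'Date_2' are valid dates:
# errors='coerce' turns anything that is not a date into NaT, so mask is True where Date_2 is NOT a date
mask = pd.to_datetime(df['Date_2'], errors='coerce').isna()
# keep Date_1 where mask is True, otherwise take Date_2
df['Date_3'] = df['Date_1'].where(mask, df['Date_2'])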
output:
   Item_ID               Date_1               Date_2               Date_3
0     1857  2020-11-05 00:00:00  2020-12-05 00:00:00  2020-12-05 00:00:00
1     1569  2020-12-09 00:00:00  2021-01-07 00:00:00  2021-01-07 00:00:00
2     2569  2020-12-09 00:00:00                  NaN  2020-12-09 00:00:00
3     6587  2020-12-09 00:00:00  2021-10-08 00:00:00  2021-10-08 00:00:00
4     5236  2020-12-09 00:00:00                    -  2020-12-09 00:00:00
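To see what the mask contains, here is a minimal sketch using only the 'Date_2' values from the question. The names date_2 and parsed are introduced here for illustration and are not part of the answer above.

```python
import pandas as pd

# Same Date_2 values as in the question (note that 'NaN' is a string here)
date_2 = pd.Series(['2020-12-05 00:00:00', '2021-01-07 00:00:00', 'NaN',
                    '2021-10-08 00:00:00', '-'])

# Entries that cannot be parsed become NaT instead of raising an error
parsed = pd.to_datetime(date_2, errors='coerce')

# True where Date_2 is NOT a valid date, i.e. where Date_1 has to be kept
mask = parsed.isna()
print(mask.tolist())  # [False, False, True, False, True]
```

Rows 2 and 4 are flagged, which matches the rows where Date_3 falls back to Date_1 in the output above.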
Or, if you want the result to have a datetime dtype:
df['Date_3'] = pd.to_datetime(df['Date_2'], errors='coerce').fillna(df['Date_1'])
output:
   Item_ID               Date_1               Date_2      Date_3
0     1857  2020-11-05 00:00:00  2020-12-05 00:00:00  2020-12-05
1     1569  2020-12-09 00:00:00  2021-01-07 00:00:00  2021-01-07
2     2569  2020-12-09 00:00:00                  NaN  2020-12-09
3     6587  2020-12-09 00:00:00  2021-10-08 00:00:00  2021-10-08
4     5236  2020-12-09 00:00:00                    -  2020-12-09
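If 'Date_1' is still stored as strings, as in the question's sample data, one possible variant is to parse both columns before combining them, so that the dtype of the result does not depend on how fillna treats string values. This is only a sketch under that assumption; the names date_1 and date_2 are introduced here for illustration.

```python
import pandas as pd

d = {'Item_ID': [1857, 1569, 2569, 6587, 5236],
     'Date_1': ['2020-11-05 00:00:00', '2020-12-09 00:00:00', '2020-12-09 00:00:00',
                '2020-12-09 00:00:00', '2020-12-09 00:00:00'],
     'Date_2': ['2020-12-05 00:00:00', '2021-01-07 00:00:00', 'NaN',
                '2021-10-08 00:00:00', '-']}
df = pd.DataFrame(data=d)

# Parse both columns; invalid Date_2 entries become NaT
date_1 = pd.to_datetime(df['Date_1'])
date_2 = pd.to_datetime(df['Date_2'], errors='coerce')

# Prefer Date_2 and fall back to Date_1 where Date_2 is NaT
df['Date_3'] = date_2.fillna(date_1)

# Check that the new column really is a datetime column
print(pd.api.types.is_datetime64_any_dtype(df['Date_3']))
```

This assumes every value in 'Date_1' is a valid date; if that is not guaranteed, errors='coerce' could be passed there as well.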
Alternative output, keeping only the columns listed in keep plus the merged dates:
keep = ['Item_ID']
df[keep].join(pd.to_datetime(df['Date_2'], errors='coerce')
                .fillna(df['Date_1'])
                .rename('Date_3')
              )
   Item_ID      Date_3
0     1857  2020-12-05
1     1569  2021-01-07
2     2569  2020-12-09
3     6587  2021-10-08
4     5236  2020-12-09
CodePudding user response:
You can do something like the following:
from datetime import datetime
import pandas as pd
def datetimeChecker(date):
    # Return True only if the value parses with the expected format
    try:
        datetime.strptime(date, "%Y-%m-%d %H:%M:%S")
        return True
    except (ValueError, TypeError):
        return False
d = {'Item_ID': [1857, 1569, 2569, 6587, 5236], 'Date_1': ['2020-11-05 00:00:00', '2020-12-09 00:00:00', '2020-12-09 00:00:00', '2020-12-09 00:00:00', '2020-12-09 00:00:00'], 'Date_2': ['2020-12-05 00:00:00', '2021-01-07 00:00:00', 'NaN', '2021-10-08 00:00:00', '-']}
df = pd.DataFrame(data=d)
# Take Date_2 when it is a valid date, otherwise fall back to Date_1
df["final_date"] = df.apply(lambda x: x['Date_2'] if datetimeChecker(x['Date_2']) else x['Date_1'], axis=1)
The output looks like this:
   Item_ID               Date_1               Date_2           final_date
0     1857  2020-11-05 00:00:00  2020-12-05 00:00:00  2020-12-05 00:00:00
1     1569  2020-12-09 00:00:00  2021-01-07 00:00:00  2021-01-07 00:00:00
2     2569  2020-12-09 00:00:00                  NaN  2020-12-09 00:00:00
3     6587  2020-12-09 00:00:00  2021-10-08 00:00:00  2021-10-08 00:00:00
4     5236  2020-12-09 00:00:00                    -  2020-12-09 00:00:00
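Since the question mentions a very large dataframe: apply with axis=1 calls the Python function once per row, which is usually slower than column-wise operations. A possible column-wise sketch of the same check is shown below, assuming the same format string as in datetimeChecker. The name valid is introduced here for illustration, and the pandas parser may not behave identically to strptime in every edge case.

```python
import pandas as pd

d = {'Item_ID': [1857, 1569, 2569, 6587, 5236],
     'Date_1': ['2020-11-05 00:00:00', '2020-12-09 00:00:00', '2020-12-09 00:00:00',
                '2020-12-09 00:00:00', '2020-12-09 00:00:00'],
     'Date_2': ['2020-12-05 00:00:00', '2021-01-07 00:00:00', 'NaN',
                '2021-10-08 00:00:00', '-']}
df = pd.DataFrame(data=d)

# Parse the whole column at once with the same format string as datetimeChecker;
# entries that do not match become NaT
valid = pd.to_datetime(df['Date_2'], format='%Y-%m-%d %H:%M:%S', errors='coerce').notna()

# Keep Date_2 where it parsed, otherwise use Date_1 (values stay strings)
df['final_date'] = df['Date_2'].where(valid, df['Date_1'])
print(df)
```

The values in final_date remain strings, as in the apply version, so the result can be compared directly with the output above.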