I just started working with DataFrames and I'm struggling with a request:
I have a df with this structure:
id  req_date             spec_date
1   2021-09-28T07:19:43  2021-09-24T07:17:02
2   2021-04-28T07:19:43
3                        2021-02-21T07:14:41
I'm trying to create a new column whose values are the earlier of req_date and spec_date. It works as expected when both columns have values, but if one of them is empty, the new column is also empty. How do I handle this case?
CodePudding user response:
You can try pandas.DataFrame.min along the columns (axis=1) with skipna=False, which leaves the result empty whenever either date is missing:
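import pandas as pd

# Parse both columns as datetimes, take the row-wise minimum, then format back to text
df['col'] = (df[['req_date', 'spec_date']]
             .apply(pd.to_datetime)
             .min(axis=1, skipna=False)
             .dt.strftime('%Y-%m-%dT%H:%M:%S')
             )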
print(df)
   id             req_date            spec_date                  col
0   1  2021-09-28T07:19:43  2021-09-24T07:17:02  2021-09-24T07:17:02
1   2  2021-04-28T07:19:43                 None                  NaN
2   3                 None  2021-02-21T07:14:41                  NaN
If you would rather ignore missing values, leave skipna at its default of True:
df['col'] = (df[['req_date', 'spec_date']]
             .apply(pd.to_datetime)
             .min(axis=1)  # <-- difference here
             .dt.strftime('%Y-%m-%dT%H:%M:%S')
             )
print(df)
   id             req_date            spec_date                  col
0   1  2021-09-28T07:19:43  2021-09-24T07:17:02  2021-09-24T07:17:02
1   2  2021-04-28T07:19:43                 None  2021-04-28T07:19:43
2   3                 None  2021-02-21T07:14:41  2021-02-21T07:14:41
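For anyone who wants to run both variants side by side, here is a minimal self-contained sketch. The sample frame is rebuilt from the question, and the column names earliest and earliest_strict are made up for illustration. It keeps the results as datetimes instead of formatting them back to strings.

```python
import pandas as pd

# Sample data mirroring the question; None marks a missing date.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "req_date": ["2021-09-28T07:19:43", "2021-04-28T07:19:43", None],
        "spec_date": ["2021-09-24T07:17:02", None, "2021-02-21T07:14:41"],
    }
)

# Parse both columns once so the comparison happens on real timestamps.
dates = df[["req_date", "spec_date"]].apply(pd.to_datetime)

# skipna=True (the default) ignores missing values row by row.
df["earliest"] = dates.min(axis=1)

# skipna=False propagates a missing value to the result instead.
df["earliest_strict"] = dates.min(axis=1, skipna=False)

print(df[["id", "earliest", "earliest_strict"]])
```

Keeping the datetime dtype is a design choice: it is handy if the column feeds further date arithmetic, and .dt.strftime can still be applied at the end if text is needed.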
CodePudding user response:
Since you are looking for the minimum of two dates, you could replace missing values with a sentinel date such as '9999-12-31T00:00:00'. That simplifies the later comparison, because you no longer need to handle or check for nulls.
# -*- coding: utf-8 -*-
import pandas as pd
# generating a sample dataframe
df = pd.DataFrame([[1, '2021-09-28T07:19:43', '2021-09-24T07:17:02'],
                   [2, '2021-04-28T07:19:43', None],
                   [3, None, '2021-02-21T07:14:41']],
                  columns=['id', 'req_date', 'spec_date'])
# printing content of dataframe
print(df)
# replacing missing dates with the sentinel value
df['req_date'] = df['req_date'].fillna('9999-12-31T00:00:00')
df['spec_date'] = df['spec_date'].fillna('9999-12-31T00:00:00')
print(df)
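The snippet above stops after filling the gaps. One possible way to finish the job is sketched below, under two assumptions: the columns stay as strings, and the names SENTINEL, filled and earliest are hypothetical. Staying with strings matters here because the year 9999 lies beyond the range of nanosecond-resolution timestamps (pd.Timestamp.max falls in 2262), so parsing the sentinel may fail depending on the pandas version.

```python
import pandas as pd

SENTINEL = "9999-12-31T00:00:00"  # placeholder date taken from the answer above

df = pd.DataFrame(
    [[1, "2021-09-28T07:19:43", "2021-09-24T07:17:02"],
     [2, "2021-04-28T07:19:43", None],
     [3, None, "2021-02-21T07:14:41"]],
    columns=["id", "req_date", "spec_date"],
)

# Work on a filled copy so the original columns keep their missing values.
filled = df[["req_date", "spec_date"]].fillna(SENTINEL)

# Element-wise string comparison: keep req_date where it is the earlier
# (or equal) value, otherwise fall back to spec_date.
earliest = filled["req_date"].where(
    filled["req_date"] <= filled["spec_date"], filled["spec_date"]
)

# If both dates were missing, the sentinel is still there; mark it missing again.
df["earliest"] = earliest.mask(earliest == SENTINEL)

print(df)
```

Filling a copy rather than the original columns avoids leaving the placeholder date in the data, where it could be mistaken for a real value later.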
CodePudding user response:
While @Ynjxsjmh's answer is the cleanest way to go, you could also take advantage of the fact that your date format sorts correctly as a string, and get the min without any conversion.
First fillna with a string that sorts after any digit, for example 'NaT' (or 'X'), then take the min:
df['min'] = df[['req_date', 'spec_date']].fillna('NaT').min(axis=1)
If you already have empty strings, use replace instead:
df['min'] = df[['req_date', 'spec_date']].replace('', 'NaT').min(axis=1)
Output:
   id             req_date            spec_date                  min
0   1  2021-09-28T07:19:43  2021-09-24T07:17:02  2021-09-24T07:17:02
1   2  2021-04-28T07:19:43                  NaN  2021-04-28T07:19:43
2   3                  NaN  2021-02-21T07:14:41  2021-02-21T07:14:41
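The string trick relies on two properties: fixed-width, zero-padded timestamps in this format sort the same way as text and as dates, and the placeholder sorts after every real value. Here is a small check of both, using only the sample timestamps from the question (the variable names are made up for illustration):

```python
from datetime import datetime

# The four timestamps that appear in the question's sample data.
stamps = [
    "2021-09-28T07:19:43",
    "2021-09-24T07:17:02",
    "2021-04-28T07:19:43",
    "2021-02-21T07:14:41",
]

# Sort once as plain text and once as parsed datetimes.
as_text = sorted(stamps)
as_dates = sorted(stamps, key=lambda s: datetime.strptime(s, "%Y-%m-%dT%H:%M:%S"))

# Both orders agree because every field is zero-padded and ordered
# from most significant (year) to least significant (second).
same_order = as_text == as_dates

# 'NaT' starts with a letter, and letters compare greater than digits,
# so the placeholder never wins a min() against a real timestamp.
placeholder_sorts_last = all(s < "NaT" for s in stamps)

print(same_order, placeholder_sorts_last)
```

This only holds while every value uses the same format. Mixed formats, missing zero padding, or differing time zone suffixes would break the ordering, and converting with pd.to_datetime is the safer route in those cases.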