Get earliest date from dataframe columns

Time:05-03

I just started working with DataFrames and I'm struggling with a request:

I have a df with this structure:

id req_date             spec_date
1  2021-09-28T07:19:43  2021-09-24T07:17:02
2  2021-04-28T07:19:43  
3                       2021-02-21T07:14:41

I'm trying to create a new column whose values are the earlier of req_date and spec_date. It works as expected when both columns have values, but if one of them is empty, the new column is also empty. How do I handle this case?

CodePudding user response:

You can use pandas.DataFrame.min along the columns (axis=1). With skipna=False, a missing value in either column makes the result missing as well:

df['col'] = (df[['req_date', 'spec_date']]
             .apply(pd.to_datetime)
             .min(axis=1, skipna=False)
             .dt.strftime('%Y-%m-%dT%H:%M:%S')
)
print(df)

   id             req_date            spec_date                  col
0   1  2021-09-28T07:19:43  2021-09-24T07:17:02  2021-09-24T07:17:02
1   2  2021-04-28T07:19:43                 None                  NaN
2   3                 None  2021-02-21T07:14:41                  NaN

Whether missing values are skipped is controlled by skipna. Leaving it at its default (True) ignores them, so the row keeps whichever date is present:

df['col'] = (df[['req_date', 'spec_date']]
             .apply(pd.to_datetime)
             .min(axis=1)   # <-- difference here
             .dt.strftime('%Y-%m-%dT%H:%M:%S')
)
print(df)

   id             req_date            spec_date                  col
0   1  2021-09-28T07:19:43  2021-09-24T07:17:02  2021-09-24T07:17:02
1   2  2021-04-28T07:19:43                 None  2021-04-28T07:19:43
2   3                 None  2021-02-21T07:14:41  2021-02-21T07:14:41
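For reference, here is a self-contained sketch of the same idea that keeps the result as datetimes instead of formatting it back to strings. The sample frame is rebuilt from the question, and the column name earliest is only an illustrative choice.

```python
import pandas as pd

# Sample data rebuilt from the question; None marks a missing date.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "req_date": ["2021-09-28T07:19:43", "2021-04-28T07:19:43", None],
        "spec_date": ["2021-09-24T07:17:02", None, "2021-02-21T07:14:41"],
    }
)

# Parse both columns; missing values become NaT.
dates = df[["req_date", "spec_date"]].apply(pd.to_datetime)

# skipna=True is the default, so NaT is ignored and a row with a single
# date keeps that date. "earliest" is an illustrative column name.
df["earliest"] = dates.min(axis=1)

print(df)
```

With this approach a row ends up missing only when both dates are missing.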

CodePudding user response:

Since you are looking for the minimum of two dates, you could replace missing values with a sentinel date such as '9999-12-31T00:00:00'. Because the sentinel is later than any real date, the comparison becomes simpler (no need to handle or check for nulls).

# -*- coding: utf-8 -*-
import pandas as pd

# generating a sample dataframe
df = pd.DataFrame([[1, '2021-09-28T07:19:43', '2021-09-24T07:17:02'],
                   [2, '2021-04-28T07:19:43', None],
                   [3, None, '2021-02-21T07:14:41']],
                  columns=['id', 'req_date', 'spec_date'])

# printing content of dataframe
print(df)

# replacing missing values with the sentinel date
df['req_date'] = df['req_date'].fillna('9999-12-31T00:00:00')
df['spec_date'] = df['spec_date'].fillna('9999-12-31T00:00:00')

print(df)
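The snippet above stops after filling the gaps. One possible way to finish the job is sketched below, under some assumptions: the SENTINEL constant and the earliest column are illustrative names, and the values are compared as strings rather than parsed, since a year-9999 date falls outside the range of pandas' default nanosecond-resolution timestamps.

```python
import pandas as pd

# Hypothetical constant holding the sentinel used in the answer above.
SENTINEL = "9999-12-31T00:00:00"

df = pd.DataFrame([[1, "2021-09-28T07:19:43", "2021-09-24T07:17:02"],
                   [2, "2021-04-28T07:19:43", None],
                   [3, None, "2021-02-21T07:14:41"]],
                  columns=["id", "req_date", "spec_date"])

# Fill the gaps on copies so the original columns keep their missing values.
req = df["req_date"].fillna(SENTINEL)
spec = df["spec_date"].fillna(SENTINEL)

# Keep req where it is the earlier (or equal) string, otherwise take spec.
earliest = req.where(req <= spec, spec)

# If both dates were missing, the sentinel survives; turn it back into NaN.
df["earliest"] = earliest.mask(earliest == SENTINEL)

print(df)
```

Working on copies is a design choice here: it avoids leaving the artificial date in req_date and spec_date, where it could later be mistaken for real data.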

CodePudding user response:

While @Ynjxsjmh's answer is the cleanest way to go, you could also take advantage of your date format, which sorts chronologically when compared as strings, to get the min without any conversion.

First fillna with a string that sorts after any digit, for example 'NaT' (or 'X'), then get the min:

df['min'] = df[['req_date', 'spec_date']].fillna('NaT').min(axis=1)

If the missing values are already empty strings, use replace instead:

df['min'] = df[['req_date', 'spec_date']].replace('', 'NaT').min(axis=1)

output:

   id             req_date            spec_date                  min
0   1  2021-09-28T07:19:43  2021-09-24T07:17:02  2021-09-24T07:17:02
1   2  2021-04-28T07:19:43                  NaN  2021-04-28T07:19:43
2   3                  NaN  2021-02-21T07:14:41  2021-02-21T07:14:41
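The string trick relies on two properties that can be checked quickly with the standard library alone: fixed-width ISO 8601 timestamps sort the same way as strings and as dates, and the filler 'NaT' compares greater than any string that starts with a digit. This is only a sanity check on the sample values from the question, not a general proof.

```python
from datetime import datetime

# The three distinct timestamps that appear in the question.
stamps = [
    "2021-09-28T07:19:43",
    "2021-04-28T07:19:43",
    "2021-02-21T07:14:41",
]

# Sorting as plain strings and sorting as parsed datetimes agree.
as_strings = sorted(stamps)
as_dates = sorted(stamps, key=datetime.fromisoformat)
same_order = as_strings == as_dates

# 'N' comes after every digit, so the filler never wins a min().
filler_is_last = min("NaT", "2021-04-28T07:19:43") == "2021-04-28T07:19:43"

print(same_order, filler_is_last)
```

Note that this only holds while every value uses the same zero-padded format. Mixed formats or time zone offsets would need real datetime parsing.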