Home > Software design >  How to calculate number of days until the nearest and since the datest date from dates in list in Py
How to calculate number of days until the nearest and since the datest date from dates in list in Py

Time:07-18

I have Pandas Data Frame in Python like below ("col1" is in datetime64 data format):

col1
--------
23-11-2020
25-05-2021
...

Moreover I have list of special dates like below (values are as "object" data type in list):

special_dates = ["25.11.2020", "23.11.2020", "01.06.2021", "20.05.2021", ...] 

And I need to create 2 more columns in my DataFrame:

  • col2 - numer of days until the nearest date from special_dates list
  • col3 - numer of days since the lates date from special_dates list

Be aware that some months have 31 or 30 days and in delinquent years, February has a different (28 or 29) number of days

So as a result i need something like below:

col1       | col2 | col3
-----------|------|......
23-11-2020 | 2    | 0
25-05-2021 | 7    | 5
...        | ...  | ...

How can I do that in Python Pandas ?

CodePudding user response:

Probably not the best/most efficient way, but you can use the days_between function from this post and then compute the difference between the number of days. This would give you:

import pandas as pd
import numpy as np
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%d-%m-%Y")
    d2 = datetime.strptime(d2, "%d.%m.%Y")
    return (d2 - d1).days

df = pd.DataFrame({'col1':["23-11-2020", "25-05-2021"]})
special_dates = ["25.11.2020", "23.11.2020", "01.06.2021", "20.05.2021"] 

for idx, date in enumerate(df['col1']):
    col2=np.inf
    col3=np.inf
    for special_date in special_dates:
        delta = days_between(date, special_date)
        if delta >= 0 and delta < col2:
            col2 = delta
        if delta <= 0 and delta > -col3:
            col3 = -delta

    df.loc[df.index[idx], 'col2'] = col2
    df.loc[df.index[idx], 'col3'] = col3

df.replace(np.inf, np.nan, inplace=True)
df[['col2','col3']].round(0)

CodePudding user response:

vectorial merge:

s = pd.Series(pd.to_datetime(special_dates, dayfirst=True)).sort_values()
df['col1'] = pd.to_datetime(df['col1'], dayfirst=True)

df['col2'] = (pd.merge_asof(df.sort_values(by='col1'), s.rename('other'),
                            left_on='col1', right_on='other',
                            direction='forward', allow_exact_matches=True)
                ['other']
                .sub(df['col1']).dt.days
                #.fillna(0, downcast='infer') # commented = NaNs if not match
             )

df['col3'] = (pd.merge_asof(df.sort_values(by='col1'), s.rename('other'),
                            left_on='col1', right_on='other',
                            direction='backward', allow_exact_matches=True)
                 ['other']
                .rsub(df['col1']).dt.days
                #.fillna(0, downcast='infer') # comment to have NaNs
             )

output:

        col1  col2  col3
0 2020-11-23     0     0
1 2021-05-25     7     5

older answer (misunderstanding of question)

You can use numpy broadcasting:

special_dates = ["25.11.2020", "23.11.2020", "01.06.2021", "20.05.2021"] 

df['col1'] = pd.to_datetime(df['col1'], dayfirst=True)

a = pd.to_datetime(special_dates, dayfirst=True).to_numpy()
out = (df
       .join(pd.DataFrame((a-df['col1'].to_numpy()[:,None]),
                          index=df.index,
                          columns=range(1, len(special_dates) 1))
               .add_prefix('date_')
               .clip('0')
               #.apply(lambda c: c.dt.days) # uncomment for days as int
             )
      )

output:

        col1 date_1 date_2   date_3   date_4
0 2020-11-23 2 days 0 days 190 days 178 days
1 2021-05-25 0 days 0 days   7 days   0 days

output as integers (last line uncommented):

        col1  date_1  date_2  date_3  date_4
0 2020-11-23       2       0     190     178
1 2021-05-25       0       0       7       0

Variant with dates as headers:

out = (df
       .join(pd.DataFrame((a-df['col1'].to_numpy()[:,None]),
                          index=df.index,
                          columns=special_dates)
               .clip('0')
               .apply(lambda c: c.dt.days)
             )
      )

output:

        col1  25.11.2020  23.11.2020  01.06.2021  20.05.2021
0 2020-11-23           2           0         190         178
1 2021-05-25           0           0           7           0
  • Related