I have Pandas Data Frame in Python like below ("col1" is in datetime64 data format):
col1
--------
23-11-2020
25-05-2021
...
Moreover I have list of special dates like below (values are as "object" data type in list):
special_dates = ["25.11.2020", "23.11.2020", "01.06.2021", "20.05.2021", ...]
And I need to create 2 more columns in my DataFrame:
- col2 - numer of days until the nearest date from special_dates list
- col3 - numer of days since the lates date from special_dates list
Be aware that some months have 31 or 30 days and in delinquent years, February has a different (28 or 29) number of days
So as a result i need something like below:
col1 | col2 | col3
-----------|------|......
23-11-2020 | 2 | 0
25-05-2021 | 7 | 5
... | ... | ...
How can I do that in Python Pandas ?
CodePudding user response:
Probably not the best/most efficient way, but you can use the days_between
function from this post and then compute the difference between the number of days. This would give you:
import pandas as pd
import numpy as np
from datetime import datetime
def days_between(d1, d2):
d1 = datetime.strptime(d1, "%d-%m-%Y")
d2 = datetime.strptime(d2, "%d.%m.%Y")
return (d2 - d1).days
df = pd.DataFrame({'col1':["23-11-2020", "25-05-2021"]})
special_dates = ["25.11.2020", "23.11.2020", "01.06.2021", "20.05.2021"]
for idx, date in enumerate(df['col1']):
col2=np.inf
col3=np.inf
for special_date in special_dates:
delta = days_between(date, special_date)
if delta >= 0 and delta < col2:
col2 = delta
if delta <= 0 and delta > -col3:
col3 = -delta
df.loc[df.index[idx], 'col2'] = col2
df.loc[df.index[idx], 'col3'] = col3
df.replace(np.inf, np.nan, inplace=True)
df[['col2','col3']].round(0)
CodePudding user response:
vectorial merge:
s = pd.Series(pd.to_datetime(special_dates, dayfirst=True)).sort_values()
df['col1'] = pd.to_datetime(df['col1'], dayfirst=True)
df['col2'] = (pd.merge_asof(df.sort_values(by='col1'), s.rename('other'),
left_on='col1', right_on='other',
direction='forward', allow_exact_matches=True)
['other']
.sub(df['col1']).dt.days
#.fillna(0, downcast='infer') # commented = NaNs if not match
)
df['col3'] = (pd.merge_asof(df.sort_values(by='col1'), s.rename('other'),
left_on='col1', right_on='other',
direction='backward', allow_exact_matches=True)
['other']
.rsub(df['col1']).dt.days
#.fillna(0, downcast='infer') # comment to have NaNs
)
output:
col1 col2 col3
0 2020-11-23 0 0
1 2021-05-25 7 5
older answer (misunderstanding of question)
You can use numpy
broadcasting:
special_dates = ["25.11.2020", "23.11.2020", "01.06.2021", "20.05.2021"]
df['col1'] = pd.to_datetime(df['col1'], dayfirst=True)
a = pd.to_datetime(special_dates, dayfirst=True).to_numpy()
out = (df
.join(pd.DataFrame((a-df['col1'].to_numpy()[:,None]),
index=df.index,
columns=range(1, len(special_dates) 1))
.add_prefix('date_')
.clip('0')
#.apply(lambda c: c.dt.days) # uncomment for days as int
)
)
output:
col1 date_1 date_2 date_3 date_4
0 2020-11-23 2 days 0 days 190 days 178 days
1 2021-05-25 0 days 0 days 7 days 0 days
output as integers (last line uncommented):
col1 date_1 date_2 date_3 date_4
0 2020-11-23 2 0 190 178
1 2021-05-25 0 0 7 0
Variant with dates as headers:
out = (df
.join(pd.DataFrame((a-df['col1'].to_numpy()[:,None]),
index=df.index,
columns=special_dates)
.clip('0')
.apply(lambda c: c.dt.days)
)
)
output:
col1 25.11.2020 23.11.2020 01.06.2021 20.05.2021
0 2020-11-23 2 0 190 178
1 2021-05-25 0 0 7 0