I have a dataframe like this:
data = {'SalePrice':[10,10,10,20,20,3,3,1,4,8,8],'HandoverDateA':['2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-03-30','2022-03-30'],'ID': ['Tom', 'Tom','Tom','Joseph','Joseph','Ben','Ben','Eden','Tim','Adam','Adam'], 'Tranche': ['Red', 'Red', 'Red', 'Red','Red','Blue','Blue','Red','Red','Red','Red'],'Totals':[100,100,100,50,50,90,90,70,60,70,70],'Sent':['2022-01-18','2022-02-19','2022-03-14','2022-03-14','2022-04-22','2022-03-03','2022-02-07','2022-01-04','2022-01-10','2022-01-15','2022-03-12'],'Amount':[20,10,14,34,15,60,25,10,10,40,20],'Opened':['2021-12-29','2021-12-29','2021-12-29','2022-12-29','2022-12-29','2021-12-19','2021-12-19','2021-12-29','2021-12-29','2021-12-29','2021-12-29']}
I need to find the sent date which is closest to the HandoverDate. I've seen plenty of examples that work when you give one date to search but here the date I want to be closest to can change for every ID. I have tried to adapt the following:
def nearest(items, pivot):
return min([i for i in items if i <= pivot], key=lambda x: abs(x - pivot))
And also tried to write a loop where I make a dataframe for each ID and use max on the date column then stick them together, but it's incredibly slow!
Thanks for any suggestions :)
CodePudding user response:
IIUC, you can use:
data[['HandoverDateA', 'Sent']] = data[['HandoverDateA', 'Sent']].apply(pd.to_datetime)
out = data.loc[data['HandoverDateA']
.sub(data['Sent']).abs()
.groupby(data['ID']).idxmin()]
Output:
SalePrice HandoverDateA ID Tranche Totals Sent Amount Opened
10 8 2022-03-30 Adam Red 70 2022-03-12 20 2021-12-29
5 3 2022-04-30 Ben Blue 90 2022-03-03 60 2021-12-19
7 1 2022-04-30 Eden Red 70 2022-01-04 10 2021-12-29
4 20 2022-04-30 Joseph Red 50 2022-04-22 15 2022-12-29
8 4 2022-04-30 Tim Red 60 2022-01-10 10 2021-12-29
2 10 2022-04-30 Tom Red 100 2022-03-14 14 2021-12-29
CodePudding user response:
Considering that the goal is to find the sent date which is closest to the HandoverDate
, one approach would be as follows.
First of all, create the dataframe df
from data
with pandas.DataFrame
import pandas as pd
df = pd.DataFrame(data)
Then, make sure that the columns HandoverDateA
and Sent
are of datetime using pandas.to_datetime
df['HandoverDateA'] = pd.to_datetime(df['HandoverDateA'])
df['Sent'] = pd.to_datetime(df['Sent'])
Then, in order to make it more convenient, create a column, diff
, to store the absolute value of the difference between the columns HandoverDateA
and Sent
df['diff'] = (df['HandoverDateA'] - df['Sent']).dt.days.abs()
With that column, one can simply sort by that column as follows
df = df.sort_values(by=['diff'])
[Out]:
SalePrice HandoverDateA ID ... Amount Opened diff
4 20 2022-04-30 Joseph ... 15 2022-12-29 8
10 8 2022-03-30 Adam ... 20 2021-12-29 18
2 10 2022-04-30 Tom ... 14 2021-12-29 47
5 3 2022-04-30 Ben ... 60 2021-12-19 58
8 4 2022-04-30 Tim ... 10 2021-12-29 110
7 1 2022-04-30 Eden ... 10 2021-12-29 116
and the first row is the one where Sent
is closest to HandOverDateA
.
With the column diff
, one option to get the one where diff
is minimum is with pandas.DataFrame.query
as follows
df = df.query('diff == diff.min()')
[Out]:
SalePrice HandoverDateA ID Tranche ... Sent Amount Opened diff
4 20 2022-04-30 Joseph Red ... 2022-04-22 15 2022-12-29 8
Notes:
- For more information on sorting dataframes by columns, read my answer here.