I have a data frame df1
that looks like the following:
StartDate EndDate
2017-01-10 2017-03-31
2017-04-10 2017-06-30
2017-07-24 2017-09-18
2017-09-30 2017-12-04
and df2
that looks like this:
RecordDate Grade
2017-02-10 93
2017-04-10 88
2017-09-18 89
2018-02-30 96
How can I merge the df2['Grade']
to df1
for the rows that has a df2['RecordDate']
between the duration of df1['StartDate']
and df1['EndDate']
so that it can look like this:
StartDate EndDate RecordDate Grade
2017-01-10 2017-03-31 2017-02-10 93
2017-04-10 2017-06-30 2017-04-10 88
2017-07-24 2017-09-18 2017-09-18 89
2017-09-30 2017-12-04 nan nan
I tried the following code:
# removing time for df1 dates
df1['StartDate'] = pd.to_datetime((pd.to_datetime(df1['StartDate']).dt.date)
df1['EndDate'] = pd.to_datetime((pd.to_datetime(df1['EndDate']).dt.date)
df2['RecordDate'] = pd.to_datetime(df2['RecordDate'])
df1.index = pd.IntervalIndex.from_arrays(df1['StartDate'],df1['EndDate'],closed='both')
df1['Grade'] = df2['RecordDate'].apply(lambda x : df1.iloc[df1index.get_loc(x, tolerance = 3)]['Grade'])
But I receive an error that looks like the following:
TypeError: category, object, and string subtypes are not supported for IntervalIndex
CodePudding user response:
you can use pandasql for it
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
q = """SELECT
df.StartDate, df.EndDate,
df2.RecordDate, df2.Grade
FROM
df
LEFT JOIN
df2
ON df2.RecordDate >= df.StartDate
and df2.RecordDate <= df.EndDate;"""
pysqldf(q)
StartDate EndDate RecordDate Grade
0 2017-01-10 2017-03-31 2017-02-10 93.0
1 2017-04-10 2017-06-30 2017-04-10 88.0
2 2017-07-24 2017-09-18 2017-09-18 89.0
3 2017-09-30 2017-12-04 None NaN
CodePudding user response:
Here is a solution using numpy
, which runs faster than pandasql
in python.
# 3 day tolerance when matching
tolerance = np.timedelta64(3,'D')
start = df1.StartDate.values
end = df1.EndDate.values
target = df2.RecordDate.values
i, j = np.where((target[:, None] >= start-tolerance) & (target[:, None] <= end tolerance))
df = pd.concat([
df2.iloc[i, :].reset_index(drop=True),
df1.iloc[j, :].reset_index(drop=True)
], axis=1)
StartDate EndDate RecordDate Grade
0 2017-01-10 2017-03-31 2017-02-10 93.0
1 2017-04-10 2017-06-30 2017-04-10 88.0
2 2017-07-24 2017-09-18 2017-09-18 89.0