So I have a two separate data frames. The first has a list of dates and what international food day it is. Most days have 1 food item, but some can have many.
Date | Food |
---|---|
1/1/2022 | Cream Puff |
1/12/2022 | Chicken |
3/6/2022 | Frozen Food |
3/6/2022 | Oreo |
My second dataframe includes a list of people, foods, and the dates they ate the food
Date | Person | Food |
---|---|---|
12/29/2021 | Jack | Cream Puff |
12/30/2021 | Pete | Cream Puff |
1/12/2022 | Jill | Jello |
2/6/2022 | Jill | Oreo |
2/3/2022 | Sara | Oreo |
3/6/2022 | Joel | Chicken |
My goal is for each international food day to pull back everyone who ate that the day of the food day OR within the 5 days prior. I recognize that I can do a for loop with the food day dataframe and then try to find anyone between the date range and store them in a temporary data frame. However, my data set is almost 1 GB for both so the looping is taking forever. Plus looping on data frames is not ideal. Anyone have any guidance for how to do this without a loop?
CodePudding user response:
You will have to evaluate the performance on the 1GB dataset you are referring to, but this is a method in pandas that would alleviate looping outside of the df.
import pandas as pd
food = pd.DataFrame({'Date': ['1/1/2022',
'1/12/2022',
'3/6/2022',
'3/6/2022'],
'Food':['Cream Puff',
'Chicken',
'Frozen Food',
'Oreo']})
person_food = pd.DataFrame({'Date':['12/29/2021',
'12/30/2021',
'1/12/2022',
'2/6/2022',
'2/3/2022',
'3/6/2022'],
'Food':['Cream Puff',
'Cream Puff',
'Jello',
'Oreo',
'Oreo',
'Chicken'],
'Person':['Jack',
'Pete',
'Jill',
'Jill',
'Sara',
'Joel']})
## Create a column for five days prior in your food df
food['end_date'] = pd.to_datetime(food['Date'])
food['start_date'] = food['end_date'] - pd.Timedelta(days = 5)
person_food['ate_date'] = pd.to_datetime(person_food['Date'])
## Join the dfs where the date the person ate the food was between the desired dates
df = pd.merge(food,
person_food,
how = 'left',
on = 'Food',
)\
.where((df['ate_date']<=df['end_date']) & (df['ate_date']>=df['start_date']))
## return the names of the people where the Person is not null
df[df['Person'].isna()==False]['Person']
CodePudding user response:
You could try using pd.merge_asof
which allows you to merge on datetime index allowing for a tolerance parameter.
df1 = {'Date': {0: Timestamp('2022-01-01 00:00:00'),
1: Timestamp('2022-01-12 00:00:00'),
2: Timestamp('2022-03-06 00:00:00'),
3: Timestamp('2022-03-06 00:00:00')},
'Food': {0: 'Cream Puff', 1: 'Chicken', 2: 'Frozen Food', 3: 'Oreo'}}
df2 = {'Date': {0: Timestamp('2021-12-29 00:00:00'),
1: Timestamp('2021-12-30 00:00:00'),
2: Timestamp('2022-01-12 00:00:00'),
3: Timestamp('2022-02-06 00:00:00'),
4: Timestamp('2022-02-03 00:00:00'),
5: Timestamp('2022-03-06 00:00:00')},
'Person': {0: 'Jack', 1: 'Pete', 2: 'Jill', 3: 'Jill', 4: 'Sara', 5: 'Joel'},
'Food': {0: 'Cream Puff',
1: 'Cream Puff',
2: 'Jello',
3: 'Oreo',
4: 'Oreo',
5: 'Chicken'}}
# Convert the 'Date' columns to datetime
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
# Use pd.merge_asof
out = pd.merge_asof(df2.sort_values(by='Date'), df1.sort_values(by='Date'),
on='Date', by='Food',
tolerance=pd.Timedelta(days=5), direction='backward')
>>>out
Date Person Food
0 2021-12-29 Jack Cream Puff
1 2021-12-30 Pete Cream Puff
2 2022-01-12 Jill Jello
3 2022-02-03 Sara Oreo
4 2022-02-06 Jill Oreo
5 2022-03-06 Joel Chicken
The dataframe returns all the names, dates, and foods eaten within 5 days of the Food-day