Home > Software engineering >  Python Pandas- I want to get only columns that meet criteria in past x days- rolling dates
Python Pandas- I want to get only columns that meet criteria in past x days- rolling dates

Time:01-08

So I have a two separate data frames. The first has a list of dates and what international food day it is. Most days have 1 food item, but some can have many.

Date Food
1/1/2022 Cream Puff
1/12/2022 Chicken
3/6/2022 Frozen Food
3/6/2022 Oreo

My second dataframe includes a list of people, foods, and the dates they ate the food

Date Person Food
12/29/2021 Jack Cream Puff
12/30/2021 Pete Cream Puff
1/12/2022 Jill Jello
2/6/2022 Jill Oreo
2/3/2022 Sara Oreo
3/6/2022 Joel Chicken

My goal is for each international food day to pull back everyone who ate that the day of the food day OR within the 5 days prior. I recognize that I can do a for loop with the food day dataframe and then try to find anyone between the date range and store them in a temporary data frame. However, my data set is almost 1 GB for both so the looping is taking forever. Plus looping on data frames is not ideal. Anyone have any guidance for how to do this without a loop?

CodePudding user response:

You will have to evaluate the performance on the 1GB dataset you are referring to, but this is a method in pandas that would alleviate looping outside of the df.

import pandas as pd

food = pd.DataFrame({'Date': ['1/1/2022',
                              '1/12/2022',
                              '3/6/2022',
                              '3/6/2022'],
                    'Food':['Cream Puff',
                            'Chicken',
                            'Frozen Food',
                            'Oreo']})

person_food = pd.DataFrame({'Date':['12/29/2021',
                                    '12/30/2021',
                                    '1/12/2022',
                                    '2/6/2022',
                                    '2/3/2022',
                                    '3/6/2022'],
                            'Food':['Cream Puff',
                                    'Cream Puff',
                                    'Jello',
                                    'Oreo',
                                    'Oreo',
                                    'Chicken'],
                            'Person':['Jack',
                                      'Pete',
                                      'Jill',
                                      'Jill',
                                      'Sara',
                                      'Joel']})

## Create a column for five days prior in your food df
food['end_date'] = pd.to_datetime(food['Date'])
food['start_date'] = food['end_date'] - pd.Timedelta(days = 5)
person_food['ate_date'] = pd.to_datetime(person_food['Date'])

## Join the dfs where the date the person ate the food was between the desired dates 
df = pd.merge(food,
              person_food,
              how = 'left',
              on = 'Food',
              )\
    .where((df['ate_date']<=df['end_date']) & (df['ate_date']>=df['start_date']))

## return the names of the people where the Person is not null
df[df['Person'].isna()==False]['Person']

CodePudding user response:

You could try using pd.merge_asof which allows you to merge on datetime index allowing for a tolerance parameter.

df1 = {'Date': {0: Timestamp('2022-01-01 00:00:00'),
  1: Timestamp('2022-01-12 00:00:00'),
  2: Timestamp('2022-03-06 00:00:00'),
  3: Timestamp('2022-03-06 00:00:00')},
 'Food': {0: 'Cream Puff', 1: 'Chicken', 2: 'Frozen Food', 3: 'Oreo'}}

df2 = {'Date': {0: Timestamp('2021-12-29 00:00:00'),
  1: Timestamp('2021-12-30 00:00:00'),
  2: Timestamp('2022-01-12 00:00:00'),
  3: Timestamp('2022-02-06 00:00:00'),
  4: Timestamp('2022-02-03 00:00:00'),
  5: Timestamp('2022-03-06 00:00:00')},
 'Person': {0: 'Jack', 1: 'Pete', 2: 'Jill', 3: 'Jill', 4: 'Sara', 5: 'Joel'},
 'Food': {0: 'Cream Puff',
  1: 'Cream Puff',
  2: 'Jello',
  3: 'Oreo',
  4: 'Oreo',
  5: 'Chicken'}}


# Convert the 'Date' columns to datetime
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

# Use pd.merge_asof 
out = pd.merge_asof(df2.sort_values(by='Date'), df1.sort_values(by='Date'), 
              on='Date', by='Food',
              tolerance=pd.Timedelta(days=5), direction='backward')

>>>out
        Date Person        Food
0 2021-12-29   Jack  Cream Puff
1 2021-12-30   Pete  Cream Puff
2 2022-01-12   Jill       Jello
3 2022-02-03   Sara        Oreo
4 2022-02-06   Jill        Oreo
5 2022-03-06   Joel     Chicken

The dataframe returns all the names, dates, and foods eaten within 5 days of the Food-day

  • Related