Home > OS >  How to check if the dataframe contains 3 previous consecutive dates from a list of dates given per g
How to check if the dataframe contains 3 previous consecutive dates from a list of dates given per g

Time:08-30

Lets say we have a sample dataframe with ID and dates. We also have a date_list = ['2021-12-16', 2021-11-20'] (yyyy-mm-dd)

sample_df: id (str) = unique id dates (datetime) period (str) = 24h is divided into four 6h period. Therefore, period 2 refers to time between 6am - 12 noon for example.

id dates period
aaaa 2021-11-16 2
aaaa 2021-12-16 1
aaaa 2022-09-05 2
bbbb 2021-11-20 1
bbbb 2022-01-22 3
bbbb 2021-09-12 1
bbbb 2022-03-23 3
bbbb 2021-11-21 2
bbbb 2021-11-21 1

For group ID, bbbb I would like to check:

  1. if the dates in the dataframe are three previous consecutive dates from date_list[1] = '2021-11-20'(i.e., Nov 19, 2021, Nov 18, 2021, Nov 17, 2021).

  2. If there are consecutive dates in the 3 day window, check if these consecutive dates have at least 2 periods per day, i.e., two rows with same dates but different periods.

  3. If so, then calculate the length of rows and add to the table as a column. So in this example for bbbb, there should be two rows that fall within 3-day period of the given date and has 2 periods.

Similarly, for aaaa ID, we need to check 1) if the dates are 3 previous consecutive dates from date_list[0] = '2021-12-16'. Repeat same steps as above.

Any help is appreciated!

d = {'id': ['aaaa', 'aaaa', 'aaaa', 'bbbb', 'bbbb', 'bbbb', 'bbbb', 'bbbb', 'bbbb'],
        'dates': ['2021-11-16', '2021-12-16', '2022-09-05', 
                  '2021-11-20', '2022-01-22', '2021-09-12',
                 '2022-03-23', '2021-11-21', '2021-11-21'],
       'period': ['2', '1', '2',
                 '1', '3', '1', 
                 '3', '2', '1']}
sample_df = pd.DataFrame(d)
sample_df

CodePudding user response:

df.dates = pd.to_datetime(df.dates)

df.loc[df.id.eq('aaaa'), 'match'] = date_list[0]
df.loc[df.id.eq('bbbb'), 'match'] = date_list[1]
df.match = pd.to_datetime(df.match)

mask = (df.dates.sub(df.match).abs().between('1d', '3d') &
        df.groupby(['id', 'dates'])['period'].transform('count').ge(2))

print(df[mask])

# Output - Rows that match your conditions:

     id      dates period      match
7  bbbb 2021-11-21      2 2021-11-20
8  bbbb 2021-11-21      1 2021-11-20
  • Related