Cross join within a dataframe given a time range pandas

Time:12-17

I have a dataframe as below:

ID    GROUP   DATE_MIN               DATE_MAX
1     L1      02/12/2022 6:30AM      02/12/2022 6:35AM
2     L1      02/12/2022 6:33AM      02/12/2022 6:40AM
3     L1      02/12/2022 6:37AM      02/12/2022 6:40AM
4     L2      02/12/2022 7:30AM      02/12/2022 7:35AM
5     L2      02/12/2022 7:36AM      02/12/2022 7:40AM
6     L2      02/12/2022 7:37AM      02/12/2022 7:40AM

For each row, I want to count the rows in the same group (GROUP column) whose time range (DATE_MIN, DATE_MAX) overlaps that row's range, the row itself included. The expected output is:

ID    GROUP   DATE_MIN               DATE_MAX                NumberOfRows
1     L1      02/12/2022 6:30AM      02/12/2022 6:35AM        2 <<because of ID 1 and 2>>
2     L1      02/12/2022 6:33AM      02/12/2022 6:40AM        3 <<because of ID 1, 2 and 3>>
3     L1      02/12/2022 6:37AM      02/12/2022 6:40AM        2 <<because of ID 3 and 2>>
4     L2      02/12/2022 7:30AM      02/12/2022 7:35AM        1 << because of 4 only>>
5     L2      02/12/2022 7:36AM      02/12/2022 7:40AM        2 <<because of 5 and 6>>
6     L2      02/12/2022 7:37AM      02/12/2022 7:40AM        2 <<because of 5 and 6>>

CodePudding user response:

Assuming your groups are not huge (this solution builds a square comparison matrix per group), you can use:

import pandas as pd

def count(g):
    # convert to datetime arrays
    d_min = pd.to_datetime(g['DATE_MIN']).to_numpy()
    d_max = pd.to_datetime(g['DATE_MAX']).to_numpy()
    # build the square comparisons to see if the intervals overlap
    # aggregate as sum
    return pd.Series((  (d_min[:,None]<=d_max)
                      & (d_max[:,None]>=d_min)
                     ).sum(axis=0),
                     index=g.index)

df['NumberOfRows'] = df.groupby('GROUP', group_keys=False).apply(count)

NB. Efficiency might be better if you convert the date columns to datetime once up front and remove the conversion from the function.

Output:

   ID GROUP           DATE_MIN           DATE_MAX  NumberOfRows
0   1    L1  02/12/2022 6:30AM  02/12/2022 6:35AM             2
1   2    L1  02/12/2022 6:33AM  02/12/2022 6:40AM             3
2   3    L1  02/12/2022 6:37AM  02/12/2022 6:40AM             2
3   4    L2  02/12/2022 7:30AM  02/12/2022 7:35AM             1
4   5    L2  02/12/2022 7:36AM  02/12/2022 7:40AM             2
5   6    L2  02/12/2022 7:37AM  02/12/2022 7:40AM             2

CodePudding user response:

df['NumberOfRows'] = df.groupby('GROUP')['DATE_MIN'].transform(lambda x: x.between(df['DATE_MIN'], df['DATE_MAX']).sum())
print(df)
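
For comparison, the cross join named in the title can be sketched with a self-merge on GROUP, which pairs every row with every row of its own group. This is a sketch and not either answerer's method. It assumes that ID is unique, that the date format is month/day/year with a 12-hour clock, and that DATE_MIN is never later than DATE_MAX; the `_other` suffix and the `pairs`, `overlap` and `counts` names are made up for the illustration.

```python
import pandas as pd

# Sample data rebuilt from the question's tables
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "GROUP": ["L1", "L1", "L1", "L2", "L2", "L2"],
    "DATE_MIN": ["02/12/2022 6:30AM", "02/12/2022 6:33AM", "02/12/2022 6:37AM",
                 "02/12/2022 7:30AM", "02/12/2022 7:36AM", "02/12/2022 7:37AM"],
    "DATE_MAX": ["02/12/2022 6:35AM", "02/12/2022 6:40AM", "02/12/2022 6:40AM",
                 "02/12/2022 7:35AM", "02/12/2022 7:40AM", "02/12/2022 7:40AM"],
})
fmt = "%m/%d/%Y %I:%M%p"  # assumed format
df["DATE_MIN"] = pd.to_datetime(df["DATE_MIN"], format=fmt)
df["DATE_MAX"] = pd.to_datetime(df["DATE_MAX"], format=fmt)

# Self-merge on GROUP: every row paired with every row of the same group
pairs = df.merge(df, on="GROUP", suffixes=("", "_other"))

# Keep only the pairs whose intervals overlap (each row overlaps itself)
overlap = (
    (pairs["DATE_MIN"] <= pairs["DATE_MAX_other"])
    & (pairs["DATE_MAX"] >= pairs["DATE_MIN_other"])
)

# Count the overlapping partners per ID and map the counts back
counts = pairs[overlap].groupby("ID").size()
df["NumberOfRows"] = df["ID"].map(counts)
print(df)
```

Like the square matrix, the merge materializes n² pairs per group, so it is also only suitable when the groups are not huge.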
