I have a dataframe like the one below:
ID  GROUP  DATE_MIN           DATE_MAX
1   L1     02/12/2022 6:30AM  02/12/2022 6:35AM
2   L1     02/12/2022 6:33AM  02/12/2022 6:40AM
3   L1     02/12/2022 6:37AM  02/12/2022 6:40AM
4   L2     02/12/2022 7:30AM  02/12/2022 7:35AM
5   L2     02/12/2022 7:36AM  02/12/2022 7:40AM
6   L2     02/12/2022 7:37AM  02/12/2022 7:40AM
I want to count, for each row, the number of rows in the same group (GROUP column) whose time range (DATE_MIN, DATE_MAX) overlaps that row's range. The expected output is:
ID  GROUP  DATE_MIN           DATE_MAX           NumberOfRows
1   L1     02/12/2022 6:30AM  02/12/2022 6:35AM  2  <<because of IDs 1 and 2>>
2   L1     02/12/2022 6:33AM  02/12/2022 6:40AM  3  <<because of IDs 1, 2 and 3>>
3   L1     02/12/2022 6:37AM  02/12/2022 6:40AM  2  <<because of IDs 2 and 3>>
4   L2     02/12/2022 7:30AM  02/12/2022 7:35AM  1  <<because of ID 4 only>>
5   L2     02/12/2022 7:36AM  02/12/2022 7:40AM  2  <<because of IDs 5 and 6>>
6   L2     02/12/2022 7:37AM  02/12/2022 7:40AM  2  <<because of IDs 5 and 6>>
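For anyone wanting to reproduce this, the sample data can be built as below (a sketch; the DATE values for IDs 5 and 6 follow the expected-output table, since the counts only work out with those):

```python
import pandas as pd

# reconstruct the sample dataframe from the question
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'GROUP': ['L1', 'L1', 'L1', 'L2', 'L2', 'L2'],
    'DATE_MIN': ['02/12/2022 6:30AM', '02/12/2022 6:33AM', '02/12/2022 6:37AM',
                 '02/12/2022 7:30AM', '02/12/2022 7:36AM', '02/12/2022 7:37AM'],
    'DATE_MAX': ['02/12/2022 6:35AM', '02/12/2022 6:40AM', '02/12/2022 6:40AM',
                 '02/12/2022 7:35AM', '02/12/2022 7:40AM', '02/12/2022 7:40AM'],
})
```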
CodePudding user response:
Assuming your groups are not huge (this solution builds a square comparison matrix per group), you can use numpy broadcasting:
import pandas as pd

def count(g):
    # convert to datetime arrays
    d_min = pd.to_datetime(g['DATE_MIN']).to_numpy()
    d_max = pd.to_datetime(g['DATE_MAX']).to_numpy()
    # build the square matrix of pairwise comparisons: two intervals
    # overlap iff each starts before the other ends,
    # then aggregate as a sum
    return pd.Series(((d_min[:, None] <= d_max)
                      & (d_max[:, None] >= d_min)).sum(axis=0),
                     index=g.index)

df['NumberOfRows'] = df.groupby('GROUP', group_keys=False).apply(count)
NB. It would be more efficient to convert the dates to datetime once up front and remove the conversion from the function.
Output:
   ID GROUP           DATE_MIN           DATE_MAX  NumberOfRows
0   1    L1  02/12/2022 6:30AM  02/12/2022 6:35AM             2
1   2    L1  02/12/2022 6:33AM  02/12/2022 6:40AM             3
2   3    L1  02/12/2022 6:37AM  02/12/2022 6:40AM             2
3   4    L2  02/12/2022 7:30AM  02/12/2022 7:35AM             1
4   5    L2  02/12/2022 7:36AM  02/12/2022 7:40AM             2
5   6    L2  02/12/2022 7:37AM  02/12/2022 7:40AM             2
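Per the NB above, a self-contained sketch of the variant that converts the dates once up front (sample data reconstructed from the question, using the DATE values that match the expected output):

```python
import pandas as pd

# assumed sample data, matching the expected-output table
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'GROUP': ['L1', 'L1', 'L1', 'L2', 'L2', 'L2'],
    'DATE_MIN': ['02/12/2022 6:30AM', '02/12/2022 6:33AM', '02/12/2022 6:37AM',
                 '02/12/2022 7:30AM', '02/12/2022 7:36AM', '02/12/2022 7:37AM'],
    'DATE_MAX': ['02/12/2022 6:35AM', '02/12/2022 6:40AM', '02/12/2022 6:40AM',
                 '02/12/2022 7:35AM', '02/12/2022 7:40AM', '02/12/2022 7:40AM'],
})

# convert once, outside the per-group function
df['DATE_MIN'] = pd.to_datetime(df['DATE_MIN'])
df['DATE_MAX'] = pd.to_datetime(df['DATE_MAX'])

def count(g):
    d_min = g['DATE_MIN'].to_numpy()
    d_max = g['DATE_MAX'].to_numpy()
    # two intervals overlap iff each starts before the other ends
    return pd.Series(((d_min[:, None] <= d_max)
                      & (d_max[:, None] >= d_min)).sum(axis=0),
                     index=g.index)

df['NumberOfRows'] = df.groupby('GROUP', group_keys=False).apply(count)
```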
CodePudding user response:
Another option: compare each row against the whole frame and count the overlaps within the same group:

import pandas as pd

df['DATE_MIN'] = pd.to_datetime(df['DATE_MIN'])
df['DATE_MAX'] = pd.to_datetime(df['DATE_MAX'])
df['NumberOfRows'] = df.apply(
    lambda r: ((df['GROUP'] == r['GROUP'])
               & (df['DATE_MIN'] <= r['DATE_MAX'])
               & (df['DATE_MAX'] >= r['DATE_MIN'])).sum(),
    axis=1)
print(df)