How to group by column and a fixed time window/frequency


EDIT: My main goal is to avoid a for loop and to find a way of grouping the data efficiently/fast.

I am trying to group rows of data based on an ID and a 30-day time window.

I have the following example data:

ID Time
12345 2021-01-01 14:00:00
12345 2021-01-15 14:00:00
12345 2021-01-29 14:00:00
12345 2021-02-15 14:00:00
12345 2021-02-16 14:00:00
12345 2021-03-15 14:00:00
12345 2021-04-24 14:00:00
12344 2021-01-24 14:00:00
12344 2021-01-25 14:00:00
12344 2021-04-24 14:00:00

And I would like to have the following data:

ID Time Group
12345 2021-01-01 14:00:00 1
12345 2021-01-15 14:00:00 1
12345 2021-01-29 14:00:00 1
12345 2021-02-15 14:00:00 2
12345 2021-02-16 14:00:00 2
12345 2021-03-15 14:00:00 3
12345 2021-04-24 14:00:00 4
12344 2021-01-24 14:00:00 5
12344 2021-01-25 14:00:00 5
12344 2021-04-24 14:00:00 6

(Group 5 could also be 1, since ID 12344 starts a fresh set of groups; likewise 6 could be 2.) I can differentiate later based on the ID column, so the Group values do not need to be unique across IDs, but they can be.

The most important part is to separate the rows based on the ID and then, for each ID, assign a group to every 30-day time window. By a 30-day time window I mean that, e.g., the first time frame for ID 12345 starts at 2021-01-01 and goes up to 2021-01-31 (this should be group 1), and the second time frame for ID 12345 starts at 2021-02-01 and goes to 2021-03-02 (another 30 days). For example, 2021-02-15 is 45 days after 2021-01-01 and therefore falls into the second window (group 2).

The problem I have faced with the following code is that it anchors the 30-day bins at the first date it finds in the dataframe, not per ID:

grouped_data = df.groupby(["ID", pd.Grouper(key="Time", freq="30D")]).count()

In the code above I have only counted the rows (which would not give me the Group column itself; I was just trying to apply my grouping logic).

I hope someone can help me with this, because I have tried so many different things and nothing has worked. I have already tried the following (possibly incorrectly):

  • pd.rolling()
  • pd.Grouper()
  • for loop
  • etc.

I really don't want to use a for loop, as I have 1.5 million rows. I have tried to vectorize the for loop, but I am not really familiar with vectorization and struggled to convert it; a simplified version of the loop logic is sketched below.
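To make the intended logic concrete, here is a minimal sketch of that loop (an illustration only, assuming a DataFrame df with columns ID and Time as in the example):

df = df.sort_values(["ID", "Time"]).reset_index(drop=True)

groups = []
for _, sub in df.groupby("ID"):
    anchor = sub["Time"].iloc[0]  # each ID's first timestamp anchors its windows
    for t in sub["Time"]:
        groups.append((t - anchor).days // 30)  # 0 for days 0-29, 1 for days 30-59, ...
df["Group"] = groups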

Please let me know if I can use pd.Grouper differently to get these results. Thanks in advance.

CodePudding user response:

For arbitrary windows you can use pandas.cut.

E.g., for 30-day bins starting at 2021-01-01 00:00:00 and covering the entirety of 2021, you can use:

bins = pd.date_range("2021", "2022", freq="30D")
group = pd.cut(df["Time"], bins)

group will label each row with an interval, which you can then group on, etc. If you want the groups to have labels 0, 1, 2, etc., then you can map the values with:

dict(zip(group.unique(), range(group.nunique())))
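For example, to attach the integer labels to the dataframe (a usage sketch; passing labels=False to pd.cut would also return integer bin codes directly):

mapping = dict(zip(group.unique(), range(group.nunique())))
df["Group"] = group.map(mapping)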

EDIT: an approach where the windows are 30-day intervals, disjoint, and each starting at a time in the Time column:

times = df["Time"].sort_values()
ii = pd.IntervalIndex.from_arrays(times, times + pd.Timedelta("30 days"))

disjoint_intervals = []
prev_interval = None

for i, interval in enumerate(ii):
    if prev_interval is None or interval.left >= prev_interval.right:  # no overlap
        prev_interval = interval 
        disjoint_intervals.append(i)

bins = ii[disjoint_intervals]
group = pd.cut(df["Time"], bins)
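You can then group on these bins as before, e.g. to count rows per ID and window (a usage sketch):

counts = df.groupby(["ID", group]).size()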

Apologies, this is not a vectorised approach. Struggling to think if one could exist.

CodePudding user response:

SOLUTION: The solution that worked for me is the following:

I have imported the sample data from Excel into a dataframe. The data looks like this:

ID Time
12345 2021-01-01 14:00:00
12345 2021-01-15 14:00:00
12345 2021-01-29 14:00:00
12345 2021-02-15 14:00:00
12345 2021-02-16 14:00:00
12345 2021-03-15 14:00:00
12345 2021-04-24 14:00:00
12344 2021-01-24 14:00:00
12344 2021-01-25 14:00:00
12344 2021-04-24 14:00:00

Then I have used the following steps:

  1. Import the data:
df_test = pd.read_excel(r"sampleData.xlsx")
  2. Order the dataframe so we have the correct order of ID and Time:
df_test_ordered = df_test.sort_values(["ID","Time"])
df_test_ordered = df_test_ordered.reset_index(drop=True)

I have also reset the index and dropped the old one, as it interfered with my calculations later on.

  3. Create a column with the time difference to the previous row:
df_test_ordered.loc[df_test_ordered["ID"] == df_test_ordered["ID"].shift(1), "time_diff"] = df_test_ordered["Time"] - df_test_ordered["Time"].shift(1)
  4. Transform timedelta64[ns] to timedelta64[D]:
df_test_ordered["time_diff"] = df_test_ordered["time_diff"].astype("timedelta64[D]")  # newer pandas may reject this cast; .dt.days gives the number of days instead
  5. Calculate the cumsum per ID:
df_test_ordered["cumsum"] = df_test_ordered.groupby("ID")["time_diff"].transform(pd.Series.cumsum)
  6. Backfill the dataframe (exchange the NaN values with the next value):
df_final = df_test_ordered.bfill()
  7. Create the window by dividing by 30 (the 30-day time period):
df_final["Window"] = df_final["cumsum"] / 30
df_final["Window_int"] = df_final["Window"].astype(int)

The "Window_int" column is now a kind of ID (not unique; but unique within the groups of column "ID").

Furthermore, I needed to backfill the dataframe because the time difference is only calculated when the previous row has the same ID; otherwise the time difference is set to NaN. Backfilling simply sets such a NaN to the next row's time difference, which makes no difference mathematically here and assigns the correct value.
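Putting the steps together, this is a compact sketch of the whole pipeline (same column names as above; .dt.days replaces the astype cast, which newer pandas versions may reject):

import pandas as pd

df = pd.read_excel(r"sampleData.xlsx")
df = df.sort_values(["ID", "Time"]).reset_index(drop=True)

# day difference to the previous row, only where the previous row has the same ID
same_id = df["ID"] == df["ID"].shift(1)
df.loc[same_id, "time_diff"] = (df["Time"] - df["Time"].shift(1)).dt.days

# cumulative days per ID; backfill so each ID's first row gets the next value
df["cumsum"] = df.groupby("ID")["time_diff"].cumsum().bfill()

# 30-day window index (unique within each ID)
df["Window_int"] = (df["cumsum"] / 30).astype(int)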

Solution dataframe:

      ID                Time  time_diff  cumsum    Window  Window_int
0  12344 2021-01-24 14:00:00        1.0     1.0  0.032258           0
1  12344 2021-01-25 14:00:00        1.0     1.0  0.032258           0
2  12344 2021-04-24 14:00:00       89.0    90.0  2.903226           2
3  12345 2021-01-01 14:00:00       14.0    14.0  0.451613           0
4  12345 2021-01-15 14:00:00       14.0    14.0  0.451613           0
5  12345 2021-01-29 14:00:00       14.0    28.0  0.903226           0
6  12345 2021-02-15 14:00:00       17.0    45.0  1.451613           1
7  12345 2021-02-16 14:00:00        1.0    46.0  1.483871           1
8  12345 2021-03-15 14:00:00       27.0    73.0  2.354839           2
9  12345 2021-04-24 14:00:00       40.0   113.0  3.645161           3
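If you need Group labels that are unique across IDs, as in the desired output at the top, GroupBy.ngroup can produce them (a sketch; the numbering order may differ from the example, but each ID/window combination gets a distinct label):

df_final["Group"] = df_final.groupby(["ID", "Window_int"]).ngroup() + 1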