Home > Enterprise >  Given a dataframe with event details, return a count of events that occured on any given date, based
Given a dataframe with event details, return a count of events that occured on any given date, based

Time:01-27

What I have and am trying to do:

A dataframe, with headers: event_id, location_id, start_date, end_date.

An event can only have one location, start and end. A location can have multiple events, starts and ends, and they can overlap.

The goal here is to be able to say, given any time T, for location X, how many events were there?

E.g.

Given three events, all for location 2:

Event. Start. End.
Event 1 2022-05-01 2022-05-07
Event 2 2022-05-04 2022-05-10
Event 3 2022-05-02 2022-05-05
Time T. Count of Events
2022-05-01 1
2022-05-02 2
2022-05-03 2
2022-05-04 3
2022-05-05 3
2022-05-06 2

**What I have tried so far, but got stuck on: **

((I did look at THIS possible solution for a similar problem, and I went pretty far with it, but I got lost in the itterows and how to have that apply here.))

Try to get an array or dataframe that has a 365 day date range for each location ID.

E.g.

[1,2022-01-01],[1,2022-01-02]........[98,2022-01-01][98,2022-01-02]

Then convert that array to a dataframe, and merge it with the original dataframe like:

index location time event location2 start end
0 1 2022-01-01 1 10 2022-11-07 2022-11-12
1 1 2022-01-01 2 4 2022-02-16 2022-03-05
2 1 2022-01-01 3 99 2022-06-10 2022-06-15
3 1 2022-01-01 4 2 2021-12-31 2022-01-05
4 1 2022-01-01 5 5 2022-05-08 2022-05-22

Then perform some kind of reduction that returns the count:

location Time Count
1 2022-01-01 10
1 2022-01-02 3
1 2022-01-03 13
.. ... ...
99 2022-01-01 4
99 2022-01-02 0
99 2022-01-03 7
99 2022-01-04 12

I've done something similar with tying events to other events where their dates overalapped, using the .loc(...) but I don't think that would work here, and I'm kind of just stumped.

Where I got stuck was creating an array that combines the location ID and date range, because they're different lengths, and I could figure out how to repeat the location ID for every date in the range.

Anyway, I am 99% positive that there is a much more efficient way of doing this, and really any help at all is greatly appreciated!!

Thank you :)

CodePudding user response:

Update per comment

# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])

# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the location of dates that are between start and end
new_df = pd.DataFrame({'Date': date_range,
                       'Location': [df[df['Start.'].le(date) & df['End.'].ge(date)]['Event.'].tolist()
                                    for date in date_range]})
# get the length of each list, which is the count
new_df['Count'] = new_df['Location'].str.len()

        Date                     Location  Count
0 2022-05-01                    [Event 1]      1
1 2022-05-02           [Event 1, Event 3]      2
2 2022-05-03           [Event 1, Event 3]      2
3 2022-05-04  [Event 1, Event 2, Event 3]      3
4 2022-05-05  [Event 1, Event 2, Event 3]      3
5 2022-05-06           [Event 1, Event 2]      2
6 2022-05-07           [Event 1, Event 2]      2
7 2022-05-08                    [Event 2]      1
8 2022-05-09                    [Event 2]      1
9 2022-05-10                    [Event 2]      1

IIUC you can try something like

# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the count of dates that are between start and end
# df.le is less than or equal to
# df.ge is greater than or equal to
new_df = pd.DataFrame({'Date': date_range,
                       'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
                                 for date in date_range]})

        Date  Count
0 2022-05-01      1
1 2022-05-02      2
2 2022-05-03      2
3 2022-05-04      3
4 2022-05-05      3
5 2022-05-06      2
6 2022-05-07      2
7 2022-05-08      1
8 2022-05-09      1
9 2022-05-10      1

Depending on how large your date range is we may need to take a different approach as things may get slow if you have a range of two years instead of 10 days in the example.

You can also use a custom date range if you do not want to use the min and max values from the whole frame

min_date = '2022-05-01'
max_date = '2022-05-06'
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the count of dates that are between start and end
new_df = pd.DataFrame({'Date': date_range,
                       'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
                                 for date in date_range]})

        Date  Count
0 2022-05-01      1
1 2022-05-02      2
2 2022-05-03      2
3 2022-05-04      3
4 2022-05-05      3
5 2022-05-06      2

CodePudding user response:

Note - I wanted to leave the original question up as is, and I was out of space, so I am answering my own question here, but @It_is_Chris is the real MVP.

Update! - with the enormous help from @It_is_Chris and some additional messing around, I was able to use the following code to generate the output I wanted:

# get the min and max dates
min_date, max_date = original_df[['start', 'end']].stack().agg([min, max])

# create a date range
date_range = pd.date_range(min_date, max_date)

# create location range
loc_range = original_df['location'].unique()

# create a new list that combines every date with every location
combined_list = []
for item in date_range:
    for location in loc_range:
        combined_list.append(
            {
                'Date':item,
                'location':location
            }
        )

# convert the list to a dataframe
combined_df = pd.DataFrame(combined_list)

# use merge to put original data together with the new dataframe
merged_df = pd.merge(combined_df,original_df, how="left", on="location")

# use loc to directly connect each event to a specific location and time
merged_df = merged_df.loc[(pd.to_datetime(merged_df['Date'])>=pd.to_datetime(merged_df['start'])) & (pd.to_datetime(merged_df['Date'])<=pd.to_datetime(merged_df['end']))]

# use groupby to push out a table as sought Date - Location - Count
output_merged_df = merged_df.groupby(['Date','fleet_id']).size()

The result looked like this:

Note - the sorting was not as I have it here, I believe I would need to add some additional sorting to the dataframe before outputting as a CSV.

Date location count
2022-01-01 1 1
2022-01-01 2 4
2022-01-01 3 1
2022-01-01 4 10
2022-01-01 5 3
2022-01-01 6 1
  • Related