My initial dataframe looks as follows:
User | License | Status | Start-Date | End-Date |
---|---|---|---|---|
A | xy | access | 10.01.2022 | 13.01.2022 |
B | xy | access | 11.01.2022 | 14.01.2022 |
C | xy | access | 11.01.2022 | 14.01.2022 |
A | xy | access | 12.01.2022 | 15.01.2022 |
A | xy | access | 14.01.2022 | 17.01.2022 |
B | xy | access | 21.01.2022 | 24.01.2022 |
A | xy | access | 21.01.2022 | 24.01.2022 |
There are three users (A, B, C) who request a license on different days. In principle, the end date is always three days after the start date, because a license is locked for a period of three days.
If, for example, user A accesses the license again within those three days, the period is extended by another three days.
My ultimate goal is to get a graph of the blocked licenses over time, so I can see how many licenses were blocked on each day.
But that is only a later step. I thought the best way to get there would be to first build the following table (dataframe) and then simply plot Sum over Date:
Date | A | B | C | Sum |
---|---|---|---|---|
10.01.2022 | 1 | 0 | 0 | 1 |
11.01.2022 | 1 | 1 | 1 | 3 |
12.01.2022 | 1 | 1 | 1 | 3 |
13.01.2022 | 1 | 1 | 1 | 3 |
14.01.2022 | 1 | 1 | 1 | 3 |
15.01.2022 | 1 | 0 | 0 | 1 |
16.01.2022 | 1 | 0 | 0 | 1 |
17.01.2022 | 1 | 0 | 0 | 1 |
18.01.2022 | 0 | 0 | 0 | 0 |
19.01.2022 | 0 | 0 | 0 | 0 |
20.01.2022 | 0 | 0 | 0 | 0 |
21.01.2022 | 1 | 1 | 0 | 2 |
22.01.2022 | 1 | 1 | 0 | 2 |
23.01.2022 | 1 | 1 | 0 | 2 |
24.01.2022 | 1 | 1 | 0 | 2 |
(But I am not sure if this is the best way to achieve it.)
Would that be possible with pandas? If yes, how? To be honest, I have no clue. I hope I didn't make the question too complicated.
So I'm only concerned with how to get the second dataframe, not with the graph itself.
CodePudding user response:
From the graph it seems like you only care about the total number of active licenses per day, so I'm providing an answer in that context. If you need the breakdown at user level, it has to be changed a bit.
First, let's import the packages and create a sample dataframe. I've added one extra row for user A at the end to make some of the core concepts clearer. I'm also explicitly converting the start and end date columns from strings to proper dates.
import pandas as pd
import datetime

data_dict = {
    'User': ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'A'],
    'License': ['xy', 'xy', 'xy', 'xy', 'xy', 'xy', 'xy', 'xy'],
    'Status': ['access', 'access', 'access', 'access', 'access', 'access', 'access', 'access'],
    'Start-Date': ['10.01.2022', '11.01.2022', '11.01.2022', '12.01.2022', '14.01.2022', '21.01.2022', '21.01.2022', '16.01.2022'],
    'End-Date': ['13.01.2022', '14.01.2022', '14.01.2022', '15.01.2022', '17.01.2022', '24.01.2022', '24.01.2022', '20.01.2022']
}

pdf = pd.DataFrame.from_dict(data=data_dict)
pdf['Start-Date'] = pd.to_datetime(pdf['Start-Date'], format='%d.%m.%Y')
pdf['End-Date'] = pd.to_datetime(pdf['End-Date'], format='%d.%m.%Y')
The dataframe will be like the one below.
User License Status Start-Date End-Date
A xy access 2022-01-10 2022-01-13
B xy access 2022-01-11 2022-01-14
C xy access 2022-01-11 2022-01-14
A xy access 2022-01-12 2022-01-15
A xy access 2022-01-14 2022-01-17
B xy access 2022-01-21 2022-01-24
A xy access 2022-01-21 2022-01-24
A xy access 2022-01-16 2022-01-20
The tricky part here is to merge the overlapping intervals for each user into one bigger interval. I've borrowed the core ideas from this SO post. The idea is to first group the data per user (together with the other non-date columns). Within each group we then split the rows into sub-groups, so that a user's overlapping intervals end up in the same sub-group. Once we have that, all we need is the minimum start date and the maximum end date per sub-group, which we assign as the new start and end dates to every entry of that sub-group. Finally, we drop the duplicates because they are redundant.
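To make the sub-grouping trick concrete, here is a tiny, made-up illustration (toy dates, not part of the question's data): comparing each start date against the running maximum of all previous end dates gives a boolean series, and its cumulative sum labels the sub-groups.
toy = pd.DataFrame({
    'Start-Date': pd.to_datetime(['2022-01-10', '2022-01-12', '2022-01-21']),
    'End-Date': pd.to_datetime(['2022-01-13', '2022-01-15', '2022-01-24']),
})
toy['group'] = (toy['Start-Date'] > toy['End-Date'].cummax().shift()).cumsum()
# toy['group'] is [0, 0, 1]: the first two intervals overlap and will be merged,
# the third starts after a gap and forms its own sub-group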
The next part is pretty straightforward: we create a new dataframe containing every date between the global start and end dates, join the two dataframes, and sum up the licenses per day.
def f(df_grouped):
    df_grouped = df_grouped.sort_values(by='Start-Date').reset_index(drop=True)
    # a new sub-group starts whenever an interval begins after every earlier interval
    # has ended; cummax() guards against intervals fully contained in a longer, earlier one
    df_grouped["group"] = (df_grouped["Start-Date"] > df_grouped["End-Date"].cummax().shift()).cumsum()
    grp = df_grouped.groupby("group")
    # the merged interval runs from the earliest start to the latest end of its sub-group
    df_grouped['New-Start-Date'] = grp['Start-Date'].transform('min')
    df_grouped['New-End-Date'] = grp['End-Date'].transform('max')
    return df_grouped.drop("group", axis=1)
pdf2 = (pdf.groupby(['User', 'License', 'Status'])
           .apply(f)
           .drop(['Start-Date', 'End-Date'], axis=1)
           .drop_duplicates())
pdf2 contains the refined start and end dates per user, as we can see below.
                              User License  Status New-Start-Date New-End-Date
User License Status
A    xy      access 0            A      xy  access     2022-01-10   2022-01-20
                    4            A      xy  access     2022-01-21   2022-01-24
B    xy      access 0            B      xy  access     2022-01-11   2022-01-14
                    1            B      xy  access     2022-01-21   2022-01-24
C    xy      access 0            C      xy  access     2022-01-11   2022-01-14
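The repeated User/License/Status block on the left is just the group index added by groupby/apply. If you prefer a flat table, it can optionally be dropped; this is purely cosmetic and the later steps work either way.
pdf2 = pdf2.reset_index(drop=True)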
Let's create the new dataframe now, with date ranges.
num_days = (pdf['End-Date'].max() - pdf['Start-Date'].min()).days + 1
# num_days = 15
min_date = pdf['Start-Date'].min()
pdf3 = pd.DataFrame.from_dict(data={'Dates': [min_date + datetime.timedelta(days=n) for n in range(num_days)]})
pdf3 looks as follows:
Dates
2022-01-10
2022-01-11
2022-01-12
2022-01-13
2022-01-14
2022-01-15
2022-01-16
2022-01-17
2022-01-18
2022-01-19
2022-01-20
2022-01-21
2022-01-22
2022-01-23
2022-01-24
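As a side note, the same date column could also be built in one line with pd.date_range (standard pandas, just an alternative to the list comprehension above):
pdf3 = pd.DataFrame({'Dates': pd.date_range(pdf['Start-Date'].min(), pdf['End-Date'].max(), freq='D')})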
Let's now cross-join these two dataframes and flag, for every interval/date combination, whether the license is active on that date:
pdf4 = pd.merge(pdf2, pdf3, how='cross')

def final_data_prep(start, end, date):
    # 1 if the date lies inside the (inclusive) license interval, 0 otherwise
    if start <= date <= end:
        return 1
    else:
        return 0

pdf4['Num-License'] = pdf4.apply(
    lambda row: final_data_prep(row['New-Start-Date'], row['New-End-Date'], row['Dates']),
    axis=1)
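For larger dataframes, the same flag can also be computed without a row-wise apply, as a plain vectorized comparison (equivalent result):
pdf4['Num-License'] = ((pdf4['Dates'] >= pdf4['New-Start-Date'])
                       & (pdf4['Dates'] <= pdf4['New-End-Date'])).astype(int)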
pdf4 looks as follows:
User License Status New-Start-Date New-End-Date Dates Num-License
0 A xy access 2022-01-10 2022-01-20 2022-01-10 1
1 A xy access 2022-01-10 2022-01-20 2022-01-11 1
2 A xy access 2022-01-10 2022-01-20 2022-01-12 1
3 A xy access 2022-01-10 2022-01-20 2022-01-13 1
4 A xy access 2022-01-10 2022-01-20 2022-01-14 1
... ... ... ... ... ... ... ...
70 C xy access 2022-01-11 2022-01-14 2022-01-20 0
71 C xy access 2022-01-11 2022-01-14 2022-01-21 0
72 C xy access 2022-01-11 2022-01-14 2022-01-22 0
73 C xy access 2022-01-11 2022-01-14 2022-01-23 0
74 C xy access 2022-01-11 2022-01-14 2022-01-24 0
Now, all we need to do is do a groupby per day, and voila!
pdf_final = (pdf4[['Dates', 'User', 'License', 'Status', 'Num-License']]
             .groupby('Dates')['Num-License'].sum()
             .reset_index()
             .sort_values(['Dates'], ascending=True)
             .reset_index(drop=True))
# here's the final dataframe
Dates Num-License
0 2022-01-10 1
1 2022-01-11 3
2 2022-01-12 3
3 2022-01-13 3
4 2022-01-14 3
5 2022-01-15 1
6 2022-01-16 1
7 2022-01-17 1
8 2022-01-18 1
9 2022-01-19 1
10 2022-01-20 1
11 2022-01-21 2
12 2022-01-22 2
13 2022-01-23 2
14 2022-01-24 2
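Since your original goal was a graph, here is a minimal plotting sketch on top of pdf_final (assuming matplotlib is installed; not part of the steps above):
import matplotlib.pyplot as plt

pdf_final.plot(x='Dates', y='Num-License', drawstyle='steps-post', legend=False)
plt.ylabel('Blocked licenses')
plt.show()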
If you need to retain the user-level info as well, just add the User column to the last groupby (a sketch of that follows below). Also, to replicate your exact results, remove the extra last row when creating the sample dataframe.
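A rough sketch of that user-level variant, reusing pdf4 from above: keep User in the groupby and pivot it into columns, which reproduces the Date/A/B/C/Sum table from the question.
per_user = pdf4.groupby(['Dates', 'User'])['Num-License'].sum().unstack(fill_value=0)
per_user['Sum'] = per_user.sum(axis=1)
print(per_user)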