Gaps and Islands Python dataframe pandas


I have the following problem, which from what I have read is known as the 'gaps and islands' problem. A sample of my data looks like the below (I have excluded some columns related to the user, but I suppose the approach would be the same):

user reports_to date
john mark 1/1/2021
john mark 2/1/2021
john mark 3/1/2021
john mark 4/1/2021
john mark 5/1/2021
john mark 6/1/2021
john mark 7/1/2021
john mark 8/1/2021
john mark 9/1/2021
john mark 10/1/2021
john mark 11/1/2021
john nick 12/1/2021
john nick 13/1/2021
john nick 14/1/2021
john nick 15/1/2021
john nick 16/1/2021
john nick 17/1/2021
john nick 18/1/2021
john nick 19/1/2021
john nick 20/1/2021
john mark 21/1/2021
john mark 22/1/2021
john mark 23/1/2021
john mark 24/1/2021
john mark 25/1/2021

My dataset contains thousands of rows and a couple more columns related to the characteristics of the user. The pattern above may occur for multiple users.

When I try to find the min and max date for which a user had a given manager, using a simple group by, I get the following result:

user reports_to from to
john mark 1/1/2021 25/1/2021
john nick 12/1/2021 20/1/2021
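
Roughly, the group by I use looks like this (a minimal sketch, assuming the data is already in a DataFrame df with a parsed date column):

naive = (
    df.groupby(["user", "reports_to"])["date"]
      .agg(["min", "max"])
      .rename(columns={"min": "from", "max": "to"})
      .reset_index()
)
# both "mark" islands collapse into a single row, because the group keys ignore row order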

Logically, that result is wrong. The desired result would be the following:

user reports_to from to
john mark 1/1/2021 11/1/2021
john nick 12/1/2021 20/1/2021
john mark 21/1/2021 25/1/2021

My project is in Python and I have tried to solve it using window functions or rank functions like the following:

df["rank"] = df.groupby(['user','reports_to'])["date"].rank("dense", ascending=True)

in order to compare the rank against the previous date, but I still cannot solve the problem. Which approach would be best? Thank you in advance for your time.

CodePudding user response:

As @Arne mentions in the comments, the linked answer does work if you assign a new data frame column and then aggregate on it together with user and reports_to.

Assignment

# ADD NEW GROUP SEQ COLUMN
# a new sequence number starts each time reports_to differs from the previous row
staff_df = (
    staff_df.assign(
        group_seq = lambda df: (
            df["reports_to"].ne(df["reports_to"].shift()).cumsum()
        )
    )
)

staff_df
#     user reports_to       date  group_seq
# 0   john       mark 2021-01-01          1
# 1   john       mark 2021-01-02          1
# 2   john       mark 2021-01-03          1
# 3   john       mark 2021-01-04          1
# 4   john       mark 2021-01-05          1
# 5   john       mark 2021-01-06          1
# 6   john       mark 2021-01-07          1
# 7   john       mark 2021-01-08          1
# 8   john       mark 2021-01-09          1
# 9   john       mark 2021-01-10          1
# 10  john       mark 2021-01-11          1
# 11  john       nick 2021-01-12          2
# 12  john       nick 2021-01-13          2
# 13  john       nick 2021-01-14          2
# 14  john       nick 2021-01-15          2
# 15  john       nick 2021-01-16          2
# 16  john       nick 2021-01-17          2
# 17  john       nick 2021-01-18          2
# 18  john       nick 2021-01-19          2
# 19  john       nick 2021-01-20          2
# 20  john       mark 2021-01-21          3
# 21  john       mark 2021-01-22          3
# 22  john       mark 2021-01-23          3
# 23  john       mark 2021-01-24          3
# 24  john       mark 2021-01-25          3

Aggregation

agg_df = (
    staff_df.groupby(["user", "reports_to", "group_seq"]).agg(
        min = ("date", "min"), max = ("date", "max")
    ).sort_values(["min", "max"])
)

agg_df
#                                  min        max
# user reports_to group_seq                      
# john mark       1         2021-01-01 2021-01-11
#      nick       2         2021-01-12 2021-01-20
#      mark       3         2021-01-21 2021-01-25
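
To get this into the from/to shape shown in the question, the aggregated frame can be flattened and the columns renamed (a small follow-up sketch based on the agg_df above):

final_df = (
    agg_df.reset_index()
          .rename(columns={"min": "from", "max": "to"})
          .drop(columns="group_seq")
)

final_df
#    user reports_to       from         to
# 0  john       mark 2021-01-01 2021-01-11
# 1  john       nick 2021-01-12 2021-01-20
# 2  john       mark 2021-01-21 2021-01-25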

CodePudding user response:

The code is here.

import pandas as pd
from io import StringIO
df = '''
user    reports_to  date
john    mark    1/1/2021
john    mark    2/1/2021
john    mark    3/1/2021
john    mark    4/1/2021
john    mark    5/1/2021
john    mark    6/1/2021
john    mark    7/1/2021
john    mark    8/1/2021
john    mark    9/1/2021
john    mark    10/1/2021
john    mark    11/1/2021
john    nick    12/1/2021
john    nick    13/1/2021
john    nick    14/1/2021
john    nick    15/1/2021
john    nick    16/1/2021
john    nick    17/1/2021
john    nick    18/1/2021
john    nick    19/1/2021
john    nick    20/1/2021
john    mark    21/1/2021
john    mark    22/1/2021
john    mark    23/1/2021
john    mark    24/1/2021
john    mark    25/1/2021
'''
inf = StringIO(df)
# the sample above is whitespace-separated, so split on runs of whitespace
df = pd.read_csv(inf, sep=r"\s+")


# a new island starts whenever the (user, reports_to) pair changes from the previous row
df['group'] = ((df.user != df.user.shift()) | (df.reports_to != df.reports_to.shift())).cumsum()

# parse the day-first dates so min/max compare dates rather than strings
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

result = ( df.groupby(['user', 'group', 'reports_to'])['date']
    .agg(['min', 'max'])
    .rename(columns={'min': 'from', 'max': 'to'})
    .reset_index() )
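
Printing result should then give the islands from the question (values derived from the sample data above):

result
#    user  group reports_to       from         to
# 0  john      1       mark 2021-01-01 2021-01-11
# 1  john      2       nick 2021-01-12 2021-01-20
# 2  john      3       mark 2021-01-21 2021-01-25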