I have the following problem which as I read, it's mentioned as the 'gaps and islands' problem. A sample of my data are like the below (I excluded some columns related to the user also, but I suppose the approach would be the same):
user | reports_to | date |
---|---|---|
john | mark | 1/1/2021 |
john | mark | 2/1/2021 |
john | mark | 3/1/2021 |
john | mark | 4/1/2021 |
john | mark | 5/1/2021 |
john | mark | 6/1/2021 |
john | mark | 7/1/2021 |
john | mark | 8/1/2021 |
john | mark | 9/1/2021 |
john | mark | 10/1/2021 |
john | mark | 11/1/2021 |
john | nick | 12/1/2021 |
john | nick | 13/1/2021 |
john | nick | 14/1/2021 |
john | nick | 15/1/2021 |
john | nick | 16/1/2021 |
john | nick | 17/1/2021 |
john | nick | 18/1/2021 |
john | nick | 19/1/2021 |
john | nick | 20/1/2021 |
john | mark | 21/1/2021 |
john | mark | 22/1/2021 |
john | mark | 23/1/2021 |
john | mark | 24/1/2021 |
john | mark | 25/1/2021 |
My dataset contains thousands of rows and couple more columns that are related to the characteristics of the user. That pattern may exist in multiple users.
When I try to find the min and max date that a user had that manager using a simple group by, I get the following result:
user | reports_to | from | to |
---|---|---|---|
john | mark | 1/1/2021 | 25/1/2021 |
john | nick | 12/1/2021 | 20/1/2021 |
Logically, that result is wrong. The desired result would be the following:
user | reports_to | from | to |
---|---|---|---|
john | mark | 1/1/2021 | 11/1/2021 |
john | nick | 12/1/2021 | 20/1/2021 |
john | mark | 21/1/2021 | 25/1/2021 |
My project is on python and I try to solve it using window_functions or rank functions like the following
df["rank"] = df.groupby(['user','reports_to'])["date"].rank("dense", ascending=True)
in order to compare the rank vs the previous date, but I still cannot solve that problem. Which approach would be the best and thank you in advance for your time.
CodePudding user response:
As @Arne mentions in comments, linked answer does work if you assign a new data frame column and then aggregate on it with user
and reports_to
.
Assignment
# ADD NEW GROUP SEQ COLUMN
staff_df = (
staff_df.assign(
group_seq = lambda df: (
df["reports_to"].ne(df["reports_to"].shift()).cumsum()
)
)
)
staff_df
# user reports_to date group_seq
# 0 john mark 2021-01-01 1
# 1 john mark 2021-01-02 1
# 2 john mark 2021-01-03 1
# 3 john mark 2021-01-04 1
# 4 john mark 2021-01-05 1
# 5 john mark 2021-01-06 1
# 6 john mark 2021-01-07 1
# 7 john mark 2021-01-08 1
# 8 john mark 2021-01-09 1
# 9 john mark 2021-01-10 1
# 10 john mark 2021-01-11 1
# 11 john nick 2021-01-12 2
# 12 john nick 2021-01-13 2
# 13 john nick 2021-01-14 2
# 14 john nick 2021-01-15 2
# 15 john nick 2021-01-16 2
# 16 john nick 2021-01-17 2
# 17 john nick 2021-01-18 2
# 18 john nick 2021-01-19 2
# 19 john nick 2021-01-20 2
# 20 john mark 2021-01-21 3
# 21 john mark 2021-01-22 3
# 22 john mark 2021-01-23 3
# 23 john mark 2021-01-24 3
# 24 john mark 2021-01-25 3
Aggregation
agg_df = (
staff_df.groupby(["user", "reports_to", "group_seq"]).agg(
min = ("date", "min"), max = ("date", "max")
).sort_values(["min", "max"])
)
agg_df
# min max
# user reports_to group_seq
# john mark 1 2021-01-01 2021-01-11
# nick 2 2021-01-12 2021-01-20
# mark 3 2021-01-21 2021-01-25
CodePudding user response:
The code is here.
import pandas as pd
from io import StringIO
df = '''
user reports_to date
john mark 1/1/2021
john mark 2/1/2021
john mark 3/1/2021
john mark 4/1/2021
john mark 5/1/2021
john mark 6/1/2021
john mark 7/1/2021
john mark 8/1/2021
john mark 9/1/2021
john mark 10/1/2021
john mark 11/1/2021
john nick 12/1/2021
john nick 13/1/2021
john nick 14/1/2021
john nick 15/1/2021
john nick 16/1/2021
john nick 17/1/2021
john nick 18/1/2021
john nick 19/1/2021
john nick 20/1/2021
john mark 21/1/2021
john mark 22/1/2021
john mark 23/1/2021
john mark 24/1/2021
john mark 25/1/2021
'''
inf=StringIO(df)
df = pd.read_csv(inf,sep="\t")
df['group'] = (( (df.user != df.user.shift() ) | (df.reports_to != df.reports_to.shift()) ).cumsum())
# must do this
df['date'] = pd.to_datetime(df['date'])
result = ( df.groupby(['user','group','reports_to'])['date']
.agg({min, max})
.rename(columns={'min': 'from','max':'to'})
.reset_index() )