Dataframe groupby to get stints


I have some new code I'm trying to write where a dataframe gets filtered/edited to obtain "stints" for each individual. Using the dataframe below as an example, I'm basically trying to get each person's start/end dates for a given location. Usually I can get started on my own, but I'm stumped as to how to approach this, so if anyone has ideas I would greatly appreciate it.

   Person Location       Date
0     Tom        A   1/1/2021
1     Tom        A   1/2/2021
2     Tom        A   1/3/2021
3     Tom        B   1/4/2021
4     Tom        B   1/5/2021
5     Tom        B   1/6/2021
6     Tom        A   1/7/2021
7     Tom        A   1/8/2021
8     Tom        A   1/9/2021
9     Tom        C  1/10/2021
10    Tom        C  1/11/2021
11    Tom        A  1/12/2021
12    Tom        A  1/13/2021
13    Tom        B  1/14/2021
14    Tom        B  1/15/2021
15   Mark        A   1/1/2021
16   Mark        A   1/2/2021
17   Mark        B   1/3/2021
18   Mark        B   1/4/2021
19   Mark        A   1/5/2021
20   Mark        A   1/6/2021
21   Mark        C   1/7/2021
22   Mark        C   1/8/2021
23   Mark        C   1/9/2021
24   Mark        C  1/10/2021
25   Mark        A  1/11/2021
26   Mark        A  1/12/2021
27   Mark        B  1/13/2021
28   Mark        B  1/14/2021
29   Mark        B  1/15/2021

Expected outcome:

   Person Location  StintNum Start_Date   End_Date
0     Tom        A         1   1/1/2021   1/3/2021
1     Tom        B         2   1/4/2021   1/6/2021
2     Tom        A         3   1/7/2021   1/9/2021
3     Tom        C         4  1/10/2021  1/11/2021
4     Tom        A         5  1/12/2021  1/13/2021
5     Tom        B         6  1/14/2021  1/15/2021
6    Mark        A         1   1/1/2021   1/2/2021
7    Mark        B         2   1/3/2021   1/4/2021
8    Mark        A         3   1/5/2021   1/6/2021
9    Mark        C         4   1/7/2021  1/10/2021
10   Mark        A         5  1/11/2021  1/12/2021
11   Mark        B         6  1/13/2021  1/15/2021
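
For reference, a minimal sketch that rebuilds the sample frame above so the answers below can be run as-is (the `records` variable name is just for this sketch; the column names come from the tables):

import pandas as pd

# Rebuild the sample data shown above; dates stay strings until parsed later
records = [
    ('Tom', 'A', '1/1/2021'), ('Tom', 'A', '1/2/2021'), ('Tom', 'A', '1/3/2021'),
    ('Tom', 'B', '1/4/2021'), ('Tom', 'B', '1/5/2021'), ('Tom', 'B', '1/6/2021'),
    ('Tom', 'A', '1/7/2021'), ('Tom', 'A', '1/8/2021'), ('Tom', 'A', '1/9/2021'),
    ('Tom', 'C', '1/10/2021'), ('Tom', 'C', '1/11/2021'),
    ('Tom', 'A', '1/12/2021'), ('Tom', 'A', '1/13/2021'),
    ('Tom', 'B', '1/14/2021'), ('Tom', 'B', '1/15/2021'),
    ('Mark', 'A', '1/1/2021'), ('Mark', 'A', '1/2/2021'),
    ('Mark', 'B', '1/3/2021'), ('Mark', 'B', '1/4/2021'),
    ('Mark', 'A', '1/5/2021'), ('Mark', 'A', '1/6/2021'),
    ('Mark', 'C', '1/7/2021'), ('Mark', 'C', '1/8/2021'),
    ('Mark', 'C', '1/9/2021'), ('Mark', 'C', '1/10/2021'),
    ('Mark', 'A', '1/11/2021'), ('Mark', 'A', '1/12/2021'),
    ('Mark', 'B', '1/13/2021'), ('Mark', 'B', '1/14/2021'), ('Mark', 'B', '1/15/2021'),
]
df = pd.DataFrame(records, columns=['Person', 'Location', 'Date'])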

CodePudding user response:

Try this:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

# Run id that increments every time Location changes between consecutive rows.
# Giving it its own name keeps it from clobbering the real Location column
# when the index is reset.
stint = df['Location'].ne(df['Location'].shift()).cumsum().rename('Stint')

new_df = (
    df.groupby([df['Person'], stint], sort=False)
      .apply(lambda x: pd.Series(
          [x['Location'].iat[0], x['Date'].min(), x['Date'].max()],
          index=['Location', 'Start_Date', 'End_Date']))
      .reset_index()
      .drop(columns='Stint')
)
new_df['StintNum'] = new_df.groupby('Person').cumcount().add(1)

Output:

>>> new_df
   Person Location Start_Date   End_Date  StintNum
0     Tom        A 2021-01-01 2021-01-03         1
1     Tom        B 2021-01-04 2021-01-06         2
2     Tom        A 2021-01-07 2021-01-09         3
3     Tom        C 2021-01-10 2021-01-11         4
4     Tom        A 2021-01-12 2021-01-13         5
5     Tom        B 2021-01-14 2021-01-15         6
6    Mark        A 2021-01-01 2021-01-02         1
7    Mark        B 2021-01-03 2021-01-04         2
8    Mark        A 2021-01-05 2021-01-06         3
9    Mark        C 2021-01-07 2021-01-10         4
10   Mark        A 2021-01-11 2021-01-12         5
11   Mark        B 2021-01-13 2021-01-15         6
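
Both answers hinge on the same shift/compare/cumsum idiom, so it may help to see the run ids it assigns on this data:

# ne(shift) is True whenever Location differs from the previous row,
# and cumsum() turns those Trues into an incrementing run id:
print(df['Location'].ne(df['Location'].shift()).cumsum().head(7).tolist())
# [1, 1, 1, 2, 2, 2, 3]  -> Tom's runs A A A / B B B / A ...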

CodePudding user response:

IMO, a clean way is to use groupby with agg: named aggregation makes it easy to set custom aggregators, and it is faster than apply:

df['Date'] = pd.to_datetime(df['Date'])

group = df['Location'].ne(df['Location'].shift()).cumsum()

df2 = (
 df.groupby(['Person', group], as_index=False)
   .agg(Location=('Location', 'first'),
        # the line below is a dummy aggregation that just reserves a column slot;
        # uncomment it if you want the columns in this exact order
        # StintNum=('Location', lambda x: float('NaN')),
        Start_Date=('Date', 'min'), 
        End_Date=('Date', 'max'),
       )
)

df2['StintNum'] = df2.groupby('Person').cumcount().add(1)

Output:

   Person Location  StintNum Start_Date   End_Date
0    Mark        A         1 2021-01-01 2021-01-02
1    Mark        B         2 2021-01-03 2021-01-04
2    Mark        A         3 2021-01-05 2021-01-06
3    Mark        C         4 2021-01-07 2021-01-10
4    Mark        A         5 2021-01-11 2021-01-12
5    Mark        B         6 2021-01-13 2021-01-15
6     Tom        A         1 2021-01-01 2021-01-03
7     Tom        B         2 2021-01-04 2021-01-06
8     Tom        A         3 2021-01-07 2021-01-09
9     Tom        C         4 2021-01-10 2021-01-11
10    Tom        A         5 2021-01-12 2021-01-13
11    Tom        B         6 2021-01-14 2021-01-15
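
If you want the m/d/Y strings from the question back in either result, the datetime columns can be formatted at the end. A small sketch (note that %m/%d zero-pads, so this gives e.g. 01/01/2021 rather than 1/1/2021):

# Convert the datetime columns of the result back to formatted strings
for col in ['Start_Date', 'End_Date']:
    df2[col] = df2[col].dt.strftime('%m/%d/%Y')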