I have new code I'm writing where a DataFrame gets filtered/edited to obtain "stints" for each individual. Using the DataFrame below as an example, I'm basically trying to get each person's start/end dates for a given location. Usually I can get started on my own, but I'm stumped as to how to approach this, so if anyone has ideas I would greatly appreciate it.
 | Person | Location | Date |
---|---|---|---|
0 | Tom | A | 1/1/2021 |
1 | Tom | A | 1/2/2021 |
2 | Tom | A | 1/3/2021 |
3 | Tom | B | 1/4/2021 |
4 | Tom | B | 1/5/2021 |
5 | Tom | B | 1/6/2021 |
6 | Tom | A | 1/7/2021 |
7 | Tom | A | 1/8/2021 |
8 | Tom | A | 1/9/2021 |
9 | Tom | C | 1/10/2021 |
10 | Tom | C | 1/11/2021 |
11 | Tom | A | 1/12/2021 |
12 | Tom | A | 1/13/2021 |
13 | Tom | B | 1/14/2021 |
14 | Tom | B | 1/15/2021 |
15 | Mark | A | 1/1/2021 |
16 | Mark | A | 1/2/2021 |
17 | Mark | B | 1/3/2021 |
18 | Mark | B | 1/4/2021 |
19 | Mark | A | 1/5/2021 |
20 | Mark | A | 1/6/2021 |
21 | Mark | C | 1/7/2021 |
22 | Mark | C | 1/8/2021 |
23 | Mark | C | 1/9/2021 |
24 | Mark | C | 1/10/2021 |
25 | Mark | A | 1/11/2021 |
26 | Mark | A | 1/12/2021 |
27 | Mark | B | 1/13/2021 |
28 | Mark | B | 1/14/2021 |
29 | Mark | B | 1/15/2021 |
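For anyone who wants to experiment, the sample frame above can be rebuilt with something like this (a sketch only; the run-length strings are just a compact way to spell out the `Location` column, and the date format differs trivially from the question's `1/1/2021` style):

```python
import pandas as pd

# Reconstruct the example data from the question:
# Tom:  AAA BBB AAA CC AA BB   (15 days)
# Mark: AA BB AA CCCC AA BBB   (15 days)
df = pd.DataFrame({
    'Person': ['Tom'] * 15 + ['Mark'] * 15,
    'Location': list('AAABBBAAACCAABB') + list('AABBAACCCCAABBB'),
    'Date': list(pd.date_range('2021-01-01', periods=15).strftime('%m/%d/%Y')) * 2,
})
```

The answers below all start by converting `Date` with `pd.to_datetime`, which parses these strings fine.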
Expected outcome:
 | Person | Location | StintNum | Start_Date | End_Date |
---|---|---|---|---|---|
0 | Tom | A | 1 | 1/1/2021 | 1/3/2021 |
1 | Tom | B | 2 | 1/4/2021 | 1/6/2021 |
2 | Tom | A | 3 | 1/7/2021 | 1/9/2021 |
3 | Tom | C | 4 | 1/10/2021 | 1/11/2021 |
4 | Tom | A | 5 | 1/12/2021 | 1/13/2021 |
5 | Tom | B | 6 | 1/14/2021 | 1/15/2021 |
6 | Mark | A | 1 | 1/1/2021 | 1/2/2021 |
7 | Mark | B | 2 | 1/3/2021 | 1/4/2021 |
8 | Mark | A | 3 | 1/5/2021 | 1/6/2021 |
9 | Mark | C | 4 | 1/7/2021 | 1/10/2021 |
10 | Mark | A | 5 | 1/11/2021 | 1/12/2021 |
11 | Mark | B | 6 | 1/13/2021 | 1/15/2021 |
CodePudding user response:
Try this (note: the original one-liner dropped the actual location letters, because the cumsum grouper inherited the name `Location`; the version below keeps them by naming the grouper and pulling the first location in each group):

```python
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

# a new stint starts whenever the location differs from the previous row
stint = df['Location'].ne(df['Location'].shift()).cumsum().rename('stint')

new_df = (
    df.groupby(['Person', stint], sort=False)
      .apply(lambda x: pd.Series(
          [x['Location'].iloc[0], x['Date'].min(), x['Date'].max()],
          index=['Location', 'Start_Date', 'End_Date']))
      .reset_index()
      .drop(columns='stint')
)
new_df['StintNum'] = new_df.groupby('Person').cumcount().add(1)
```

Output:

```
>>> new_df
   Person Location Start_Date   End_Date  StintNum
0     Tom        A 2021-01-01 2021-01-03         1
1     Tom        B 2021-01-04 2021-01-06         2
2     Tom        A 2021-01-07 2021-01-09         3
3     Tom        C 2021-01-10 2021-01-11         4
4     Tom        A 2021-01-12 2021-01-13         5
5     Tom        B 2021-01-14 2021-01-15         6
6    Mark        A 2021-01-01 2021-01-02         1
7    Mark        B 2021-01-03 2021-01-04         2
8    Mark        A 2021-01-05 2021-01-06         3
9    Mark        C 2021-01-07 2021-01-10         4
10   Mark        A 2021-01-11 2021-01-12         5
11   Mark        B 2021-01-13 2021-01-15         6
```
CodePudding user response:
IMO, a clean way is to use `groupby.agg`; it makes it easy to set custom aggregators and is faster than `apply`:
```python
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

# a new group starts each time the location changes
group = df['Location'].ne(df['Location'].shift()).cumsum()

df2 = (
    df.groupby(['Person', group], as_index=False)
      .agg(Location=('Location', 'first'),
           # dummy aggregator to reserve the column position;
           # uncomment it if you want the columns in the order shown below
           # StintNum=('Location', lambda x: float('NaN')),
           Start_Date=('Date', 'min'),
           End_Date=('Date', 'max'),
      )
)
df2['StintNum'] = df2.groupby('Person').cumcount().add(1)
```
Output:
Person Location StintNum Start_Date End_Date
0 Mark A 1 2021-01-01 2021-01-02
1 Mark B 2 2021-01-03 2021-01-04
2 Mark A 3 2021-01-05 2021-01-06
3 Mark C 4 2021-01-07 2021-01-10
4 Mark A 5 2021-01-11 2021-01-12
5 Mark B 6 2021-01-13 2021-01-15
6 Tom A 1 2021-01-01 2021-01-03
7 Tom B 2 2021-01-04 2021-01-06
8 Tom A 3 2021-01-07 2021-01-09
9 Tom C 4 2021-01-10 2021-01-11
10 Tom A 5 2021-01-12 2021-01-13
11 Tom B 6 2021-01-14 2021-01-15
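Both answers hinge on the same idiom: compare a column to its shifted self and take the cumulative sum, which labels each consecutive run of equal values with its own id. A minimal sketch of just that trick, on a toy Series:

```python
import pandas as pd

s = pd.Series(['A', 'A', 'B', 'B', 'A'])

# True at every position where the value differs from the previous row;
# cumsum turns those change points into run ids
runs = s.ne(s.shift()).cumsum()
print(runs.tolist())  # [1, 1, 2, 2, 3]
```

Note the repeated `'A'` at the end gets a new id (3), not the old one (1), which is exactly why each return visit to a location becomes a fresh stint.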