I have data that looks like this
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'DATE': ['1/1/2015','1/2/2015', '1/3/2015','1/4/2015','1/5/2015','1/6/2015','1/7/2015','1/8/2015',
'1/9/2016','1/2/2015','1/3/2015','1/4/2015','1/5/2015','1/6/2015','1/7/2015'],
'CD': ['A','A','A','A','B','B','A','A','C','A','A','A','A','A','A']})
What I would like to do is group by ID and CD and get the start and stop change for each change. I tried using groupby and agg function but it will group all A together even though they needs to be separated since there is B in between 2 A.
df1 = df.groupby(['ID','CD'])
df1 = df1.agg(
Start_Date = ('Date',np.min),
End_Date=('Date', np.min)
).reset_index()
What I get is :
I was hoping if some one could help me get the result I need. What I am looking for is :
CodePudding user response:
make grouper
for grouping
grouper = df['CD'].ne(df['CD'].shift(1)).cumsum()
grouper
:
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 4
9 5
10 5
11 5
12 5
13 5
14 5
Name: CD, dtype: int32
then use groupby with grouper
df.groupby(['ID', grouper, 'CD'])['DATE'].agg([min, max]).droplevel(1)
output:
min max
ID CD
1 A 1/1/2015 1/4/2015
B 1/5/2015 1/6/2015
A 1/7/2015 1/8/2015
C 1/9/2016 1/9/2016
2 A 1/2/2015 1/7/2015
change column name and use reset_index and so on..for your desired output
(df.groupby(['ID', grouper, 'CD'])['DATE'].agg([min, max]).droplevel(1)
.set_axis(['Start_Date', 'End_Date'], axis=1)
.reset_index()
.assign(CD=lambda x: x.pop('CD')))
result
ID Start_Date End_Date CD
0 1 1/1/2015 1/4/2015 A
1 1 1/5/2015 1/6/2015 B
2 1 1/7/2015 1/8/2015 A
3 1 1/9/2016 1/9/2016 C
4 2 1/2/2015 1/7/2015 A