I have the following data set.
ID Date
abc 2017-01-07
abc 2017-01-08
abc 2017-01-09
abc 2017-12-09
xyz 2017-01-05
xyz 2017-01-06
xyz 2017-04-15
xyz 2017-04-16
I am able to generate the following output
ID Count
abc 3
abc 1
xyz 2
xyz 2
using the following code mentioned in count consecutive days python dataframe
d = {
'ID': ['abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz'],
'Date': ['2017-01-07','2017-01-08', '2017-01-09', '2017-12-09', '2017-01-05', '2017-01-06', '2017-04-15', '2017-04-16']
}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'])
series = df.groupby('ID').Date.diff().dt.days.ne(1).cumsum()
df.groupby(['ID', series]).size().reset_index(level=1, drop=True)
How can I get the following output?
ID Start End
abc 2017-01-07 2017-01-09
abc 2017-12-09 2017-12-09
xyz 2017-01-05 2017-01-06
xyz 2017-04-15 2017-04-16
CodePudding user response:
You can use:
series = df.groupby('ID').Date.diff().dt.days.ne(1).cumsum()
(df.groupby(['ID', series])
.agg(Start=('Date', 'min'), End=('Date', 'min'))
.droplevel(1)
.reset_index()
)
output:
ID Start End
0 abc 2017-01-07 2017-01-07
1 abc 2017-12-09 2017-12-09
2 xyz 2017-01-05 2017-01-05
3 xyz 2017-04-15 2017-04-15
CodePudding user response:
Use @mozway's answer
Use agg
:
out = df.groupby(df.groupby('ID')['Date'].diff().ne(pd.Timedelta(days=1)).cumsum()) \
['Date'].agg(**{'Start': 'first', 'End': 'last'}).reset_index()
print(out)
# Output:
Start End
Date
1 2017-01-07 2017-01-09
2 2017-12-09 2017-12-09
3 2017-01-05 2017-01-06
4 2017-04-15 2017-04-16