I have a dataframe which looks something like this:
id ts factor
A 2020-01-01 1
A 2020-01-02 1
A 2020-01-03 1
A 2020-01-04 1
A 2020-01-05 1
A 2020-01-06 10
A 2020-01-07 10
A 2020-01-08 10
A 2020-01-09 10
A 2020-01-10 10
A 2020-01-11 10
A 2020-01-12 10
A 2020-01-13 10
A 2020-01-14 10
A 2020-01-15 10
A 2020-01-16 10
A 2020-01-17 10
A 2020-01-18 1
A 2020-01-19 1
A 2020-01-20 1
My desired output is:
id start_ts end_ts factor
A 2020-01-01 2020-01-05 1
A 2020-01-06 2020-01-17 10
A 2020-01-18 2020-01-20 1
So far I can only think of grouping by factor and then taking the min and max of ts, but that doesn't work for factor 1, because its two separate runs get merged into a single group:
df.groupby(["factor"]).agg({'ts': ['min', 'max']})
How can I achieve the desired output?
CodePudding user response:
Use cumsum on a comparison of factor with its shift to label each block of consecutive equal factor values, then add those labels to the groupby:
blocks = df['factor'].ne(df['factor'].shift()).cumsum()
df.groupby(['id','factor',blocks], sort=False)['ts'].agg(['min','max'])
Output:
                         min         max
id factor factor
A  1      1       2020-01-01  2020-01-05
   10     2       2020-01-06  2020-01-17
   1      3       2020-01-18  2020-01-20
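To see why the shift/cumsum trick separates the two runs of factor 1, it may help to look at the intermediate values. The following is only a minimal sketch on a made-up six-row frame (the toy data and the name `changed` are illustrative assumptions, not part of the question):

```python
import pandas as pd

# Hypothetical toy data: factor runs are 1, 1, 10, 10, 1, 1
df = pd.DataFrame({
    "id": ["A"] * 6,
    "ts": pd.date_range("2020-01-01", periods=6, freq="D"),
    "factor": [1, 1, 10, 10, 1, 1],
})

# True wherever factor differs from the previous row. The first row is
# always True because shift() leaves a missing value there.
changed = df["factor"].ne(df["factor"].shift())

# Running count of the changes: each run of equal consecutive values
# gets its own integer label.
blocks = changed.cumsum()

print(changed.tolist())  # [True, False, True, False, True, False]
print(blocks.tolist())   # [1, 1, 2, 2, 3, 3]
```

Because the second run of factor 1 gets label 3 rather than 1, grouping by both factor and the block label keeps it apart from the first run.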
CodePudding user response:
A slightly updated variant of @Quang Hoang's answer, with a named grouping key and named output columns:
blocks = df['factor'].ne(df['factor'].shift()).cumsum()
blocks = blocks.rename("group")
df2 = (
    df.groupby(['id', blocks, 'factor'])
      .agg(start_ts=('ts', 'min'), end_ts=('ts', 'max'))
      .reset_index()
      .drop("group", axis=1)
)
Output:
print(df2)
  id  factor    start_ts      end_ts
0  A       1  2020-01-01  2020-01-05
1  A      10  2020-01-06  2020-01-17
2  A       1  2020-01-18  2020-01-20
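For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the frame from the question and reorders the columns to match the requested layout. It assumes ts holds datetimes (the question does not state the dtype), and the name `result` is just illustrative:

```python
import pandas as pd

# Rebuild the question's data: 5 days of factor 1, 12 days of factor 10,
# then 3 days of factor 1 (20 rows in total).
df = pd.DataFrame({
    "id": ["A"] * 20,
    "ts": pd.date_range("2020-01-01", "2020-01-20", freq="D"),  # assumed dtype
    "factor": [1] * 5 + [10] * 12 + [1] * 3,
})

# Label each run of consecutive equal factor values.
blocks = df["factor"].ne(df["factor"].shift()).cumsum().rename("group")

result = (
    df.groupby(["id", blocks, "factor"])
      .agg(start_ts=("ts", "min"), end_ts=("ts", "max"))
      .reset_index()
      # Selecting columns drops the helper "group" key and fixes the order.
      [["id", "start_ts", "end_ts", "factor"]]
)

print(result)
```

Putting the block label before factor in the groupby keys means the default sorting keeps the rows in chronological order within each id.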