Pandas get date range from timeseries column

Time:05-25

I have a dataframe which looks something like this:

id  ts  factor
A   2020-01-01  1
A   2020-01-02  1
A   2020-01-03  1
A   2020-01-04  1
A   2020-01-05  1
A   2020-01-06  10
A   2020-01-07  10
A   2020-01-08  10
A   2020-01-09  10
A   2020-01-10  10
A   2020-01-11  10
A   2020-01-12  10
A   2020-01-13  10
A   2020-01-14  10
A   2020-01-15  10
A   2020-01-16  10
A   2020-01-17  10
A   2020-01-18  1
A   2020-01-19  1
A   2020-01-20  1

My desired output is:

id  start_ts    end_ts  factor
A   2020-01-01  2020-01-05  1
A   2020-01-06  2020-01-17  10
A   2020-01-18  2020-01-20  1

So far I can only think of grouping by factor and taking the min and max of ts, but that doesn't work for factor 1, because its two separate runs are merged into a single group:

df.groupby(["factor"]).agg({'ts': ['min', 'max']})

How can I achieve this output?

CodePudding user response:

Compare factor with its shifted copy and take the cumsum of the result to label each consecutive block of equal factor values, then add that label to the groupby:

blocks = df['factor'].ne(df['factor'].shift()).cumsum()
df.groupby(['id','factor',blocks], sort=False)['ts'].agg(['min','max'])

Output:

                         min         max
id factor factor                        
A  1      1       2020-01-01  2020-01-05
   10     2       2020-01-06  2020-01-17
   1      3       2020-01-18  2020-01-20
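To see why the block labels come out this way, here is a minimal sketch on a toy series (the values are made up for illustration and are not the question's data). `ne(shift())` flags every row whose value differs from the previous row, and `cumsum` turns those flags into a running block number:

```python
import pandas as pd

# Toy data, assumed for illustration only
factor = pd.Series([1, 1, 10, 10, 10, 1])

# True at every row whose value differs from the previous row
# (the first row is compared with NaN, so it is always True)
changed = factor.ne(factor.shift())

# Running count of changes: rows in the same consecutive run share a number
blocks = changed.cumsum()

print(changed.tolist())  # [True, False, True, False, False, True]
print(blocks.tolist())   # [1, 1, 2, 2, 2, 3]
```

Because the two runs of factor 1 receive different block numbers (1 and 3), grouping on the block label keeps them apart.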

CodePudding user response:

A slightly updated variant of @Quang Hoang's answer, using named aggregation:

blocks = df['factor'].ne(df['factor'].shift()).cumsum()
blocks = blocks.rename("group")

df2 = (
    df.groupby(['id', blocks, 'factor'])
      .agg(start_ts=('ts', 'min'), end_ts=('ts', 'max'))
      .reset_index()
      .drop("group", axis=1)
)

Output:

print(df2)
  id  factor    start_ts      end_ts
0  A       1  2020-01-01  2020-01-05
1  A      10  2020-01-06  2020-01-17
2  A       1  2020-01-18  2020-01-20
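For anyone who wants to reproduce this end to end, the following is a self-contained sketch. It rebuilds the question's frame under the assumption that `ts` holds real datetimes (the question does not say whether the column is datetime or string), and the names `result` and `group` are chosen here for illustration:

```python
import pandas as pd

# Rebuild the frame from the question:
# 5 days of factor 1, then 12 days of factor 10, then 3 days of factor 1.
# Assumption: ts is a datetime column.
df = pd.DataFrame({
    "id": "A",
    "ts": pd.date_range("2020-01-01", periods=20, freq="D"),
    "factor": [1] * 5 + [10] * 12 + [1] * 3,
})

# Label each consecutive run of equal factor values
blocks = df["factor"].ne(df["factor"].shift()).cumsum().rename("group")

result = (
    df.groupby(["id", blocks, "factor"])
      .agg(start_ts=("ts", "min"), end_ts=("ts", "max"))
      .reset_index()
      .drop(columns="group")  # the block label is only a helper
)
print(result)
```

Putting the block label before `factor` in the groupby keys keeps the rows in their original chronological order even with the default sorting.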