I have a time series with gaps (NaN) and I want to find the start and stop index of the longest consecutive stretch in which no NaN occurs. I have no clue how to do that.
values = [5468.0,
5946.0,
np.nan,
6019.0,
5797.0,
5879.0,
np.nan,
5706.0,
5986.0,
6228.0,
6285.0,
np.nan,
5667.0,
5886.0,
6380.0,
5988.0,
6290.0,
5899.0,
6429.0,
6177.0]
dates = [Timestamp('2018-10-17 13:30:00'),
Timestamp('2018-10-17 14:00:00'),
Timestamp('2018-10-17 14:30:00'),
Timestamp('2018-10-17 15:00:00'),
Timestamp('2018-10-17 15:30:00'),
Timestamp('2018-10-17 16:00:00'),
Timestamp('2018-10-17 16:30:00'),
Timestamp('2018-10-17 17:00:00'),
Timestamp('2018-10-17 17:30:00'),
Timestamp('2018-10-17 18:00:00'),
Timestamp('2018-10-17 18:30:00'),
Timestamp('2018-10-17 19:00:00'),
Timestamp('2018-10-17 19:30:00'),
Timestamp('2018-10-17 20:00:00'),
Timestamp('2018-10-17 20:30:00'),
Timestamp('2018-10-17 21:00:00'),
Timestamp('2018-10-17 21:30:00'),
Timestamp('2018-10-17 22:00:00'),
Timestamp('2018-10-17 22:30:00'),
Timestamp('2018-10-17 23:00:00')]
I found a lot of solutions here on Stack Overflow, but they all work with daily data and count consecutive days via a difference of one day; with my 30-minute frequency that doesn't work.
I know that I can get True/False with isnull() and then groupby(), or use dates.diff()[1:], but I don't know enough to put a solution together.
Answer:
You are almost there. Create groups with a cumulative sum over the NaN mask (each NaN starts a new group), drop the NaN rows so every group contains only valid values, then take per group the difference between the first and the last timestamp. Dropping the NaNs first also handles the first group correctly, since it has no leading NaN to skip:
>>> groups = df['value'].isna().cumsum()
>>> (df[df['value'].notna()]               # NaN rows only mark group boundaries
...    .groupby(groups)['date']            # pandas aligns the grouper on the index
...    .apply(lambda x: x.iloc[-1] - x.iloc[0])
...    .max())
Timedelta('0 days 03:30:00')
Input dataframe:
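The frame can be built from the question's lists (a minimal sketch; the column names 'date' and 'value' are assumptions carried through the rest of the answer):
>>> import pandas as pd
>>> df = pd.DataFrame({'date': dates, 'value': values})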
>>> df
                  date   value
0  2018-10-17 13:30:00  5468.0   # Group 0
1  2018-10-17 14:00:00  5946.0   # Group 0
2  2018-10-17 14:30:00     NaN   # Group 1
3  2018-10-17 15:00:00  6019.0   # Group 1
4  2018-10-17 15:30:00  5797.0   # Group 1
5  2018-10-17 16:00:00  5879.0   # Group 1
6  2018-10-17 16:30:00     NaN   # Group 2
7  2018-10-17 17:00:00  5706.0   # Group 2
8  2018-10-17 17:30:00  5986.0   # Group 2
9  2018-10-17 18:00:00  6228.0   # Group 2
10 2018-10-17 18:30:00  6285.0   # Group 2
11 2018-10-17 19:00:00     NaN   # Group 3
12 2018-10-17 19:30:00  5667.0   # Group 3
13 2018-10-17 20:00:00  5886.0   # Group 3
14 2018-10-17 20:30:00  6380.0   # Group 3
15 2018-10-17 21:00:00  5988.0   # Group 3
16 2018-10-17 21:30:00  6290.0   # Group 3
17 2018-10-17 22:00:00  5899.0   # Group 3
18 2018-10-17 22:30:00  6429.0   # Group 3
19 2018-10-17 23:00:00  6177.0   # Group 3
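For reference, the grouping key itself is just the running count of NaNs seen so far, so a new label starts at every NaN (a quick sketch using the names above):
>>> df['value'].isna().cumsum().values
array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3])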
Update: to extract the longest stretch itself, take the group with the largest span and keep only its valid rows:
groups = df['value'].isna().cumsum()
best_group = (df[df['value'].notna()]
              .groupby(groups)['date']
              .apply(lambda x: x.iloc[-1] - x.iloc[0])
              .idxmax())
out = df[(groups == best_group) & df['value'].notna()]
print(out)
# Output
                  date   value
12 2018-10-17 19:30:00  5667.0
13 2018-10-17 20:00:00  5886.0
14 2018-10-17 20:30:00  6380.0
15 2018-10-17 21:00:00  5988.0
16 2018-10-17 21:30:00  6290.0
17 2018-10-17 22:00:00  5899.0
18 2018-10-17 22:30:00  6429.0
19 2018-10-17 23:00:00  6177.0
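The start and stop index the question asked for can then be read straight off out, and likewise the timestamps:
>>> out.index[0], out.index[-1]
(12, 19)
>>> out['date'].iloc[0], out['date'].iloc[-1]
(Timestamp('2018-10-17 19:30:00'), Timestamp('2018-10-17 23:00:00'))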