I have a time series with gaps (NaN) and I want to find the start and stop index of the longest consecutive stretch in which no NaN occurs. I have no clue how to do that.
values = [5468.0,
5946.0,
np.nan,
6019.0,
5797.0,
5879.0,
np.nan,
5706.0,
5986.0,
6228.0,
6285.0,
np.nan,
5667.0,
5886.0,
6380.0,
5988.0,
6290.0,
5899.0,
6429.0,
6177.0]
dates = [Timestamp('2018-10-17 13:30:00'),
Timestamp('2018-10-17 14:00:00'),
Timestamp('2018-10-17 14:30:00'),
Timestamp('2018-10-17 15:00:00'),
Timestamp('2018-10-17 15:30:00'),
Timestamp('2018-10-17 16:00:00'),
Timestamp('2018-10-17 16:30:00'),
Timestamp('2018-10-17 17:00:00'),
Timestamp('2018-10-17 17:30:00'),
Timestamp('2018-10-17 18:00:00'),
Timestamp('2018-10-17 18:30:00'),
Timestamp('2018-10-17 19:00:00'),
Timestamp('2018-10-17 19:30:00'),
Timestamp('2018-10-17 20:00:00'),
Timestamp('2018-10-17 20:30:00'),
Timestamp('2018-10-17 21:00:00'),
Timestamp('2018-10-17 21:30:00'),
Timestamp('2018-10-17 22:00:00'),
Timestamp('2018-10-17 22:30:00'),
Timestamp('2018-10-17 23:00:00')]
I found a lot of solutions here on Stack Overflow, but they all work with daily data and count consecutive days via a difference of one day; with my 30-minute frequency that doesn't work.
I know that I can get True/False with isnull() and then groupby(), or use dates.diff()[1:], but I don't know enough to put a solution together.
Answer:
You are almost there. Create groups with a cumulative sum over the NaN mask (each NaN starts a new group), drop the NaN rows so every group contains only valid values, then take per group the difference between the first and the last timestamp. Dropping the NaNs first also handles the first group correctly, since it has no leading NaN to skip:
>>> groups = df['value'].isna().cumsum()
>>> (df[df['value'].notna()]               # NaN rows only mark group boundaries
...    .groupby(groups)['date']            # pandas aligns the grouper on the index
...    .apply(lambda x: x.iloc[-1] - x.iloc[0])
...    .max())
Timedelta('0 days 03:30:00')
Input dataframe:
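The frame can be built from the question's lists (a minimal sketch; the column names 'date' and 'value' are assumptions carried through the rest of the answer):
>>> import pandas as pd
>>> df = pd.DataFrame({'date': dates, 'value': values})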
>>> df
                  date   value
0  2018-10-17 13:30:00  5468.0   # Group 0
1  2018-10-17 14:00:00  5946.0   # Group 0
2  2018-10-17 14:30:00     NaN   # Group 1
3  2018-10-17 15:00:00  6019.0   # Group 1
4  2018-10-17 15:30:00  5797.0   # Group 1
5  2018-10-17 16:00:00  5879.0   # Group 1
6  2018-10-17 16:30:00     NaN   # Group 2
7  2018-10-17 17:00:00  5706.0   # Group 2
8  2018-10-17 17:30:00  5986.0   # Group 2
9  2018-10-17 18:00:00  6228.0   # Group 2
10 2018-10-17 18:30:00  6285.0   # Group 2
11 2018-10-17 19:00:00     NaN   # Group 3
12 2018-10-17 19:30:00  5667.0   # Group 3
13 2018-10-17 20:00:00  5886.0   # Group 3
14 2018-10-17 20:30:00  6380.0   # Group 3
15 2018-10-17 21:00:00  5988.0   # Group 3
16 2018-10-17 21:30:00  6290.0   # Group 3
17 2018-10-17 22:00:00  5899.0   # Group 3
18 2018-10-17 22:30:00  6429.0   # Group 3
19 2018-10-17 23:00:00  6177.0   # Group 3
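For reference, the grouping key itself is just the running count of NaNs seen so far, so a new label starts at every NaN (a quick sketch using the names above):
>>> df['value'].isna().cumsum().values
array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3])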
Update: to extract the longest stretch itself, take the group with the largest span and keep only its valid rows:
groups = df['value'].isna().cumsum()
best_group = (df[df['value'].notna()]
              .groupby(groups)['date']
              .apply(lambda x: x.iloc[-1] - x.iloc[0])
              .idxmax())
out = df[(groups == best_group) & df['value'].notna()]
print(out)
# Output
                  date   value
12 2018-10-17 19:30:00  5667.0
13 2018-10-17 20:00:00  5886.0
14 2018-10-17 20:30:00  6380.0
15 2018-10-17 21:00:00  5988.0
16 2018-10-17 21:30:00  6290.0
17 2018-10-17 22:00:00  5899.0
18 2018-10-17 22:30:00  6429.0
19 2018-10-17 23:00:00  6177.0
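The start and stop index the question asked for can then be read straight off out, and likewise the timestamps:
>>> out.index[0], out.index[-1]
(12, 19)
>>> out['date'].iloc[0], out['date'].iloc[-1]
(Timestamp('2018-10-17 19:30:00'), Timestamp('2018-10-17 23:00:00'))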