Dataframe - how to insert new row with null value, conditionally based on elapsed time?

Time:07-23

Background: My dataset acquires values at roughly 5 minute intervals, but sometimes there are gaps. I am charting the dataset with Plotly and trying to fix an issue where a straight line is drawn between two points when there is a gap between them. Plotly has a parameter, connectgaps, which if set to false will not connect the line across NaN values. However, my dataset has no NaN rows where the gaps are. It looks like this:

(where I have computed the time difference using df['time_diff_mins'] = (df['datetime'].shift(-1) - df['datetime']).dt.total_seconds() / 60)

     datetime                 value   time_diff_mins
0    2022-03-09 09:25:00      98      5
1    2022-03-09 09:30:00      104     21
2    2022-03-09 09:51:00      105     3
3    2022-03-09 09:54:00      110     nan
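
For anyone who wants to reproduce this, the sample above can be rebuilt roughly as follows. The values are copied from the table. Building the frame from a dict is only an assumption about how the data gets loaded.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-03-09 09:25:00",
        "2022-03-09 09:30:00",
        "2022-03-09 09:51:00",
        "2022-03-09 09:54:00",
    ]),
    "value": [98, 104, 105, 110],
})

# Minutes from each row to the next one.
# The last row has no successor, so its difference is NaN.
df["time_diff_mins"] = (
    df["datetime"].shift(-1) - df["datetime"]
).dt.total_seconds() / 60

print(df)
```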

If you look at rows 1 and 2, the time difference is 21 minutes. For this reason, I don't want the values 104 and 105 to be connected - I want a break in the line if there is a gap of greater than 15 mins and 15 seconds.

So, I am trying to insert a new row with null/nan values in my dataframe if the time difference between rows is greater than 15 mins and 15 seconds, so that Plotly will not connect the gaps.

Desired output:

     datetime                 value 
0    2022-03-09 09:25:00      98     
1    2022-03-09 09:30:00      104     
2    2022-03-09 09:40:30      nan
3    2022-03-09 09:51:00      105     
4    2022-03-09 09:54:00      110

I hope that makes sense. I know that inserting rows programmatically is probably not an optimal solution, so I haven't been able to find a good answer to this. Thank you in advance for your time and help.

Answer:

You can build a boolean mask for the gaps that are too wide and use pandas.concat to add a midpoint row for each of them:

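import pandas as pd

df['datetime'] = pd.to_datetime(df['datetime'])

# largest gap allowed before the line should break
delta = pd.Timedelta('15 min 15 s')

# time from each row to the next row
d = df['datetime'].diff().shift(-1)

# midpoint timestamps for the wide gaps; the other columns become NaN after concat
out = (pd.concat([df,
                  df['datetime'].add(d/2)
                    .loc[d.gt(delta)].to_frame()
                  ])
         # stable sort keeps each new row right after the row it came from
         .sort_index(kind='mergesort')
       )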

Output:

             datetime  value  time_diff_mins
0 2022-03-09 09:25:00   98.0             5.0
1 2022-03-09 09:30:00  104.0            21.0
1 2022-03-09 09:40:30    NaN             NaN
2 2022-03-09 09:51:00  105.0             3.0
3 2022-03-09 09:54:00  110.0             NaN
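
To get the exact frame asked for in the question (fresh 0..n index, no helper column), the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: the name insert_gap_rows and its parameters are made up for illustration and are not part of pandas or of the answer above, and the input is assumed to have one timestamp column with no duplicate timestamps.

```python
import pandas as pd


def insert_gap_rows(df, max_gap="15 min 15 s", time_col="datetime"):
    """Hypothetical helper: add a NaN row at the midpoint of every gap wider than max_gap."""
    df = df.sort_values(time_col).reset_index(drop=True)

    # Time from each row to the next row (NaT for the last row).
    gap = df[time_col].shift(-1) - df[time_col]
    too_wide = gap > pd.Timedelta(max_gap)
    if not too_wide.any():
        return df

    # One-column frame of midpoint timestamps; concat fills the other columns with NaN.
    fillers = (df.loc[too_wide, time_col] + gap[too_wide] / 2).to_frame()

    # Sorting by time puts each filler row between the two rows it separates.
    return (
        pd.concat([df, fillers])
        .sort_values(time_col)
        .reset_index(drop=True)
    )


# Sample rows copied from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-03-09 09:25:00",
        "2022-03-09 09:30:00",
        "2022-03-09 09:51:00",
        "2022-03-09 09:54:00",
    ]),
    "value": [98, 104, 105, 110],
})

out = insert_gap_rows(df)
print(out)
```

The value column comes back as float because an integer column cannot hold NaN. The resulting frame can then be charted with connectgaps set to false, as described in the question, so the line breaks at the inserted row.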