How to resample a dataframe ONLY when time range is too long?

Time:02-25

I have a simple DataFrame like this:

timestamp Power
29/08/2021 02:30:16 155
29/08/2021 02:45:19 151
29/08/2021 03:00:14 155
29/08/2021 03:30:12 152
29/08/2021 04:00:12 149
29/08/2021 04:15:09 152
29/08/2021 04:30:16 153
29/08/2021 04:45:09 211
29/08/2021 05:30:19 77

These data should be measured every 15 minutes, but for some reason some measurements were skipped. I want to add the missing timestamps, each with a "NaN" value, wherever a measurement was skipped. I know this can be done with the "resample" function, but it's important to use it only where needed. So I need to add a condition to that function: resample only between rows that are (for example) more than 16 minutes apart. That way, wherever no resampling is needed, the timestamps stay the original ones, which is very important for my work. What I want to obtain is, more or less:

timestamp Power
29/08/2021 02:30:16 155
29/08/2021 02:45:19 151
29/08/2021 03:00:14 155
29/08/2021 03:15:00 NaN
29/08/2021 03:30:12 152
29/08/2021 03:45:00 NaN
29/08/2021 04:00:12 149
29/08/2021 04:15:09 152
29/08/2021 04:30:16 153
29/08/2021 04:45:09 211
29/08/2021 05:00:00 NaN
29/08/2021 05:15:00 NaN
29/08/2021 05:30:19 77
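
For context, here is a minimal sketch of why an unconditional resample does not meet this requirement. It uses only the first four rows of the data above; the names `raw`, `plain` and `kept` are illustrative and not part of the question.

```python
import pandas as pd

# First four rows of the question's data, parsed as day/month/year.
raw = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["29/08/2021 02:30:16", "29/08/2021 02:45:19",
         "29/08/2021 03:00:14", "29/08/2021 03:30:12"],
        format="%d/%m/%Y %H:%M:%S",
    ),
    "Power": [155, 151, 155, 152],
}).set_index("timestamp")

# A plain resample relabels every row onto the 15-minute grid,
# including the rows that had no gap before them.
plain = raw.resample("15min").first()

# Which of the original timestamps survive as labels in the result?
kept = raw.index.isin(plain.index)
```

The gap at 03:15 does show up as NaN, but none of the original timestamps is preserved, which is exactly what the question wants to avoid.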

CodePudding user response:

Okay, so this was trickier than I expected, but I think I figured it out. Here's my solution:

I created a toy example of your df (please provide this code yourself next time):

import pandas as pd
import datetime

df = pd.DataFrame()
df['timestamp'] = ['29/08/2021 02:30:16', '29/08/2021 02:45:19', '29/08/2021 03:00:14', '29/08/2021 03:30:12']
df['Power'] = [155,151,155,152]

The df looks like this:

   timestamp              Power
0  29/08/2021 02:30:16    155
1  29/08/2021 02:45:19    151
2  29/08/2021 03:00:14    155
3  29/08/2021 03:30:12    152

First we convert the timestamp column to pandas datetimes, and then set it as the index of the dataframe.

# the dates are day/month/year, so parse them day-first
df.timestamp = pd.to_datetime(df.timestamp, dayfirst=True)
df.set_index('timestamp', inplace=True)

This allows us to use resample on it, but as you have already noticed, that creates a completely new range of dates instead of keeping your own. I worked around this by using resample separately on each pair of consecutive timestamps. This way it only adds new entries if there is "space" for them between the two timestamps.

timestamp_list = []
power_list = []
# resample each pair of consecutive rows on its own; an extra (NaN) entry
# only appears when the two rows are more than 16 minutes apart
for i in range(len(df)):
    temp_df = df.iloc[i:i+2].resample('16min', origin='start').asfreq()
    timestamp_list.extend(temp_df.index.to_list())
    power_list.extend(temp_df.Power.to_list())
final_df = pd.DataFrame({'timestamp': timestamp_list, 'Power': power_list})

The result looks like this:

  timestamp            Power
0 2021-08-29 02:30:16  155.0
1 2021-08-29 02:45:19  151.0
2 2021-08-29 03:00:14  155.0
3 2021-08-29 03:16:14    NaN
4 2021-08-29 03:30:12  152.0

If you want to re-format the date to the exact same format you had before, I recommend looking into the datetime package. Or you can just do it manually by iterating through the column.
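
As one possible way to do that reformatting without a manual loop, here is a small sketch using `Series.dt.strftime`. The two-row `demo` frame and the `timestamp_str` column are hypothetical stand-ins, not part of the answer above.

```python
import pandas as pd

# Hypothetical stand-in for a frame with a parsed datetime column.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-08-29 03:00:14", "2021-08-29 03:16:14"]),
    "Power": [155.0, None],
})

# Format the datetimes back into day/month/year strings.
demo["timestamp_str"] = demo["timestamp"].dt.strftime("%d/%m/%Y %H:%M:%S")
```

Note that the formatted column holds plain strings, so it is probably best created as a last step, after all time-based operations are done.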

CodePudding user response:

To reproduce your data I have done:

import pandas as pd
data = pd.DataFrame.from_records(
    [
        ["29/08/2021 02:30:16", 155],
        ["29/08/2021 02:45:19", 151],
        ["29/08/2021 02:47:19", 152],
        ["29/08/2021 03:00:14", 155],
        ["29/08/2021 03:30:12", 152],
        ["29/08/2021 04:00:12", 149],
        ["29/08/2021 04:15:09", 152],
        ["29/08/2021 04:30:16", 153],
        ["29/08/2021 04:45:09", 211],
        ["29/08/2021 05:30:19", 77]
    ],
    columns=["timestamp", "Power"],
)
# the dates are day/month/year, so parse them day-first
data["timestamp"] = pd.to_datetime(data["timestamp"], dayfirst=True)

In order to fill the gaps I have done the following steps.

First, create a new column with your rounded timestamps and use it as the index:

data["t_rounded"] = data["timestamp"].dt.round("15min")
data.set_index("t_rounded", inplace=True, drop=True)
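
To illustrate why the rounded index can contain duplicates, here is a small sketch on three of the timestamps from the sample data. The names `stamps`, `rounded` and `collides` are illustrative only.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(
    ["29/08/2021 02:45:19", "29/08/2021 02:47:19", "29/08/2021 03:00:14"],
    format="%d/%m/%Y %H:%M:%S",
))

# Round each timestamp to the nearest 15-minute slot.
rounded = stamps.dt.round("15min")

# Flag every timestamp that shares its slot with another one.
collides = rounded.duplicated(keep=False)
```

Here 02:45:19 and 02:47:19 both round to 02:45:00, so they would collide if the rounded values were used as a unique index.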

Make sure there are no duplicated indices by dropping the duplicates and keeping only the last sample in each rounded slot (keep='last' flags every earlier one as a duplicate):

# drop any duplicated samples which occurred too close
is_duplicate = data.index.duplicated(keep='last')
# keep the duplicates which we are going to remove
duplicates_df = data[is_duplicate]

# remove the duplicates from the original data
data = data[~is_duplicate]

Then, create a new desired equidistant index:

new_index = pd.date_range(data.index[0], data.index[-1], freq="15min")

Now use the new index for your data frame:

data = data.reindex(new_index)
data.reset_index(inplace=True)

Next, fill in the rounded timestamps (now in the column called index because of the reset_index) wherever the original timestamp is missing:

mask = data["timestamp"].isna()
data.loc[mask, "timestamp"] = data.loc[mask, "index"]

Finally, set the new filled timestamps as index and drop the rounded time column

data.set_index("timestamp", inplace=True, drop=True)
data.drop("index", inplace=True, axis=1)

If needed, you can add back the duplicated timestamps which we removed earlier by doing:

df = duplicates_df.reset_index().set_index("timestamp", drop=True).drop("t_rounded", axis=1)
data = pd.concat([data, df]).sort_index()

The final result looks like

                     Power
timestamp                 
2021-08-29 02:30:16  155.0
2021-08-29 02:45:19  151.0
2021-08-29 02:47:19  152.0
2021-08-29 03:00:14  155.0
2021-08-29 03:15:00    NaN
2021-08-29 03:30:12  152.0
2021-08-29 03:45:00    NaN
2021-08-29 04:00:12  149.0
2021-08-29 04:15:09  152.0
2021-08-29 04:30:16  153.0
2021-08-29 04:45:09  211.0
2021-08-29 05:00:00    NaN
2021-08-29 05:15:00    NaN
2021-08-29 05:30:19   77.0
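
The steps above can also be condensed into a single helper. The following is a sketch under stated assumptions rather than a drop-in replacement: the function name `fill_missing_slots` is hypothetical, the frame is assumed to be indexed by unique, already-parsed timestamps, and a slot counts as filled whenever at least one measurement rounds to it.

```python
import pandas as pd

def fill_missing_slots(df, freq="15min"):
    """Add NaN rows for grid slots that no measurement rounds to.

    Assumes df is indexed by unique, parsed timestamps.
    """
    rounded = df.index.round(freq)
    grid = pd.date_range(rounded.min(), rounded.max(), freq=freq)
    # slots on the regular grid that no original timestamp rounds to
    missing = grid.difference(rounded)
    # reindex keeps every original row and adds NaN rows for the new labels
    filled = df.reindex(df.index.union(missing))
    return filled.rename_axis(df.index.name)

stamps = [
    "29/08/2021 02:30:16", "29/08/2021 02:45:19", "29/08/2021 02:47:19",
    "29/08/2021 03:00:14", "29/08/2021 03:30:12", "29/08/2021 04:00:12",
    "29/08/2021 04:15:09", "29/08/2021 04:30:16", "29/08/2021 04:45:09",
    "29/08/2021 05:30:19",
]
power = [155, 151, 152, 155, 152, 149, 152, 153, 211, 77]
data = pd.DataFrame(
    {"Power": power},
    index=pd.to_datetime(stamps, format="%d/%m/%Y %H:%M:%S"),
)
data.index.name = "timestamp"

result = fill_missing_slots(data)
```

Because the original rows are never dropped, samples that share a rounded slot (such as 02:45:19 and 02:47:19) need no special handling in this variant.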