I have a simple DataFrame like this:
timestamp | Power |
---|---|
29/08/2021 02:30:16 | 155 |
29/08/2021 02:45:19 | 151 |
29/08/2021 03:00:14 | 155 |
29/08/2021 03:30:12 | 152 |
29/08/2021 04:00:12 | 149 |
29/08/2021 04:15:09 | 152 |
29/08/2021 04:30:16 | 153 |
29/08/2021 04:45:09 | 211 |
29/08/2021 05:30:19 | 77 |
These measurements should be taken every 15 minutes, but for some reason some of them were skipped. I want to add each missing timestamp with a "NaN" value wherever a measurement was skipped. I know this can be done with the "resample" function, but it's important to use it only where needed. So I need to add a condition to that function: resample only between rows that are (for example) more than 16 minutes apart. That way, wherever no resampling is needed, the timestamps stay exactly as in the original data, which is very important for my work. What I want to obtain is, more or less:
timestamp | Power |
---|---|
29/08/2021 02:30:16 | 155 |
29/08/2021 02:45:19 | 151 |
29/08/2021 03:00:14 | 155 |
29/08/2021 03:15:00 | NaN |
29/08/2021 03:30:12 | 152 |
29/08/2021 03:45:00 | NaN |
29/08/2021 04:00:12 | 149 |
29/08/2021 04:15:09 | 152 |
29/08/2021 04:30:16 | 153 |
29/08/2021 04:45:09 | 211 |
29/08/2021 05:00:00 | NaN |
29/08/2021 05:15:00 | NaN |
29/08/2021 05:30:19 | 77 |
CodePudding user response:
Okay, so this was trickier than I expected, but I think I figured it out. Here's my solution:
First, I created a toy example of your df (please provide this code yourself next time):
import pandas as pd

df = pd.DataFrame()
df['timestamp'] = ['29/08/2021 02:30:16', '29/08/2021 02:45:19', '29/08/2021 03:00:14', '29/08/2021 03:30:12']
df['Power'] = [155, 151, 155, 152]
The df looks like this:
timestamp Power
0 29/08/2021 02:30:16 155
1 29/08/2021 02:45:19 151
2 29/08/2021 03:00:14 155
3 29/08/2021 03:30:12 152
First we convert the timestamp column into pandas datetime objects and then set it as the index of the dataframe. The dates are day-first, so we tell pandas that explicitly.
df.timestamp = pd.to_datetime(df.timestamp, dayfirst=True)
df.set_index('timestamp', inplace=True)
This allows us to use resample on it, but as you have already noticed, this creates a completely new range of dates instead of incorporating your own. I worked around this by using resample separately on each pair of consecutive timestamps. This way it only adds new entries if there is "space" for them between the timestamps.
final_df = pd.DataFrame()
timestamp_list = []
power_list = []
for i in range(len(df)):
    # resample only the current row together with the row that follows it
    temp_df = df[i:i + 2].resample('16min', origin='start').asfreq()
    timestamp_list.extend(temp_df.index.to_list())
    power_list.extend(temp_df.Power.to_list())
final_df['timestamp'] = timestamp_list
final_df['Power'] = power_list
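The result looks like this (with a 16-minute step and origin='start', the inserted row lands 16 minutes after the preceding measurement):
timestamp Power
0 2021-08-29 02:30:16 155.0
1 2021-08-29 02:45:19 151.0
2 2021-08-29 03:00:14 155.0
3 2021-08-29 03:16:14 NaN
4 2021-08-29 03:30:12 152.0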
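For comparison, here is a minimal sketch of what a plain resample over the whole frame does with the same toy data. It assumes the day-first format shown above and a 15-minute grid; the name plain is only for illustration. Since none of the measured timestamps falls exactly on the grid, the resampled index contains none of them and every value comes out as NaN, which is why the pairwise approach is used instead.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": ["29/08/2021 02:30:16", "29/08/2021 02:45:19",
                  "29/08/2021 03:00:14", "29/08/2021 03:30:12"],
    "Power": [155, 151, 155, 152],
})
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y %H:%M:%S")
df = df.set_index("timestamp")

# resample the whole frame onto a regular 15-minute grid
plain = df.resample("15min").asfreq()

# does any measured timestamp survive in the new index?
print(plain.index.isin(df.index).any())
# are all resampled values missing?
print(plain["Power"].isna().all())
```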
If you want to convert the dates back to the exact format you had before, I recommend looking into strftime from the datetime package, which pandas also offers on datetime columns. Or you can just do it manually by iterating through the column.
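As a small sketch of the first option: pandas exposes strftime through the .dt accessor, so no explicit loop is needed. The two sample timestamps and the names stamps and formatted are only for illustration, and the format string assumes the day/month/year layout from the question.

```python
import pandas as pd

# two example timestamps, already parsed as datetimes
stamps = pd.Series(pd.to_datetime(["2021-08-29 02:30:16", "2021-08-29 03:00:14"]))

# format them back into day/month/year strings
formatted = stamps.dt.strftime("%d/%m/%Y %H:%M:%S")
print(formatted.tolist())  # ['29/08/2021 02:30:16', '29/08/2021 03:00:14']
```

Note that the result is a column of strings, so do this only as the last step before exporting or displaying the data.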
CodePudding user response:
To reproduce your data (plus one extra sample at 02:47:19, to show how samples that are too close together are handled) I have done:
import pandas as pd

data = pd.DataFrame.from_records(
    [
        ["29/08/2021 02:30:16", 155],
        ["29/08/2021 02:45:19", 151],
        ["29/08/2021 02:47:19", 152],
        ["29/08/2021 03:00:14", 155],
        ["29/08/2021 03:30:12", 152],
        ["29/08/2021 04:00:12", 149],
        ["29/08/2021 04:15:09", 152],
        ["29/08/2021 04:30:16", 153],
        ["29/08/2021 04:45:09", 211],
        ["29/08/2021 05:30:19", 77],
    ],
    columns=["timestamp", "Power"],
)
# the dates are day-first, so parse them accordingly
data["timestamp"] = pd.to_datetime(data["timestamp"], dayfirst=True)
In order to fill the gaps I have done the following steps.
First, create a new column with the timestamps rounded to 15 minutes and set it as the index:
data["t_rounded"] = data["timestamp"].dt.round("15min")
data.set_index("t_rounded", inplace=True, drop=True)
Make sure there are no duplicated indices by dropping all duplicates and keeping only the first sample:
# flag samples that round to an already occupied slot (all but the first in each slot)
is_duplicate = data.index.duplicated(keep='first')
# store the duplicates we are about to remove so they can be added back later
duplicates_df = data[is_duplicate]
# remove the duplicates from the original data
data = data[~is_duplicate]
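To see why duplicates can appear at all, here is a minimal sketch with three of the timestamps above: 02:45:19 and 02:47:19 both round to 02:45:00. The names ts, rounded and is_dup are only for illustration, and keep="first" is assumed so that the earlier sample in each slot is the one that stays.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(
    ["29/08/2021 02:45:19", "29/08/2021 02:47:19", "29/08/2021 03:00:14"],
    format="%d/%m/%Y %H:%M:%S",
))

# round every timestamp to the nearest 15-minute slot
rounded = ts.dt.round("15min")

# mark every sample whose slot is already taken by an earlier sample
is_dup = rounded.duplicated(keep="first")
print(is_dup.tolist())  # [False, True, False]
```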
Then, create a new desired equidistant index:
new_index = pd.period_range(data.index.values[0], data.index.values[-1], freq="15min")
new_index = new_index.to_timestamp()
Now use the new index for your data frame:
data = data.reindex(new_index)
data.reset_index(inplace=True)
Next, fill the empty timestamps with the rounded ones (that column is now called index because of the reset_index):
mask = data["timestamp"].isna()
data.loc[mask, "timestamp"] = data.loc[mask, "index"]
Finally, set the filled timestamps as the index and drop the rounded time column:
data.set_index("timestamp", inplace=True, drop=True)
data.drop("index", inplace=True, axis=1)
If needed, you can add back the duplicated timestamps we removed earlier:
df = duplicates_df.reset_index().set_index("timestamp", drop=True).drop("t_rounded", axis=1)
data = pd.concat([data, df]).sort_index()
The final result looks like this:
Power
timestamp
2021-08-29 02:30:16 155.0
2021-08-29 02:45:19 151.0
2021-08-29 02:47:19 152.0
2021-08-29 03:00:14 155.0
2021-08-29 03:15:00 NaN
2021-08-29 03:30:12 152.0
2021-08-29 03:45:00 NaN
2021-08-29 04:00:12 149.0
2021-08-29 04:15:09 152.0
2021-08-29 04:30:16 153.0
2021-08-29 04:45:09 211.0
2021-08-29 05:00:00 NaN
2021-08-29 05:15:00 NaN
2021-08-29 05:30:19 77.0
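The condition from the question (only fill where two neighbouring rows are more than 16 minutes apart) can also be expressed directly. What follows is a rough sketch, not a drop-in replacement for the steps above: the function name fill_gaps and its parameters step and max_gap are made up for this example, and it assumes a sorted DatetimeIndex without duplicate timestamps. Original timestamps are left untouched, and the missing slots are added on the 15-minute grid.

```python
import pandas as pd


def fill_gaps(df, step="15min", max_gap="16min"):
    """Add NaN rows on the `step` grid, but only inside gaps longer than `max_gap`.

    Hypothetical helper; assumes a sorted, duplicate-free DatetimeIndex.
    """
    max_gap = pd.Timedelta(max_gap)
    one_step = pd.Timedelta(step)
    extra = []
    for start, end in zip(df.index[:-1], df.index[1:]):
        if end - start > max_gap:
            # grid slots strictly between the slots of the two measurements
            first_missing = start.round(step) + one_step
            last_missing = end.round(step) - one_step
            extra.extend(pd.date_range(first_missing, last_missing, freq=step))
    # merge the original timestamps with the missing slots, keeping the index name
    new_index = df.index.union(pd.DatetimeIndex(extra)).rename(df.index.name)
    return df.reindex(new_index)


raw = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["29/08/2021 02:30:16", "29/08/2021 02:45:19", "29/08/2021 03:00:14",
         "29/08/2021 03:30:12", "29/08/2021 04:00:12", "29/08/2021 04:15:09",
         "29/08/2021 04:30:16", "29/08/2021 04:45:09", "29/08/2021 05:30:19"],
        format="%d/%m/%Y %H:%M:%S",
    ),
    "Power": [155, 151, 155, 152, 149, 152, 153, 211, 77],
}).set_index("timestamp")

filled = fill_gaps(raw)
print(filled)
```

With the data from the question this adds rows at 03:15:00, 03:45:00, 05:00:00 and 05:15:00. The loop is fine for a few thousand rows; for very long series a vectorized approach like the rounding and reindex steps above will likely be faster.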