I have the following dataframe:
import pandas as pd

data = {"timestamp": ["2022-12-15 22:00:00", "2022-12-15 22:00:30", "2022-12-15 22:00:47",
                      "2022-12-15 22:00:03", "2022-12-15 22:00:30", "2022-12-15 22:00:43",
                      "2022-12-15 22:00:10", "2022-12-15 22:00:34", "2022-12-15 22:00:59"],
        "ID": ["A", "A", "A",
               "B", "B", "B",
               "C", "C", "C"],
        "value": [11, 0, 0,
                  7, 5, 7,
                  0, 3.4, 3.4]
        }
df_test = pd.DataFrame(data, columns=["timestamp", "ID", "value"])
df_test["timestamp"] = pd.to_datetime(df_test["timestamp"])
I want to create a new dataframe which, for every ID, has a row for every second from "2022-12-15 22:00:00" to "2022-12-15 22:01:00", all in the same dataframe. So the resulting dataframe will have 180 rows (60 for each ID, where each row is one second in the time interval). For the rows whose timestamp matches one in df_test I want the value from df_test, and otherwise I want a NaN value.
I have tried using the following code:
df_resampled = df_test.groupby("ID").resample("S", on="timestamp").sum().reset_index()
But this has the problem that for rows which do not match, 0 is returned instead of NaN.
CodePudding user response:
This seems to work.
import pandas as pd
from datetime import datetime, timedelta
data = {"timestamp": ["2022-12-15 22:00:00", "2022-12-15 22:00:30", "2022-12-15 22:00:47",
                      "2022-12-15 22:00:03", "2022-12-15 22:00:30", "2022-12-15 22:00:43",
                      "2022-12-15 22:00:10", "2022-12-15 22:00:34", "2022-12-15 22:00:59"],
        "ID": ["A", "A", "A",
               "B", "B", "B",
               "C", "C", "C"],
        "value": [11, 0, 0,
                  7, 5, 7,
                  0, 3.4, 3.4]
        }
df = pd.DataFrame(data, columns=["timestamp", "ID", "value"])
df["timestamp"] = pd.to_datetime(df["timestamp"])
start = datetime.strptime('12/15/22 22:00:00', '%m/%d/%y %H:%M:%S')
end = datetime.strptime('12/15/22 22:01:00', '%m/%d/%y %H:%M:%S')
# Build the full per-second range once (60 timestamps, 22:00:00 - 22:00:59)
# and pair it with every ID.
time_range = pd.date_range(start, end - timedelta(seconds=1), freq='s')
li = []
for id in df.ID.unique():
    for ts in time_range:
        li.append({'ID': id, 'timestamp': ts})
# The original rows come first in the concat, so drop_duplicates keeps
# their values and the generated rows only fill the gaps with NaN.
df1 = (pd.concat([df, pd.DataFrame.from_dict(li)])
         .drop_duplicates(['ID', 'timestamp'])
         .sort_values(['ID', 'timestamp'])
         .reset_index())
print(df1)
yields:
     index           timestamp ID  value
0        0 2022-12-15 22:00:00  A   11.0
1        1 2022-12-15 22:00:01  A    NaN
2        2 2022-12-15 22:00:02  A    NaN
3        3 2022-12-15 22:00:03  A    NaN
4        4 2022-12-15 22:00:04  A    NaN
..     ...                 ... ..    ...
175    175 2022-12-15 22:00:55  C    NaN
176    176 2022-12-15 22:00:56  C    NaN
177    177 2022-12-15 22:00:57  C    NaN
178    178 2022-12-15 22:00:58  C    NaN
179      8 2022-12-15 22:00:59  C    3.4

[180 rows x 4 columns]
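As a quick sanity check, each ID should now cover all 60 seconds of the interval; with the sample data above this should print:
print(df1.groupby('ID').size())
# ID
# A    60
# B    60
# C    60
# dtype: int64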
CodePudding user response:
The "value" issue itself could be fixed as follows:
res = (df_test.set_index('timestamp')
              .groupby('ID')
              .resample('S')
              .asfreq()['value']
              .reset_index())
res.shape
# (139, 3) N.B. Wrong start and end!
However, this won't solve another problem: a simple resample will start/end with the first/last timestamp of each ID, and in your example these are not always 22:00:00 and 22:00:59.
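You can see where the 139 comes from by counting rows per group (a quick check on the res from the snippet above; the spans follow from the sample data):
res.groupby('ID').size()
# ID
# A    48    (22:00:00 through 22:00:47)
# B    41    (22:00:03 through 22:00:43)
# C    50    (22:00:10 through 22:00:59)
# dtype: int64
# 48 + 41 + 50 == 139, which explains the shape above.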
Here's an alternative approach:

- Use pd.date_range to create a range for the 60 seconds.
- Use pd.MultiIndex.from_product to create a new index that combines Series.unique applied to column ID with the date range we have just created (rng). (Make sure to use the ranges in this order!)
- Now we can simply use df.set_index to make ID and timestamp the index, apply df.reindex, and finally apply df.reset_index again.
rng = pd.date_range(start='2022-12-15 22:00:00', end='2022-12-15 22:00:59',
                    freq="S")
multi_index = pd.MultiIndex.from_product([df_test.ID.unique(), rng],
                                         names=['ID', 'timestamp'])
res = df_test.set_index(['ID', 'timestamp']).reindex(multi_index).reset_index()
# check result
res[(res.value.notna()) |
    (res.timestamp.isin(['2022-12-15 22:00:00', '2022-12-15 22:00:59']))]
      ID           timestamp  value
0      A 2022-12-15 22:00:00   11.0
30     A 2022-12-15 22:00:30    0.0
47     A 2022-12-15 22:00:47    0.0
59     A 2022-12-15 22:00:59    NaN
60     B 2022-12-15 22:00:00    NaN
63     B 2022-12-15 22:00:03    7.0
90     B 2022-12-15 22:00:30    5.0
103    B 2022-12-15 22:00:43    7.0
119    B 2022-12-15 22:00:59    NaN
120    C 2022-12-15 22:00:00    NaN
130    C 2022-12-15 22:00:10    0.0
154    C 2022-12-15 22:00:34    3.4
179    C 2022-12-15 22:00:59    3.4
# Note that all the zeros are still there,
# and that each `ID` starts/ends with the correct timestamp
res.shape
# (180, 3)
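And as a last check that each ID really starts and ends at the interval boundaries (this is what the sample data should yield):
res.groupby('ID')['timestamp'].agg(['min', 'max'])
#                    min                 max
# ID
# A  2022-12-15 22:00:00 2022-12-15 22:00:59
# B  2022-12-15 22:00:00 2022-12-15 22:00:59
# C  2022-12-15 22:00:00 2022-12-15 22:00:59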