I have the following dataframe:
import pandas as pd

data = {"timestamp": ["2022-12-15 22:00:00", "2022-12-15 22:00:30", "2022-12-15 22:00:47",
                      "2022-12-15 22:00:03", "2022-12-15 22:00:30", "2022-12-15 22:00:43",
                      "2022-12-15 22:00:10", "2022-12-15 22:00:34", "2022-12-15 22:00:59"],
        "ID": ["A", "A", "A",
               "B", "B", "B",
               "C", "C", "C"],
        "value": [11, 0, 0,
                  7, 5, 7,
                  0, 3.4, 3.4]
        }
df_test = pd.DataFrame(data, columns=["timestamp", "ID", "value"])
df_test["timestamp"] = pd.to_datetime(df_test["timestamp"])
I want to create a new dataframe which, for every ID, has a row for every second from "2022-12-15 22:00:00" to "2022-12-15 22:01:00", all in the same dataframe. So the resulting dataframe will have 180 rows (60 for each ID, where each row is one second in the time interval). For the rows whose timestamp matches one in df_test I want the value from df_test, and otherwise I want a NaN value.
I have tried using the following code:
df_resampled = df_test.groupby("ID").resample("S", on="timestamp").sum().reset_index()
But this has the problem that for rows which do not match, 0 is returned instead of NaN.
CodePudding user response:
This seems to work.
import pandas as pd
from datetime import datetime, timedelta
data = {"timestamp": ["2022-12-15 22:00:00", "2022-12-15 22:00:30", "2022-12-15 22:00:47",
                      "2022-12-15 22:00:03", "2022-12-15 22:00:30", "2022-12-15 22:00:43",
                      "2022-12-15 22:00:10", "2022-12-15 22:00:34", "2022-12-15 22:00:59"],
        "ID": ["A", "A", "A",
               "B", "B", "B",
               "C", "C", "C"],
        "value": [11, 0, 0,
                  7, 5, 7,
                  0, 3.4, 3.4]
        }
df = pd.DataFrame(data, columns=["timestamp", "ID", "value"])
df["timestamp"] = pd.to_datetime(df["timestamp"])
start = datetime.strptime('12/15/22 22:00:00', '%m/%d/%y %H:%M:%S')
end = datetime.strptime('12/15/22 22:01:00', '%m/%d/%y %H:%M:%S')
# Build the full per-second range once (60 timestamps, 22:00:00 - 22:00:59)
# and pair it with every ID.
time_range = pd.date_range(start, end - timedelta(seconds=1), freq='s')
li = []
for id in df.ID.unique():
    for ts in time_range:
        li.append({'ID': id, 'timestamp': ts})
# The original rows come first in the concat, so drop_duplicates keeps
# their values and the generated rows only fill the gaps with NaN.
df1 = (pd.concat([df, pd.DataFrame.from_dict(li)])
         .drop_duplicates(['ID', 'timestamp'])
         .sort_values(['ID', 'timestamp'])
         .reset_index())
print(df1)
yields:
     index           timestamp ID  value
0        0 2022-12-15 22:00:00  A   11.0
1        1 2022-12-15 22:00:01  A    NaN
2        2 2022-12-15 22:00:02  A    NaN
3        3 2022-12-15 22:00:03  A    NaN
4        4 2022-12-15 22:00:04  A    NaN
..     ...                 ... ..    ...
175    175 2022-12-15 22:00:55  C    NaN
176    176 2022-12-15 22:00:56  C    NaN
177    177 2022-12-15 22:00:57  C    NaN
178    178 2022-12-15 22:00:58  C    NaN
179      8 2022-12-15 22:00:59  C    3.4

[180 rows x 4 columns]
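As a quick sanity check, each ID should now cover all 60 seconds of the interval; with the sample data above this should print:
print(df1.groupby('ID').size())
# ID
# A    60
# B    60
# C    60
# dtype: int64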
CodePudding user response:
The "value" issue itself could be fixed as follows:
res = (df_test.set_index('timestamp')
              .groupby('ID')
              .resample('S')
              .asfreq()['value']
              .reset_index())
res.shape
# (139, 3) N.B. Wrong start and end!
However, this won't solve another problem: a simple resample will start/end with the first/last timestamp of each ID, and in your example these are not always 22:00:00 and 22:00:59.
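You can see where the 139 comes from by counting rows per group (a quick check on the res from the snippet above; the spans follow from the sample data):
res.groupby('ID').size()
# ID
# A    48    (22:00:00 through 22:00:47)
# B    41    (22:00:03 through 22:00:43)
# C    50    (22:00:10 through 22:00:59)
# dtype: int64
# 48 + 41 + 50 == 139, which explains the shape above.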
Here's an alternative approach:

- Use pd.date_range to create a range for the 60 seconds.
- Use pd.MultiIndex.from_product to create a new index that combines Series.unique applied to column ID with the date range we have just created (rng). (Make sure to use the ranges in this order!)
- Now we can simply use df.set_index to make ID and timestamp the index, apply df.reindex, and finally apply df.reset_index again.
rng = pd.date_range(start='2022-12-15 22:00:00', end='2022-12-15 22:00:59',
                    freq="S")
multi_index = pd.MultiIndex.from_product([df_test.ID.unique(), rng],
                                         names=['ID', 'timestamp'])
res = df_test.set_index(['ID', 'timestamp']).reindex(multi_index).reset_index()
# check result
res[(res.value.notna()) |
    (res.timestamp.isin(['2022-12-15 22:00:00', '2022-12-15 22:00:59']))]
      ID           timestamp  value
0      A 2022-12-15 22:00:00   11.0
30     A 2022-12-15 22:00:30    0.0
47     A 2022-12-15 22:00:47    0.0
59     A 2022-12-15 22:00:59    NaN
60     B 2022-12-15 22:00:00    NaN
63     B 2022-12-15 22:00:03    7.0
90     B 2022-12-15 22:00:30    5.0
103    B 2022-12-15 22:00:43    7.0
119    B 2022-12-15 22:00:59    NaN
120    C 2022-12-15 22:00:00    NaN
130    C 2022-12-15 22:00:10    0.0
154    C 2022-12-15 22:00:34    3.4
179    C 2022-12-15 22:00:59    3.4
# Note that all the zeros are still there,
# and that each `ID` starts/ends with the correct timestamp
res.shape
# (180, 3)
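And as a last check that each ID really starts and ends at the interval boundaries (this is what the sample data should yield):
res.groupby('ID')['timestamp'].agg(['min', 'max'])
#                    min                 max
# ID
# A  2022-12-15 22:00:00 2022-12-15 22:00:59
# B  2022-12-15 22:00:00 2022-12-15 22:00:59
# C  2022-12-15 22:00:00 2022-12-15 22:00:59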