Current df:
Date Power
2011-04-18 17:00:01 245.83
2011-04-18 17:00:02 246.02
2011-04-18 17:00:03 245.72
2011-04-18 17:00:04 244.71
2011-04-18 17:00:05 245.93
2011-04-18 17:00:06 243.12
2011-04-18 17:00:07 244.72
2011-04-18 17:00:08 242.44
2011-04-18 17:00:09 246.42
2011-04-18 17:00:10 245.02
2011-04-18 17:00:11 244.02
2011-04-18 17:00:12 242.02
... ...
I want to split the above dataframe into windows of size n=4 (size is variable) and create the following df.
date start date end power
0 2011-04-18 17:00:01 2011-04-18 17:00:04 [245.83, 246.02, 245.72, 244.71]
1 2011-04-18 17:00:05 2011-04-18 17:00:08 [245.93, 243.12, 244.72, 242.44]
2 2011-04-18 17:00:09 2011-04-18 17:00:12 [246.42, 245.02, 244.02, 242.02]
... ... ... ...
Is there any function in pandas that lets you do this? I searched on the documentation and couldn't find anything that would serve this purpose.
CodePudding user response:
setup
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"Date":pd.date_range("2011-04-18 17:00:01", freq="s", periods=11),
"Power":[245, 246, 245, 244, 245, 243, 244, 242, 246, 245, 244],
}
)
n = 4
I didn't bother with the decimal places...
solution
def make_row(d):
return pd.Series({"date_start":d["Date"].min(), "date_end":d["Date"].max(), "power":d["Power"].to_list()})
df.groupby(np.floor(np.linspace(0,len(df)-1,len(df))/n)).apply(make_row)
This results in
date_start date_end power
0.0 2011-04-18 17:00:01 2011-04-18 17:00:04 [245, 246, 245, 244]
1.0 2011-04-18 17:00:05 2011-04-18 17:00:08 [245, 243, 244, 242]
2.0 2011-04-18 17:00:09 2011-04-18 17:00:11 [246, 245, 244]
explanation
np.floor(np.linspace(0,len(df)-1,len(df))/n)
This piece of code creates an array with n zeroes, followed by n ones, etc up to the length of df
We groupby on this array and pass the sub dataframes to the make_row
function which returns a pandas.Series
to form a row in the new dataframe.