Home > Software engineering >  Split dataframe column values into windows of size n and keep date information
Split dataframe column values into windows of size n and keep date information

Time:11-02

Current df:

Date                 Power
2011-04-18 17:00:01  245.83
2011-04-18 17:00:02  246.02
2011-04-18 17:00:03  245.72
2011-04-18 17:00:04  244.71
2011-04-18 17:00:05  245.93
2011-04-18 17:00:06  243.12
2011-04-18 17:00:07  244.72
2011-04-18 17:00:08  242.44
2011-04-18 17:00:09  246.42
2011-04-18 17:00:10  245.02
2011-04-18 17:00:11  244.02
2011-04-18 17:00:12  242.02
...                     ...

I want to split the above dataframe into windows of size n=4 (size is variable) and create the following df.

     date start          date end            power
0    2011-04-18 17:00:01 2011-04-18 17:00:04 [245.83, 246.02, 245.72, 244.71]
1    2011-04-18 17:00:05 2011-04-18 17:00:08 [245.93, 243.12, 244.72, 242.44]
2    2011-04-18 17:00:09 2011-04-18 17:00:12 [246.42, 245.02, 244.02, 242.02]
...                  ...                 ...                              ...

Is there any function in pandas that lets you do this? I searched on the documentation and couldn't find anything that would serve this purpose.

CodePudding user response:

setup

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Date":pd.date_range("2011-04-18 17:00:01", freq="s", periods=11),
        "Power":[245, 246, 245, 244, 245, 243, 244, 242, 246, 245, 244],
    }
)

n = 4

I didn't bother with the decimal places...

solution

def make_row(d):
    return pd.Series({"date_start":d["Date"].min(), "date_end":d["Date"].max(), "power":d["Power"].to_list()})

df.groupby(np.floor(np.linspace(0,len(df)-1,len(df))/n)).apply(make_row)

This results in

             date_start            date_end                 power
0.0 2011-04-18 17:00:01 2011-04-18 17:00:04  [245, 246, 245, 244]
1.0 2011-04-18 17:00:05 2011-04-18 17:00:08  [245, 243, 244, 242]
2.0 2011-04-18 17:00:09 2011-04-18 17:00:11       [246, 245, 244]

explanation

np.floor(np.linspace(0,len(df)-1,len(df))/n)

This piece of code creates an array with n zeroes, followed by n ones, etc up to the length of df

We groupby on this array and pass the sub dataframes to the make_row function which returns a pandas.Series to form a row in the new dataframe.

  • Related