How to process data into 20-minute aggregates in Python

I have the following table:

TimeStamp	Name	Marks	Subject
2022-01-01 00:00:02.969	Chris	70	DK
2022-01-01 00:00:04.467	Chris	75	DK
2022-01-01 00:00:05.965	Mark	80	DK
2022-01-01 00:00:08.962	Cuban	60	DK
2022-01-01 00:00:10.461	Cuban	58	DK

I want to aggregate the table into 20-minute intervals, with the min and max of Marks as separate columns for each name.

Expected output

TimeStamp	Subject	Chris_Min	Chris_Max
2022-01-01 00:00:00.000	DK	70	75
2022-01-01 00:20:00.000	DK	etc	etc
2022-01-01 00:40:00.000	DK	etc	etc

I am having a hard time aggregating the data into the required output. The aggregation should be dynamic, so that the interval can be changed to 10 or 30 minutes.

I tried using bins, but I am not getting the desired results.

Please help.

CodePudding user response：

You could try the following:

rule = "10min"
result = (
    df.set_index("TimeStamp").groupby(["Name", "Subject"])
      .resample(rule)
      .agg(Min=("Marks", "min"), Max=("Marks", "max"), STD=("Marks", "std"))
      .unstack(0)
      .swaplevel(0, 1).reset_index()
)

First, set TimeStamp as the index and group by Name and Subject to get the right chunks to work on.
Then .resample() the groups with the given frequency rule.
Then aggregate the required stats by using .agg() with named aggregation (keyword=(column, function) pairs).
Unstack the first index level (Name) to move it into the columns.
Swap the remaining index levels (Subject and TimeStamp) to get the right column order when finally resetting the index.
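
The steps above can be tried end to end on the sample from the question. This is a minimal sketch, assuming the table is already a DataFrame with TimeStamp parsed as datetime and a pandas version that accepts named aggregation after resample; the helper name aggregate_marks is hypothetical and not part of pandas:

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "TimeStamp": pd.to_datetime(
            [
                "2022-01-01 00:00:02.969",
                "2022-01-01 00:00:04.467",
                "2022-01-01 00:00:05.965",
                "2022-01-01 00:00:08.962",
                "2022-01-01 00:00:10.461",
            ]
        ),
        "Name": ["Chris", "Chris", "Mark", "Cuban", "Cuban"],
        "Marks": [70, 75, 80, 60, 58],
        "Subject": ["DK", "DK", "DK", "DK", "DK"],
    }
)


def aggregate_marks(frame, rule="20min"):
    """Hypothetical helper: min/max/std of Marks per Name, Subject and time bin."""
    return (
        frame.set_index("TimeStamp")          # resample needs a datetime index
        .groupby(["Name", "Subject"])         # one chunk per (Name, Subject)
        .resample(rule)                       # bin each chunk by the given rule
        .agg(Min=("Marks", "min"), Max=("Marks", "max"), STD=("Marks", "std"))
        .unstack(0)                           # move Name into the columns
        .swaplevel(0, 1)                      # index order: TimeStamp, Subject
        .reset_index()
    )


result_20 = aggregate_marks(df, "20min")
print(result_20)
```

Only the rule argument changes between interval widths, so aggregate_marks(df, "10min") or aggregate_marks(df, "30min") should cover the dynamic case asked about in the question.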

Result for the given sample:

      TimeStamp Subject   Min              Max                  STD               
Name                    Chris Cuban Mark Chris Cuban Mark     Chris     Cuban Mark
0    2022-01-01      DK    70    58   80    75    60   80  3.535534  1.414214  NaN

If you want the columns exactly as in your expected output, you could add the following:

result = result[
    list(result.columns[:2]) + sorted(result.columns[2:], key=lambda c: c[1])
]
result.columns = [f"{lev1}_{lev0}" if lev1 else lev0 for lev0, lev1 in result.columns]

to get

   TimeStamp Subject  Chris_Min  Chris_Max  ...  Cuban_STD  Mark_Min  Mark_Max  Mark_STD
0 2022-01-01      DK         70         75  ...   1.414214        80        80       NaN

If you're getting the error TypeError: aggregate() missing 1 required positional argument..., then it could be that you're working with an older pandas version that can't handle named aggregation here. You could try the following instead:

rule = "10min"
result = (
    df.set_index("TimeStamp").groupby(["Name", "Subject"])
      .resample(rule)
      .agg({"Marks": ["min", "max", "std"]})
      .droplevel(0, axis=1)
      .unstack(0)
      .swaplevel(0, 1).reset_index()
)
...

CodePudding user response：

Is your table a pandas DataFrame? If so, you can use resample:

# only if TimeStamp is not the index yet:
df = df.set_index('TimeStamp')
# the important part: pass a list of functions to agg, either callables or
# string names for simple functions like 'mean':
df = df.resample('10min').agg(['max', 'min'])
# only if you had to set the index to TimeStamp and want to go back to a normal index:
df = df.reset_index()
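
A plain resample pools all rows in a bin regardless of Name, so it gives one min and max per interval rather than one per name. A minimal sketch on the sample rows from the question, restricted to the Marks column; the helper name resample_marks is hypothetical, and the "5s" rule is used only because the sample spans just a few seconds:

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "TimeStamp": pd.to_datetime(
            [
                "2022-01-01 00:00:02.969",
                "2022-01-01 00:00:04.467",
                "2022-01-01 00:00:05.965",
                "2022-01-01 00:00:08.962",
                "2022-01-01 00:00:10.461",
            ]
        ),
        "Name": ["Chris", "Chris", "Mark", "Cuban", "Cuban"],
        "Marks": [70, 75, 80, 60, 58],
        "Subject": ["DK", "DK", "DK", "DK", "DK"],
    }
)


def resample_marks(frame, rule="20min"):
    """Hypothetical helper: min and max of Marks per time bin, all names pooled."""
    return (
        frame.set_index("TimeStamp")["Marks"]  # keep only the numeric column
        .resample(rule)                        # bin by the given frequency rule
        .agg(["min", "max"])                   # list of functions -> one column each
        .reset_index()
    )


# With 5-second bins the five sample rows fall into three intervals.
print(resample_marks(df, "5s"))
# With 20-minute bins they all fall into a single interval.
print(resample_marks(df, "20min"))
```

To split the result per name as in the expected output, the rows would still need to be grouped by Name before resampling.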

Edit, to get the second table:

# choose aggregation functions
agg_functions = ['min', 'max', 'std']
# set_index on the time column and resample; select Marks, since std is not
# defined for the string columns Name and Subject
resampled_df = df.set_index('TimeStamp').resample('10min')[['Marks']].agg(agg_functions)
# flatten the column MultiIndex, e.g. ('Marks', 'min') -> 'Marks_min'
resampled_df.columns = resampled_df.columns.map('_'.join)
# drop the time index
resampled_df = resampled_df.reset_index(drop=True)
# concatenate with the original df
pd.concat([df, resampled_df], axis=1)