Home > Enterprise >  How to process data into 20Minute aggregates in Python
How to process data into 20Minute aggregates in Python

Time:08-12

I have the folowing table

TimeStamp Name Marks Subject
2022-01-01 00:00:02.969 Chris 70 DK
2022-01-01 00:00:04.467 Chris 75 DK
2022-01-01 00:00:05.965 Mark 80 DK
2022-01-01 00:00:08.962 Cuban 60 DK
2022-01-01 00:00:10.461 Cuban 58 DK

I want to aggregate the table for each column into 20minute aggregate which includes max, min, values

Expected output

TimeStamp Subject Chris_Min Chris_Max Chris_STD Mark_Min Mark_Max Mark_STD
2022-01-01 00:00:00.000 DK 70 75
2022-01-01 00:20:00.000 DK etc etc
2022-01-01 00:40:00.000 DK etc etc

I am having hard time aggregating the data into required output. The agggregation should be dynamic so as to change to 10min or 30min.

I tried using bins to do it, but not getting the desired results.

Please Help.

CodePudding user response:

You could try the following:

rule = "10min"
result = (
    df.set_index("TimeStamp").groupby(["Name", "Subject"])
      .resample(rule)
      .agg(Min=("Marks", "min"), Max=("Marks", "max"), STD=("Marks", "std"))
      .unstack(0)
      .swaplevel(0, 1).reset_index()
)
  • First setting TimeStamp as index, and grouping by Subject and Name to get the right chunks to work on.
  • Then .resampling() the groups with the given frequency rule.
  • Then aggregating the required stats by using .agg() with named tuples.
  • Unstacking the first index level (Name) to get it in the columns.
  • Swapping the remaining index levels to get the right order when finally resetting the index.

Result for the given sample:

      TimeStamp Subject   Min              Max                  STD               
Name                    Chris Cuban Mark Chris Cuban Mark     Chris     Cuban Mark
0    2022-01-01      DK    70    58   80    75    60   80  3.535534  1.414214  NaN

If you want the columns exactly like in your expected output then you could add the following

result = result[
    list(result.columns[:2])   sorted(result.columns[2:], key=lambda c: c[1])
]
result.columns = [f"{lev1}_{lev0}" if lev1 else lev0 for lev0, lev1 in result.columns]

to get

   TimeStamp Subject  Chris_Min  Chris_Max  ...  Cuban_STD  Mark_Min  Mark_Max  Mark_STD
0 2022-01-01      DK         70         75  ...   1.414214        80        80       NaN

If you're getting the TypeError: aggregate() missing 1 required positional argument... error (the comment is gone), then it could be that you're working with an older Pandas version that can't deal with named tuples. You could try the following instead:

rule = "10min"
result = (
    df.set_index("TimeStamp").groupby(["Name", "Subject"])
      .resample(rule)
      .agg({"Marks": ["min", "max", "std"]})
      .droplevel(0, axis=1)
      .unstack(0)
      .swaplevel(0, 1).reset_index()
)
...

CodePudding user response:

Is your table a pandas dataframe ? If it's a pandas dataframe you can use resample:

# only if timestamp is not the index yet:
df = df.set_index('TimeStamp') 
# the important part, you can use any function in agg or some str for simple 
# functions like mean:
df = df.resample('10Min').agg('max','min') 
# only if you had to set index to timestamp and want to go back to normal index:
df = df.reset_index() 

Edit to get second table in the function:

# choose aggregation function
agg_functions = ['min', 'max', 'std']
# set_index on time column, resample
resampled_df = df.set_index('TimeStamp').resample('10Min').agg(agg_functions)
# flatten multiindex
resampled_df.columns =  resampled_df.columns.map('_'.join)
# drop time column 
resampled_df = resampled_df.reset_index(drop=True)
# concatenate with original df
pd.concat([df, resampled_df], axis=1)
  • Related