I have the following table:
TimeStamp | Name | Marks | Subject |
---|---|---|---|
2022-01-01 00:00:02.969 | Chris | 70 | DK |
2022-01-01 00:00:04.467 | Chris | 75 | DK |
2022-01-01 00:00:05.965 | Mark | 80 | DK |
2022-01-01 00:00:08.962 | Cuban | 60 | DK |
2022-01-01 00:00:10.461 | Cuban | 58 | DK |
I want to aggregate the table into 20-minute bins and compute the min, max, and standard deviation of Marks for each name.
Expected output:
TimeStamp | Subject | Chris_Min | Chris_Max | Chris_STD | Mark_Min | Mark_Max | Mark_STD |
---|---|---|---|---|---|---|---|
2022-01-01 00:00:00.000 | DK | 70 | 75 | | | | |
2022-01-01 00:20:00.000 | DK | etc | etc | | | | |
2022-01-01 00:40:00.000 | DK | etc | etc | | | | |
I am having a hard time aggregating the data into the required output. The aggregation should be dynamic, so that the bin size can be changed to 10 or 30 minutes.
I tried using bins, but I am not getting the desired results.
Please help.
CodePudding user response:
You could try the following:
rule = "10min"
result = (
    df.set_index("TimeStamp").groupby(["Name", "Subject"])
    .resample(rule)
    .agg(Min=("Marks", "min"), Max=("Marks", "max"), STD=("Marks", "std"))
    .unstack(0)
    .swaplevel(0, 1).reset_index()
)
- First setting `TimeStamp` as index, and grouping by `Subject` and `Name` to get the right chunks to work on.
- Then resampling the groups with `.resample()` at the given frequency `rule`.
- Then aggregating the required stats by using `.agg()` with named aggregation (each keyword is an output name mapped to a `(column, function)` tuple).
- Unstacking the first index level (`Name`) to get it in the columns.
- Swapping the remaining index levels to get the right order when finally resetting the index.
Result for the given sample:
TimeStamp Subject Min Max STD
Name Chris Cuban Mark Chris Cuban Mark Chris Cuban Mark
0 2022-01-01 DK 70 58 80 75 60 80 3.535534 1.414214 NaN
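To see what each step contributes, here is a self-contained sketch that rebuilds the question's sample and stops once after the aggregation and once after the reshaping. It assumes `TimeStamp` can be parsed with `pd.to_datetime`, and it uses the dict form of `.agg()` instead of the named aggregation, so the stat labels come out as lowercase `min`/`max`/`std`. The variable names `stats` and `wide` are only illustrative.

```python
import pandas as pd

# Sample rows from the question. Assumption: TimeStamp is (or can be parsed
# to) a datetime type, which .resample() needs.
df = pd.DataFrame(
    {
        "TimeStamp": pd.to_datetime(
            [
                "2022-01-01 00:00:02.969",
                "2022-01-01 00:00:04.467",
                "2022-01-01 00:00:05.965",
                "2022-01-01 00:00:08.962",
                "2022-01-01 00:00:10.461",
            ]
        ),
        "Name": ["Chris", "Chris", "Mark", "Cuban", "Cuban"],
        "Marks": [70, 75, 80, 60, 58],
        "Subject": ["DK"] * 5,
    }
)

rule = "10min"

# After grouping, resampling and aggregating, the index has three levels:
# Name, Subject and TimeStamp (the start of each bin).
stats = (
    df.set_index("TimeStamp")
    .groupby(["Name", "Subject"])
    .resample(rule)
    .agg({"Marks": ["min", "max", "std"]})
    .droplevel(0, axis=1)  # drop the outer "Marks" column level
)
print(stats.index.names)

# unstack(0) moves Name into the columns; swaplevel + reset_index then leave
# TimeStamp and Subject as the two leading columns.
wide = stats.unstack(0).swaplevel(0, 1).reset_index()
print(wide.columns.tolist())
```

Printing the intermediate frames this way makes it easier to spot which step goes wrong if the final table does not look as expected.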
If you want the columns exactly like in your expected output, then you could add the following:
result = result[
    list(result.columns[:2]) + sorted(result.columns[2:], key=lambda c: c[1])
]
result.columns = [f"{lev1}_{lev0}" if lev1 else lev0 for lev0, lev1 in result.columns]
to get
TimeStamp Subject Chris_Min Chris_Max ... Cuban_STD Mark_Min Mark_Max Mark_STD
0 2022-01-01 DK 70 75 ... 1.414214 80 80 NaN
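Since the question asks for the bin size to be changeable, the whole pipeline could be wrapped in a small function that takes the frequency as a parameter. The following is only a sketch: `aggregate_marks` is a hypothetical name, the column names are taken from the question, and `TimeStamp` is assumed to already be a datetime column. It uses the dict form of `.agg()` plus a rename rather than named aggregation.

```python
import pandas as pd


def aggregate_marks(df: pd.DataFrame, rule: str = "20min") -> pd.DataFrame:
    """Hypothetical helper: per-Name min/max/std of Marks in `rule`-sized bins."""
    wide = (
        df.set_index("TimeStamp")
        .groupby(["Name", "Subject"])
        .resample(rule)
        .agg({"Marks": ["min", "max", "std"]})
        .droplevel(0, axis=1)  # drop the outer "Marks" column level
        .rename(columns={"min": "Min", "max": "Max", "std": "STD"})
        .unstack(0)  # move Name into the columns
        .swaplevel(0, 1)
        .reset_index()
    )
    # Order the stat columns by Name; sorted() is stable, so Min/Max/STD keep
    # their order within each name.
    wide = wide[
        list(wide.columns[:2]) + sorted(wide.columns[2:], key=lambda c: c[1])
    ]
    # Flatten the two column levels into names such as "Chris_Min".
    wide.columns = [
        f"{name}_{stat}" if name else stat for stat, name in wide.columns
    ]
    return wide
```

Calling `aggregate_marks(df, "10min")` or `aggregate_marks(df, "30min")` then only changes the bin width; the rest of the reshaping stays the same.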
If you're getting the error TypeError: aggregate() missing 1 required positional argument... (the comment that mentioned it is gone), then it could be that you're working with an older pandas version that can't deal with named aggregation. You could try the following instead:
rule = "10min"
result = (
    df.set_index("TimeStamp").groupby(["Name", "Subject"])
    .resample(rule)
    .agg({"Marks": ["min", "max", "std"]})
    .droplevel(0, axis=1)
    .unstack(0)
    .swaplevel(0, 1).reset_index()
)
...
CodePudding user response:
Is your table a pandas DataFrame? If so, you can use resample:
# only if timestamp is not the index yet:
df = df.set_index('TimeStamp')
# the important part, you can use any function in agg or some str for simple
# functions like mean:
df = df.resample('10min').agg(['max', 'min'])
# only if you had to set index to timestamp and want to go back to normal index:
df = df.reset_index()
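One thing worth checking before resampling: `resample` only works on a datetime-like index and raises a `TypeError` otherwise, so a `TimeStamp` column that was read in as plain strings has to be parsed first. A minimal sketch, using a hypothetical two-row frame with only the `TimeStamp` and `Marks` columns from the question:

```python
import pandas as pd

# Hypothetical frame whose TimeStamp column was read in as plain strings.
df = pd.DataFrame(
    {
        "TimeStamp": ["2022-01-01 00:00:02.969", "2022-01-01 00:00:04.467"],
        "Marks": [70, 75],
    }
)

# Parse the strings into datetime64 values before using them as the index.
df["TimeStamp"] = pd.to_datetime(df["TimeStamp"])

# Resample the numeric column into 10-minute bins and take max and min.
out = df.set_index("TimeStamp").resample("10min")["Marks"].agg(["max", "min"])
print(out)
```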
Edit to get second table in the function:
import pandas as pd

# choose aggregation functions
agg_functions = ['min', 'max', 'std']
# set_index on time column, keep only the numeric Marks column
# (std is not defined for the string columns), then resample
resampled_df = df.set_index('TimeStamp')[['Marks']].resample('10min').agg(agg_functions)
# flatten multiindex
resampled_df.columns = resampled_df.columns.map('_'.join)
# drop time column
resampled_df = resampled_df.reset_index(drop=True)
# concatenate with original df
result = pd.concat([df, resampled_df], axis=1)