My dataframes are in a "wide" format, such that columns contain data for different entities, and rows are dates, e.g. product costs through time:
import pandas as pd
cost = pd.DataFrame([[10,20,30,40],[11,22,33,44],[111,222,333,444]], columns=['Knife','Fork','Spoon','Spatula'], index=pd.date_range('2022-01-01','2022-01-03'))
Knife Fork Spoon Spatula
2022-01-01 10 20 30 40
2022-01-02 11 22 33 44
2022-01-03 111 222 333 444
I have a similar dataframe that places each data point in one of two groups (1 or 2):
grp = pd.DataFrame([[1,2,1,2],[1,1,2,2],[2,1,1,2]], columns=['Knife','Fork','Spoon','Spatula'], index=pd.date_range('2022-01-01','2022-01-03'))
Knife Fork Spoon Spatula
2022-01-01 1 2 1 2
2022-01-02 1 1 2 2
2022-01-03 2 1 1 2
I want to do an aggregate groupby-style mean on each of the groups for each date. As such, this is my desired output, showing the averages for groups 1 and 2:
avg = pd.DataFrame([[(10+30)/2,(20+40)/2],[(11+22)/2,(33+44)/2],[(111+444)/2,(222+333)/2]], columns=[1,2], index=pd.date_range('2022-01-01','2022-01-03'))
1 2
2022-01-01 20.0 30.0
2022-01-02 16.5 38.5
2022-01-03 277.5 277.5
I've tried to melt each dataframe, reindex, merge, groupby, aggregate and pivot but it's a real mess (esp the reindexing).
It seems to me there must be a better way to aggregate this kind of "wide" data? My actual data is way bigger (and with more groups), hence a desire for efficiency.
How can I most efficiently gain my desired output?
CodePudding user response:
The plan is to stack each frame, join them, groupby by date/group, and unstack:
dfm = cost.stack().to_frame(name = 'cost').join(grp.stack().to_frame(name = 'group'))
dfm.reset_index().groupby(['level_0','group'])[['cost']].mean().unstack()
output:
cost
group 1 2
level_0
2022-01-01 20.0 30.0
2022-01-02 16.5 38.5
2022-01-03 277.5 277.5
CodePudding user response:
One approach to try:
# Convert each DataFrame into a MultiIndexed Series,
# with dates in the outer level and items in inner level
cost_long = cost.stack().rename('cost')
group_long = grp.stack().rename('group')
# Align long Series on date and item,
# name the MultiIndex levels for clarity,
# calculate mean cost by date and group,
# unstack inner level of MultiIndex (group) into columns
res = (pd.concat([cost_long, group_long], axis=1)
.rename_axis(['date', 'item'])
.groupby(['date', 'group'])['cost']
.mean()
.unstack()
)
# Delete names of index and columns to match expected output
res.index.name = None
res.columns.name = None
print(res)
1 2
2022-01-01 20.0 30.0
2022-01-02 16.5 38.5
2022-01-03 277.5 277.5
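To double-check that res really matches the frame from the question, here is a quick self-contained sanity check (a sketch; the expected avg frame is written out literally, and check_freq=False is needed because the groupby result drops the daily frequency that date_range attaches):

```python
import pandas as pd

cost = pd.DataFrame([[10, 20, 30, 40], [11, 22, 33, 44], [111, 222, 333, 444]],
                    columns=['Knife', 'Fork', 'Spoon', 'Spatula'],
                    index=pd.date_range('2022-01-01', '2022-01-03'))
grp = pd.DataFrame([[1, 2, 1, 2], [1, 1, 2, 2], [2, 1, 1, 2]],
                   columns=['Knife', 'Fork', 'Spoon', 'Spatula'],
                   index=pd.date_range('2022-01-01', '2022-01-03'))

# Same pipeline as above: stack to long form, align, group, unstack
cost_long = cost.stack().rename('cost')
group_long = grp.stack().rename('group')
res = (pd.concat([cost_long, group_long], axis=1)
       .rename_axis(['date', 'item'])
       .groupby(['date', 'group'])['cost']
       .mean()
       .unstack())
res.index.name = None
res.columns.name = None

# Expected output, written out by hand
avg = pd.DataFrame([[20.0, 30.0], [16.5, 38.5], [277.5, 277.5]],
                   columns=[1, 2],
                   index=pd.date_range('2022-01-01', '2022-01-03'))

# Raises AssertionError if values, index, or columns differ
pd.testing.assert_frame_equal(res, avg, check_freq=False)
```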
CodePudding user response:
I wanted to understand @piterbarg's answer, so I reformatted it. I'm sharing that here, in case it's helpful for anyone else.
stacked_cost = (
cost
.rename_axis('date')
.stack()
.to_frame(name='cost')
)
stacked_grp = (
grp
.rename_axis('date')
.stack()
.to_frame(name='group')
)
avg = (
stacked_cost
.join(stacked_grp)
.reset_index()
.groupby(['date','group'])
.mean()
.unstack()
)
print(avg)
cost
group 1 2
date
2022-01-01 20.0 30.0
2022-01-02 16.5 38.5
2022-01-03 277.5 277.5
CodePudding user response:
I've tried to melt each dataframe, reindex, merge, groupby, aggregate and pivot but it's a real mess (esp the reindexing).
So, is this what you tried?
import pandas as pd
cost_data = [[10,20,30,40],[11,22,33,44],[111,222,333,444]]
grp_data = [[1,2,1,2],[1,1,2,2],[2,1,1,2]]
utensils = ['Knife','Fork','Spoon','Spatula']
dates = pd.date_range('2022-01-01','2022-01-03')
cost = pd.DataFrame(cost_data, columns=utensils, index=dates)
grp = pd.DataFrame(grp_data, columns=utensils, index=dates)
cost_melted = (
cost
.melt(var_name='utensil', value_name='cost', ignore_index=False)
.rename_axis('date')
.reset_index()
)
grp_melted = (
grp
.melt(var_name='utensil', value_name='group', ignore_index=False)
.rename_axis('date')
.reset_index()
)
avg = (
    cost_melted
    .merge(grp_melted, on=['date', 'utensil'])
    .groupby(['date', 'group'])['cost']
    .mean()
    .reset_index()
    .pivot(index='date', columns='group', values='cost')
)
print(avg)
That prints your expected output:
group 1 2
date
2022-01-01 20.0 30.0
2022-01-02 16.5 38.5
2022-01-03 277.5 277.5
CodePudding user response:
Or... you can just do this (note it replaces 0 with NaN to drop masked-out cells, so it will misbehave if a genuine cost is ever 0):
import numpy as np

avg = pd.concat([((grp == 1) * cost).replace(0, np.nan).mean(axis=1),
                 ((grp == 2) * cost).replace(0, np.nan).mean(axis=1)], axis=1)
avg.columns = [1, 2]
Output:
1 2
2022-01-01 20.0 30.0
2022-01-02 16.5 38.5
2022-01-03 277.5 277.5
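A variant of the same masking idea, sketched with DataFrame.where instead of multiply-and-replace, so it scales to any number of groups and stays correct even when a genuine cost is 0 (cost and grp are the frames from the question; the labels variable is introduced here for illustration):

```python
import pandas as pd

cost = pd.DataFrame([[10, 20, 30, 40], [11, 22, 33, 44], [111, 222, 333, 444]],
                    columns=['Knife', 'Fork', 'Spoon', 'Spatula'],
                    index=pd.date_range('2022-01-01', '2022-01-03'))
grp = pd.DataFrame([[1, 2, 1, 2], [1, 1, 2, 2], [2, 1, 1, 2]],
                   columns=['Knife', 'Fork', 'Spoon', 'Spatula'],
                   index=pd.date_range('2022-01-01', '2022-01-03'))

# For each group label, mask cells belonging to other groups (where -> NaN),
# then take the row-wise mean; mean(axis=1) skips NaN automatically.
labels = sorted(pd.unique(grp.values.ravel()))
avg = pd.concat({g: cost.where(grp == g).mean(axis=1) for g in labels}, axis=1)
print(avg)
```

Because everything stays in the wide layout (no stacking or merging), this avoids building the long intermediate frame entirely, which can matter on the larger data the question mentions.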