Pandas groupby with "Wide" data

My dataframes are in a "wide" format, such that columns contain data for different entities, and rows are dates, e.g. product costs through time:

import pandas as pd

cost = pd.DataFrame(
    [[10, 20, 30, 40], [11, 22, 33, 44], [111, 222, 333, 444]],
    columns=['Knife', 'Fork', 'Spoon', 'Spatula'],
    index=pd.date_range('2022-01-01', '2022-01-03'),
)
            Knife  Fork  Spoon  Spatula
2022-01-01     10    20     30       40
2022-01-02     11    22     33       44
2022-01-03    111   222    333      444

I have a similar dataframe that places each data point in one of two groups (1 or 2):

grp = pd.DataFrame(
    [[1, 2, 1, 2], [1, 1, 2, 2], [2, 1, 1, 2]],
    columns=['Knife', 'Fork', 'Spoon', 'Spatula'],
    index=pd.date_range('2022-01-01', '2022-01-03'),
)
            Knife  Fork  Spoon  Spatula
2022-01-01      1     2      1        2
2022-01-02      1     1      2        2
2022-01-03      2     1      1        2

I want to compute a groupby-style mean over each group for each date. This is my desired output, showing the averages for groups 1 and 2:

avg = pd.DataFrame(
    [[(10 + 30) / 2, (20 + 40) / 2],
     [(11 + 22) / 2, (33 + 44) / 2],
     [(111 + 444) / 2, (222 + 333) / 2]],
    columns=[1, 2],
    index=pd.date_range('2022-01-01', '2022-01-03'),
)
                1      2
2022-01-01   20.0   30.0
2022-01-02   16.5   38.5
2022-01-03  277.5  277.5

I've tried to melt each dataframe, reindex, merge, groupby, aggregate, and pivot, but it's a real mess (especially the reindexing).

It seems to me there must be a better way to aggregate this kind of "wide" data. My actual data is much bigger (and has more groups), hence the desire for efficiency.

How can I most efficiently get my desired output?

CodePudding user response:

The plan is to stack each frame, join them, group by date and group label, and unstack:

dfm = cost.stack().to_frame(name='cost').join(grp.stack().to_frame(name='group'))
# on newer pandas, pass numeric_only=True so the string item column
# (level_1) is excluded from the mean
dfm.reset_index().groupby(['level_0', 'group']).mean(numeric_only=True).unstack()

output:

             cost
group           1      2
level_0
2022-01-01   20.0   30.0
2022-01-02   16.5   38.5
2022-01-03  277.5  277.5
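
To get columns named just 1 and 2, matching the desired output, a small follow-up (assigning the chain above to a name first; res is my own name):

res = dfm.reset_index().groupby(['level_0', 'group']).mean(numeric_only=True).unstack()
res.columns = res.columns.droplevel(0)  # drop the outer 'cost' column level
res.index.name = None                   # drop the 'level_0' index name
res.columns.name = None                 # drop the 'group' columns name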

CodePudding user response:

One approach to try:

# Convert each DataFrame into a MultiIndexed Series,
# with dates in the outer level and items in inner level
cost_long = cost.stack().rename('cost')
group_long = grp.stack().rename('group')

# Align long Series on date and item, 
# name the MultiIndex levels for clarity, 
# calculate mean cost by date and group, 
# unstack inner level of MultiIndex (group) into columns
res = (pd.concat([cost_long, group_long], axis=1) 
         .rename_axis(['date', 'item'])           
         .groupby(['date', 'group'])['cost']
         .mean()
         .unstack()
      )

# Delete names of index and columns to match expected output
res.index.name = None
res.columns.name = None

print(res)
                1      2
2022-01-01   20.0   30.0
2022-01-02   16.5   38.5
2022-01-03  277.5  277.5
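
Note that this approach generalizes beyond two groups: unstack() produces one column per distinct group label, with NaN wherever a group has no members on a given date, which suits the larger, many-group data mentioned in the question.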

CodePudding user response:

I wanted to understand @piterbarg's answer, so I reformatted it. I'm sharing that here, in case it's helpful for anyone else.

# Name the date index, stack the item columns into rows,
# and name the resulting value column
stacked_cost = (
    cost
    .rename_axis('date')
    .stack()
    .to_frame(name='cost')
)

stacked_grp = (
    grp
    .rename_axis('date')
    .stack()
    .to_frame(name='group')
)

# Join on the shared (date, item) index and average cost by date/group
avg = (
    stacked_cost
    .join(stacked_grp)
    .reset_index()
    .groupby(['date', 'group'])
    .mean(numeric_only=True)  # exclude the string item column from the mean
    .unstack()
)

print(avg)
             cost
group           1      2
date
2022-01-01   20.0   30.0
2022-01-02   16.5   38.5
2022-01-03  277.5  277.5
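
The rename_axis('date') calls are what let groupby(['date', 'group']) refer to the date level by name; without them the stacked index levels are unnamed and surface as level_0/level_1 after reset_index(), which is why the first answer groups on 'level_0'.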

CodePudding user response:

I've tried to melt each dataframe, reindex, merge, groupby, aggregate and pivot but it's a real mess (esp the reindexing).

So, is this what you tried?

import pandas as pd

cost_data = [[10,20,30,40],[11,22,33,44],[111,222,333,444]]
grp_data = [[1,2,1,2],[1,1,2,2],[2,1,1,2]]
utensils = ['Knife','Fork','Spoon','Spatula']
dates = pd.date_range('2022-01-01','2022-01-03')
cost = pd.DataFrame(cost_data, columns=utensils, index=dates)
grp = pd.DataFrame(grp_data, columns=utensils, index=dates)

cost_melted = (
    cost
    .melt(var_name='utensil', value_name='cost', ignore_index=False)
    .rename_axis('date')
    .reset_index()
)
grp_melted = (
    grp
    .melt(var_name='utensil', value_name='group', ignore_index=False)
    .rename_axis('date')
    .reset_index()
)
avg = (
    cost_melted
    .merge(grp_melted, on=['date', 'utensil'])
    .groupby(['date', 'group'])['cost']  # select cost so the string utensil column is excluded
    .mean()
    .reset_index()
    .pivot(index='date', columns='group', values='cost')  # keyword args required on pandas >= 2.0
)

print(avg)

That prints your expected output:

group           1      2
date
2022-01-01   20.0   30.0
2022-01-02   16.5   38.5
2022-01-03  277.5  277.5
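
For what it's worth, melt(ignore_index=False) is what preserves the date index here; the rename_axis('date')/reset_index() pair then turns it into an ordinary column that merge can use, which is presumably the reindexing step the question found messy.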

CodePudding user response:

Or... you can just do this:

import numpy as np

# caveat: treats 0 as missing, so this breaks if any real cost is 0
avg = pd.concat([((grp == 1) * cost).replace(0, np.nan).mean(axis=1),
                 ((grp == 2) * cost).replace(0, np.nan).mean(axis=1)], axis=1)
avg.columns = [1, 2]

Output:

                1       2
2022-01-01   20.0    30.0
2022-01-02   16.5    38.5
2022-01-03  277.5   277.5
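
A variant of the same idea that sidesteps the zero-replacement caveat and handles any number of groups, using DataFrame.where to mask instead of multiplying (a sketch of my own, not part of the answer above):

import numpy as np

# For each group label g, keep cost where grp == g (NaN elsewhere)
# and average across the row; genuine zero costs survive the mask.
groups = np.unique(grp.to_numpy())
avg = pd.concat(
    {g: cost.where(grp == g).mean(axis=1) for g in groups},
    axis=1,
)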