Home > Blockchain >  pandas: combine size and sum in a single groupby?
pandas: combine size and sum in a single groupby?

Time:08-06

I have a dataframe of houses, one row per house, that looks like this:

data = [
  ['Oxford', 2016, True],
  ['Oxford', 2016, True],
  ['Oxford', 2018, False],
  ['Cambridge', 2016, False],
  ['Cambridge', 2016, True]
]
df = pd.DataFrame(data, columns=['town', 'year', 'is_detached'])
        town  year  is_detached
0     Oxford  2016         True
1     Oxford  2016         True
2     Oxford  2018        False
3  Cambridge  2016        False
4  Cambridge  2016         True

And I want to end up with a table that looks like this:

        town  total_houses_2016  total_houses_2018  is_detached_2016  is_detached_2018
0     Oxford                  2                  1                 2                 0
1  Cambridge                  2                  0                 1                 0

Currently I'm doing two separate groupby calls, and then joining them together:

by_town_totals = df.groupby([df.town, df.year])\
    .size()\
    .reset_index()\
    .pivot(index=["town"], columns="year", values=0).fillna(0)\
    .add_prefix('total_houses_')
by_town_detached = df.groupby([df.town, df.year])\
    .is_detached.sum().reset_index()\
    .pivot(index=["town"], columns="year", values="is_detached").fillna(0)\
    .add_prefix('is_detached_')
by_town = pd.concat([by_town_totals, by_town_detached], axis=1).reset_index()

Is there a way I could do this with a single groupby?

CodePudding user response:

data = [
  ['Oxford', 2016, True],
  ['Oxford', 2016, True],
  ['Oxford', 2018, False],
  ['Cambridge', 2016, False],
  ['Cambridge', 2016, True]
]
df = pd.DataFrame(data, columns=['town', 'year', 'is_detached'])
data = df.groupby(['town', 'year']).agg({'is_detached': 'sum', 'year': 'count'}).rename(columns={'year': 'total_houses'}).reset_index().pivot(index='town', columns='year', values=['is_detached', 'total_houses']).fillna(0)
data.columns = ['_'.join(map(str, col)).strip() for col in data.columns.values]
data.reset_index()[['town','total_houses_2016','total_houses_2018','is_detached_2016','is_detached_2018']]

CodePudding user response:

df.year = df.year.astype(str)
df = df.pivot_table(index='town', 
                    columns='year', 
                    values='is_detached', 
                    aggfunc=['size', 'sum'], 
                    fill_value=0)
df.columns = (df.columns.to_flat_index()
                .str.join('_')
                .str.replace('size','total_houses')
                .str.replace('sum', 'is_detached'))
print(df.reset_index())

Output:

        town  total_houses_2016  total_houses_2018  is_detached_2016  is_detached_2018
0  Cambridge                  2                  0                 1                 0
1     Oxford                  2                  1                 2                 0

CodePudding user response:

Check Below code:

new_df = pd.crosstab(df['town'], df['year'], df['is_detached'], aggfunc=['size','sum']).reset_index().fillna(0)

new_df.columns = ['is_detached_' str(i[1]) if i[0]=='sum' else 'total_houses_' str(i[1]) if i[0]=='size' else i[0] for i in new_df.columns]

new_df

Output:

enter image description here

  • Related