I have a dataset where I would like to:
- group by location and box and take distinct count of the box
- create column headers with the values in the status column and include its count based on the box
Data
ID location type box status
aa NY no box55 hey
aa NY no box55 hi
aa NY yes box66 hello
aa NY yes box66 goodbye
aa CA no box11 hey
aa CA no box11 hi
aa CA yes box11 hello
aa CA yes box11 goodbye
aa CA no box86 hey
aa CA no box86 hi
aa CA yes box86 hello
aa CA yes box99 goodbye
aa CA no box99 hey
aa CA no box99 hi
Desired
location box count box hey hi hello goodbye
NY 2 box55 1 1 0 0
NY 2 box66 0 0 1 1
CA 3 box11 1 1 1 1
CA 3 box86 1 1 1 0
CA 3 box99 1 1 0 1
Doing
df['box count'] = df.groupby(['location','box'])['box'].size()
t = pd.get_dummies(df, prefix_sep='', prefix='',
columns=['status']).groupby(['box', 'location'],
as_index=False).sum().assign(count=df.groupby(['box', 'location'],
as_index=False)['status'].size()['size'])
Any suggestion is appreciated.
CodePudding user response:
Try making two dataframes: first with .groupby()
, second with pd.crosstab
. Then just pd.concat
them:
df1 = df.groupby(["location", "box"]).agg(**{"box count": ("box", "size")})
df2 = pd.crosstab([df["location"], df["box"]], df["status"])
df_out = pd.concat([df1, df2], axis=1)
print(df_out.reset_index())
Prints:
location box box count goodbye hello hey hi
0 CA box11 4 1 1 1 1
1 CA box86 3 0 1 1 1
2 CA box99 3 1 0 1 1
3 NY box55 2 0 0 1 1
4 NY box66 2 1 1 0 0
EDIT:
m = df.groupby(["location"])["box"].nunique()
df1 = df.groupby(["location", "box"]).agg(
**{
"box count": (
"location",
lambda x: m[x.iat[0]],
)
}
)
df2 = pd.crosstab([df["location"], df["box"]], df["status"])
df_out = pd.concat([df1, df2], axis=1)
print(df_out.reset_index())
Prints:
location box box count goodbye hello hey hi
0 CA box11 3 1 1 1 1
1 CA box86 3 0 1 1 1
2 CA box99 3 1 0 1 1
3 NY box55 2 0 0 1 1
4 NY box66 2 1 1 0 0