The following is an example of items rated by 1,2 or 3 stars. I am trying to count all combinations of item ratings (stars) per month.
In the following example, item 10 was rated in month 1 and has two ratings equal 1, one rating equal 2 and one rating equal 3.
inp = pd.DataFrame({'month':[1,1,1,1,1,2,2,2],
'item':[10,10,10,10,20,20,20,20],
'star':[1,2,1,3,3,2,2,3]}
)
month item star
0 1 10 1
1 1 10 2
2 1 10 1
3 1 10 3
4 1 20 3
5 2 20 2
6 2 20 2
7 2 20 3
For the given above input frame output should be:
month item star_1_cnt star_2_cnt star_3_cnt
0 1 10 2 1 1
1 1 20 0 0 1
2 2 20 0 2 1
I am trying to solve the problem starting with the following code, which result still needs to be converted to the desired format of the output frame and which gives the wrong answers:
1 20 3 (1, 1)
2 20 3 (1, 1)
Anyway, there should be a better way to create the output table, then finalizing this one:
months = [1,2]
items = [10,20]
stars = [1,2,3]
d = {'month': [], 'item': [], 'star': [], 'star_cnts': [] }
for month in months:
for star in stars:
for item in items:
star_cnts=dict(inp[(inp['item']==item) & (inp['star']==star)].value_counts()).values()
d['month'].append(month)
d['item'].append(item)
d['star'].append(star)
d['star_cnts'].append(star_cnts)
pd.DataFrame(d)
month item star star_cnts
0 1 10 1 (2)
1 1 20 1 ()
2 1 10 2 (1)
3 1 20 2 (2)
4 1 10 3 (1)
5 1 20 3 (1, 1)
6 2 10 1 (2)
7 2 20 1 ()
8 2 10 2 (1)
9 2 20 2 (2)
10 2 10 3 (1)
11 2 20 3 (1, 1)
CodePudding user response:
You can use value_counts
on the whole DataFrame and unstack:
out = (
inp.value_counts()
.unstack('star', fill_value=0)
)
Alternatively, you can use crosstab
.
As it requires 1D data as input, you can use tuples:
cols = ['month','item']
out = pd.crosstab(inp[cols].apply(tuple, axis=1), inp['star'])
out.index = pd.MultiIndex.from_tuples(out.index, names=cols)
Output:
star 1 2 3
month item
1 10 2 1 1
20 0 0 1
2 20 0 2 1
formated output
This was already nicely demonstrated by @ansev
(inp.value_counts()
.unstack('star', fill_value=0)
.rename(lambda c: f'star_{c}_cnt', axis=1)
.reset_index()
.rename_axis(columns=None)
)
Output:
month item star_1_cnt star_2_cnt star_3_cnt
0 1 10 2 1 1
1 1 20 0 0 1
2 2 20 0 2 1
CodePudding user response:
One option, with pivot_table :
(inp
.pivot_table(index=['month', 'item'],
values = 'star',
columns='star',
aggfunc='size',
fill_value = 0)
.rename(columns = lambda col: f"star_{col}_cnt")
.rename_axis(columns = None)
.reset_index()
)
month item star_1_cnt star_2_cnt star_3_cnt
0 1 10 2 1 1
1 1 20 0 0 1
2 2 20 0 2 1
Another option, with groupby(pivot_table is a wrapper around groupby unstack):
(inp
.groupby(['month', 'item', 'star'])
.size()
.unstack(fill_value = 0)
.rename(columns = lambda col: f"star_{col}_cnt")
.rename_axis(columns = None)
.reset_index()
)
month item star_1_cnt star_2_cnt star_3_cnt
0 1 10 2 1 1
1 1 20 0 0 1
2 2 20 0 2 1
Another option, with pd.get_dummies
groupby
:
(pd.get_dummies(inp, columns=['star'])
.groupby(['month', 'item'])
.sum()
.add_suffix('_cnt')
.reset_index()
)
month item star_1_cnt star_2_cnt star_3_cnt
0 1 10 2 1 1
1 1 20 0 0 1
2 2 20 0 2 1
performance wise, only tests can tell - I'd expect pivot_table to be slower than the groupby
CodePudding user response:
Series.value_counts
Series.unstack
to convert to dataframe
out = inp.value_counts()\
.unstack('star', fill_value=0)\
.rename(lambda x: f'star_{x}_cnt', axis=1)\
.reset_index().rename_axis(columns=None)
print(out)
month item star_1_cnt star_2_cnt star_3_cnt
0 1 10 2 1 1
1 1 20 0 0 1
2 2 20 0 2 1