I have a dataframe that looks like the following:
product stars
10717 4
10717 5
10717 5
10717 5
10717 3
10717 2
10717 2
10711 2
10711 1
10711 5
10711 1
10711 1
10711 5
10711 2
I have thousands of rows.
I want to calculate, for each distinct product, the number of occurrences of each star rating (from 1 to 5).
How can I achieve this?
What I tried is to get the list of the distinct products with:
dp = df['product'].unique()
Then I looped over it:
for key in dp:
    df[df['product'] == key].value_counts()
The result looks like:
product stars
10717 5 3
4 1
3 1
2 2
dtype: int64
product stars
10711 1 3
5 2
2 2
dtype: int64
What I need is new dataframe which looks like
product stars number_stars
10717 5 3
10717 4 1
10717 3 1
10717 2 2
10717 1 0
10711 5 2
10711 4 0
10711 3 0
10711 2 2
10711 1 3
CodePudding user response:
This should work:
import pandas as pd
df = pd.DataFrame([
[10717, 4],
[10717, 5],
[10717, 5],
[10717, 5],
[10717, 3],
[10717, 2],
[10717, 2],
[10711, 2],
[10711, 1],
[10711, 5],
[10711, 1],
[10711, 1],
[10711, 5],
[10711, 2]
],columns=['product','stars']
)
newdf = df.groupby(['product', 'stars']).size().reset_index(name='number_stars')
result:
>>> newdf
product stars number_stars
0 10711 1 3
1 10711 2 2
2 10711 5 2
3 10717 2 2
4 10717 3 1
5 10717 4 1
6 10717 5 3
CodePudding user response:
The groupby and .size do the counting of the number of stars per product. The unstack(fill_value=0).stack() step fills in the zero counts for star values a product never received, and the final line just reshapes the result to match the output shown in the question; you may not actually need those two steps.
df.groupby(["product", "stars"]).size() \
.unstack(fill_value=0).stack() \
.to_frame("number_stars").reset_index()
product stars number_stars
0 10711 1 3
1 10711 2 2
2 10711 3 0
3 10711 4 0
4 10711 5 2
5 10717 1 0
6 10717 2 2
7 10717 3 1
8 10717 4 1
9 10717 5 3
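One caveat with the unstack approach: it can only create columns for star values that appear somewhere in the data, so a star that no product ever received would be missing entirely. A small sketch (using hypothetical reduced sample data where only stars 2 and 5 occur) showing how reindexing the columns guarantees all five values:

```python
import pandas as pd

# reduced sample data in which stars 1, 3 and 4 never occur at all
df = pd.DataFrame({"product": [10717, 10717, 10711], "stars": [5, 5, 2]})

# unstack alone would only produce columns for stars 2 and 5;
# reindexing the columns forces all five star values to exist
out = (
    df.groupby(["product", "stars"]).size()
      .unstack(fill_value=0)
      .reindex(columns=pd.Index(range(1, 6), name="stars"), fill_value=0)
      .stack()
      .to_frame("number_stars")
      .reset_index()
)
```

Passing a named pd.Index to reindex keeps the "stars" level name so reset_index still produces the expected column names.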
CodePudding user response:
One way to do this is to produce a dummy dataframe of all the combinations of products and stars, and then left-merge it with the counts from the grouped original dataframe, filling NaN with 0:
stars = pd.DataFrame([(p, i) for p in df['product'].unique() for i in range(5, 0, -1)], columns = ['product', 'stars'])
counts = df.groupby(['product', 'stars']).size().reset_index(name='number_stars')
result = stars.merge(counts, on=['product', 'stars'], how='left')
result['number_stars'] = result['number_stars'].fillna(0).astype(int)
Output for your sample data:
product stars number_stars
0 10717 5 3
1 10717 4 1
2 10717 3 1
3 10717 2 2
4 10717 1 0
5 10711 5 2
6 10711 4 0
7 10711 3 0
8 10711 2 2
9 10711 1 3
CodePudding user response:
A possible solution, based on pandas.DataFrame.value_counts, pandas.DataFrame.reindex and pandas.MultiIndex.from_product (the reindex and MultiIndex part is only needed to get the zero counts):
stars = list(range(1,6))
cols = ['product', 'stars']
(df.value_counts(cols)
.reindex(pd.MultiIndex.from_product([df['product'].unique(),stars], names=cols),
fill_value=0)
.rename('number_stars').reset_index())
Output:
product stars number_stars
0 10717 1 0
1 10717 2 2
2 10717 3 1
3 10717 4 1
4 10717 5 3
5 10711 1 3
6 10711 2 2
7 10711 3 0
8 10711 4 0
9 10711 5 2
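For completeness, pd.crosstab computes the same product-by-stars count table in a single call; a small sketch using the question's sample data (the reindex is again only needed to force the zero rows for stars 1 to 5):

```python
import pandas as pd

df = pd.DataFrame(
    [[10717, 4], [10717, 5], [10717, 5], [10717, 5], [10717, 3],
     [10717, 2], [10717, 2], [10711, 2], [10711, 1], [10711, 5],
     [10711, 1], [10711, 1], [10711, 5], [10711, 2]],
    columns=["product", "stars"],
)

# crosstab yields one row per product and one column per star value;
# reindexing the columns guarantees stars 1-5 even if some never occur
table = pd.crosstab(df["product"], df["stars"]).reindex(
    columns=pd.Index(range(1, 6), name="stars"), fill_value=0
)

# back to long form: one (product, stars, number_stars) row per combination
result = table.stack().rename("number_stars").reset_index()
```

This gives the same ten rows as the other answers, just ordered by product then star value.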