Home > Software engineering >  How to get number of occurrences in pandas dataframe with a condition
How to get number of occurrences in pandas dataframe with a condition

Time:10-25

I have a dataframe that looks like following:

 product  stars
  10717  4
  10717  5
  10717  5
  10717  5
  10717  3
  10717  2
  10717  2
  10711  2
  10711  1
  10711  5
  10711  1
  10711  1
  10711  5
  10711  2

I have thousands rows.
I want to calculate for each distinct product the number of occurences of each star (from 1 to 5).
How can I achieve this?

What I tried is to get the list of the distinct products with: dp = df.product.unique()
Than I looped over it:

for key in dd:    
    sf_1[(sf_1['product_sku'] == key)].value_counts()

The result looks like:

product  stars
10717        5         3
             4         1
             3         1
             2         2
dtype: int64

product  stars
10711        5        2
             2        3
             1        2
dtype: int64

What I need is new dataframe which looks like

product     stars    number_stars
10717         5        3
10717         4        1
10717         3        1
10717         2        2 
10717         1        0
10711         5        2
10711         4        0
10711         3        0
10711         2        3
10711         1        2

CodePudding user response:

this should work:

import pandas as pd

df = pd.DataFrame([
  [10717,  4],
  [10717,  5],
  [10717,  5],
  [10717,  5],
  [10717,  3],
  [10717,  2],
  [10717,  2],
  [10711,  2],
  [10711,  1],
  [10711,  5],
  [10711,  1],
  [10711,  1],
  [10711,  5],
  [10711,  2]
  ],columns=['product','stars']
  )
  
newdf = df.groupby(['product','stars']).size()
newdf = newdf.reset_index()
newdf = newdf.rename(columns={0:'number_stars'})

result:

>>> newdf
   product  stars  number_stars
0    10711      1             3
1    10711      2             2
2    10711      5             2
3    10717      2             2
4    10717      3             1
5    10717      4             1
6    10717      5             3

CodePudding user response:

The groupby and .size do the counting of the number of stars per product. Lines 2 and 3 are just formatting the data to get it to look like you showed in the question, you may not actually need them.

df.groupby(["product", "stars"]).size() \
  .unstack(fill_value=0).stack() \
  .to_frame("number_stars").reset_index()

   product  stars  number_stars
0    10711      1             3
1    10711      2             2
2    10711      3             0
3    10711      4             0
4    10711      5             2
5    10717      1             0
6    10717      2             2
7    10717      3             1
8    10717      4             1
9    10717      5             3

CodePudding user response:

One way to do this is to produce a dummy dataframe of all the combination of products and stars, and then left merge this with the counts from the grouped original dataframe, filling NaN with 0:

stars = pd.DataFrame([(p, i) for p in df['product'].unique() for i in range(5, 0, -1)], columns = ['product', 'stars'])
counts = df.groupby(['product', 'stars']).size().reset_index(name='number_stars')    
result = stars.merge(counts, on=['product', 'stars'], how='left')
result['number_stars'] = result['number_stars'].fillna(0).astype(int)

Output for your sample data:

   product  stars  number_stars
0    10717      5             3
1    10717      4             1
2    10717      3             1
3    10717      2             2
4    10717      1             0
5    10711      5             2
6    10711      4             0
7    10711      3             0
8    10711      2             2
9    10711      1             3

CodePudding user response:

A possible solution, based on pandas.DataFrame.value_counts, pandas.DataFrame.reindex and pandas.MultiIndex.from_product (the reindex and MultiIndex part is only needed to get the zero counts):

stars = list(range(1,6))
cols = ['product', 'stars']

(df.value_counts(cols)
 .reindex(pd.MultiIndex.from_product([df['product'].unique(),stars], names=cols),
          fill_value=0)
 .rename('number_stars').reset_index())

Output:

   product  stars  number_stars
0    10717      1             0
1    10717      2             2
2    10717      3             1
3    10717      4             1
4    10717      5             3
5    10711      1             3
6    10711      2             2
7    10711      3             0
8    10711      4             0
9    10711      5             2
  • Related