Home > Back-end >  grouby function with filtered conditions
grouby function with filtered conditions

Time:04-01

f = pd.DataFrame({'Movie': ['name1','name2','name3']
                  'genre': [['comedy', 'action'];['comedy','scifi']; 
                            ['thriller','action']]
                  'distributor': ['disney', 'disney','parmount'})

#what if the genre has multiple values in it now name is both part of genre[0] and genre[1] if i use groupby

res = f[f['distributor'] == 'disney'].groupby(['genre'])

desired output

only want movie launched by disney

distributor     genre     count of movies
   disney        action        1
   disney        comedy        2
   disney         scifi         1

CodePudding user response:

Easy and clear to do with datar that reimagines pandas' APIs:

>>> import pandas as pd
>>> df = pd.DataFrame({'Movie': ['name1','name2','name3'],
...                   'genre': [['comedy', 'action'], ['comedy','scifi'],
...                             ['thriller','action']],
...                   'distributor': ['disney', 'disney','parmount']})
>>> df
   Movie               genre distributor
0  name1    [comedy, action]      disney
1  name2     [comedy, scifi]      disney
2  name3  [thriller, action]    paramount
>>>
>>> from datar.all import f, filter, unchop, count
[2022-03-31 11:47:44][datar][WARNING] Builtin name "filter" has been overriden by datar.
>>> (
...     df 
...     >> filter(f.distributor == "disney") 
...     >> unchop(f.genre) 
...     >> count(f.distributor, f.genre)
... )
  distributor    genre       n
     <object> <object> <int64>
0      disney   comedy       2
1      disney   action       1
2      disney    scifi       1
[TibbleGrouped: distributor (n=1)]

CodePudding user response:

Explode your list then count values:

out = df.loc[df['distributor'] == 'disney', 'genre'].explode().value_counts()
print(out)

# Output
comedy    2
action    1
scifi     1
Name: genre, dtype: int64

Update

out = (df.explode('genre').query("distributor == 'disney'")
        .value_counts(['distributor', 'genre'], sort=False)
        .rename('count').reset_index())
print(out)

# Output
  distributor   genre  count
0      disney  action      1
1      disney  comedy      2
2      disney   scifi      1

Update 2

It seems your genre column does not contain a list but a strings. Try to convert this column as a list with ast.literal_eval before using the code above:

import ast

df['genre'] = df['genre'].apply(ast.literal_eval)

# OR

df['genre'] = pd.eval(df['genre'])

# Execute now df.explode(...)...
  • Related