python dataframe group by column unique values and count group size

Time:03-22

I have a pandas DataFrame df where the 'categories' column holds a variable number of string labels for each row. Each label represents a sub-category:

id        categories                number
1         'food','toy','science'    1
2         'animal'                  2
3         'plant','food','science'  5
....

How can I group by each sub-category in the 'categories' column and count the size of each group?

df.groupby('categories').size()

If I calculate it this way, ids 1 and 3 end up in different groups, even though both belong to the 'food' sub-category.
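To make the problem concrete, here is a minimal sketch that rebuilds the three sample rows and groups on the raw column. It assumes each 'categories' cell is a single string of quoted, comma-separated labels; the variable name raw_counts is only illustrative.

```python
import pandas as pd

# Assumption: each 'categories' cell is one string of quoted, comma-separated labels.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "categories": ["'food','toy','science'", "'animal'", "'plant','food','science'"],
    "number": [1, 2, 5],
})

# Grouping on the raw strings treats every distinct combination as its own group,
# so the shared 'food' and 'science' labels are never counted together.
raw_counts = df.groupby("categories").size()
print(raw_counts)
```

Each of the three rows forms its own group of size 1, which is not the per-label count I am after.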

I could split the 'categories' column into separate columns and then do the groupby column by column, but since I have 200 strings in that column, this sounds cumbersome. Is there a more elegant solution?

If I call df.to_dict(), the result looks like:

{'id': {0: 1,
  1: 2,
  2: 3},
 'categories': {0: "'food','toy','science'",
                1: "'animal'",
                2: "'plant','food','science'"}}

If I only use the top 3 rows as an example, the expected output is:

categories      size
food            2
toy             1
animal          1
plant           1
science         2

CodePudding user response:

We can use explode to create one row per sub-category in the categories column.

Starting with the DataFrame df:

   id                categories  number
0   1              'food','toy'       1
1   2                  'animal'       2
2   3  'plant','food','science'       5

Code

# Steps
df['categories'] = df['categories'].str.split(',')    # convert each categories string to a list
df2 = df.explode('categories')                        # explode: one sub-category per row
result = df2.groupby('categories').size()             # count the rows in each sub-category group

result

categories
'animal'     1
'food'       2
'plant'      1
'science'    1
'toy'        1
dtype: int64
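The steps above can be put together as a self-contained sketch. It assumes the three-row frame from the question (with 'science' in the first row), and it additionally strips the surrounding quotes and whitespace from each label. That stripping step is an addition beyond the answer's code, so the index reads food rather than 'food'.

```python
import pandas as pd

# Assumption: the question's three sample rows, with labels stored as quoted strings.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "categories": ["'food','toy','science'", "'animal'", "'plant','food','science'"],
    "number": [1, 2, 5],
})

# Split each string into a list, then explode so each row holds one label.
exploded = df.assign(categories=df["categories"].str.split(",")).explode("categories")

# Extra step (not in the answer above): drop surrounding spaces and single quotes.
exploded["categories"] = exploded["categories"].str.strip(" '")

# Count how many rows carry each label.
result = exploded.groupby("categories").size()
print(result)
```

Using assign keeps the original df untouched, which may be convenient if the raw strings are needed later.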

CodePudding user response:

Let us try str.extractall then value_counts

out = df.categories.str.extractall("'([^']*)'")[0].value_counts()

out

science    2
food       2
plant      1
toy        1
animal     1
Name: 0, dtype: int64
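Here is the same approach as a self-contained sketch, assuming the question's three rows are stored as quoted, comma-separated strings. The pattern '([^']*)' captures the text between each pair of single quotes, extractall returns one row per match, and column 0 holds the captured label. The intermediate name matches is only illustrative.

```python
import pandas as pd

# Assumption: the question's three sample rows, with labels stored as quoted strings.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "categories": ["'food','toy','science'", "'animal'", "'plant','food','science'"],
    "number": [1, 2, 5],
})

# One row per regex match; column 0 is the first (and only) capture group.
matches = df["categories"].str.extractall(r"'([^']*)'")

# Count how often each captured label occurs across all rows.
out = matches[0].value_counts()
print(out)
```

Because the capture group excludes the quotes, no separate cleanup step is needed. The order of labels with equal counts may differ between pandas versions.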