I have a pandas DataFrame df
where the 'categories' column contains a variable number of string labels for each row. Each label represents a sub-category:
id categories number
1 'food','toy','science' 1
2 'animal' 2
3 'plant','food','science' 5
....
How can I group by each sub-category in the 'categories' column and count how many rows belong to it?
df.groupby('categories').size()
Calculated this way, id 1 and id 3 end up in different groups, even though both belong to the 'food' sub-category.
I could split the 'categories' column into separate columns and then do the groupby column by column, but since I have 200 strings in that
column, this sounds cumbersome.
Is there a more elegant solution?
If I use df.to_dict(),
it looks like this:
{'id': {0: 1,
1: 2,
2: 3},
'categories': {0: 'food','toy','science',
1: 'animal',
2: 'plant','food','science'}}
Using only the top 3 rows as an example, the expected output would be:
categories size
food 2
toy 1
animal 1
plant 1
science 2
CodePudding user response:
We can use explode to create one row per sub-category in the categories column.
Starting with this DataFrame df:
id categories number
0 1 'food','toy' 1
1 2 'animal' 2
2 3 'plant','food','science' 5
Code
# Convert categories from a comma-separated string to a list
df['categories'] = df['categories'].str.split(',')
# Explode so the categories column holds one sub-category per row
df2 = df.explode('categories')
# Group by sub-category and count the rows in each group
result = df2.groupby('categories').size()
result
categories
'animal' 1
'food' 2
'plant' 1
'science' 1
'toy' 1
dtype: int64
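Note that splitting on ',' leaves the single quotes as part of each label, which is why they appear in the index above. As a minimal, self-contained sketch of the same split-and-explode idea, the snippet below rebuilds the three sample rows from the question (so 'science' appears twice) and strips the quotes before counting. It assumes each 'categories' cell is a single string of quoted, comma-separated labels; the variable names labels and counts are only illustrative.

```python
import pandas as pd

# Sample rows from the question. Assumption: each 'categories' cell is one
# string holding quoted, comma-separated labels.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "categories": [
        "'food','toy','science'",
        "'animal'",
        "'plant','food','science'",
    ],
    "number": [1, 2, 5],
})

labels = (
    df["categories"]
    .str.split(",")    # string -> list of quoted labels
    .explode()         # one label per row, original index repeated
    .str.strip(" '")   # drop surrounding spaces and single quotes
)

# Group the cleaned labels by their own values and count each group
counts = labels.groupby(labels).size()
print(counts)
```

Working on the exploded Series instead of assigning back to df['categories'] leaves the original DataFrame untouched, which may be preferable if the column is needed later in its raw form.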
CodePudding user response:
Let us try str.extractall, then value_counts:
out = df.categories.str.extractall("'([^']*)'")[0].value_counts()
out
science 2
food 2
plant 1
toy 1
animal 1
Name: 0, dtype: int64
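For anyone who wants to run this end to end, here is a small sketch of the extractall approach on the three sample rows from the question. It assumes every label is wrapped in single quotes inside one string per cell; the final step, which reshapes the counts into the two-column layout asked for in the question, and the names matches, counts and table are illustrative additions rather than part of the answer above.

```python
import pandas as pd

# Sample rows from the question. Assumption: labels are single-quoted and
# stored as one string per cell.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "categories": [
        "'food','toy','science'",
        "'animal'",
        "'plant','food','science'",
    ],
    "number": [1, 2, 5],
})

# extractall returns one row per regex match; the capture group lands in
# column 0, indexed by (original row, match number).
matches = df["categories"].str.extractall(r"'([^']*)'")

# Count how often each captured label occurs
counts = matches[0].value_counts()

# Reshape into the 'categories' / 'size' layout from the question
table = counts.rename_axis("categories").reset_index(name="size")
print(table)
```

Because the regex captures only the text between the quotes, no separate cleanup of the labels is needed. The order of labels with equal counts may vary between pandas versions.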