Count pattern frequency in pandas dataframe column that has multiple patterns

I have the dataframe below:


import pandas as pd

details = {
    'container_id': [1, 2, 3, 4, 5, 6],
    'container': ['black box', 'orange box', 'blue box', 'black box', 'blue box', 'white box'],
    'fruits': ['apples, black currant', 'oranges', 'peaches, oranges', 'apples', 'apples, peaches, oranges', 'black berries, peaches, oranges, apples'],
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(details)

I want to count how many times each fruit appears in the fruits column, with each fruit counted separately.

I tried this code:

df['fruits'].str.split(expand=True).stack().value_counts()

but "black" is counted twice, instead of "black currant" and "black berries" each being counted once.

CodePudding user response：

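You can keep your approach, but pass the delimiter explicitly. With no argument, str.split splits on whitespace, which is why "black currant" and "black berries" are broken into separate words. Be aware that splitting on ',' alone leaves leading whitespace on every item after the first, unless the delimiter is a comma followed by a space. To be safe, add a str.strip step: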

df['fruits'].str.split(',', expand=False).explode().str.strip().value_counts()

Your way, with the delimiter set to ', ' (you can also call str.strip after stack if you want to):

df['fruits'].str.split(', ', expand=True).stack().value_counts()

Output:

apples           4
oranges          4
peaches          3
black currant    1
black berries    1
Name: fruits, dtype: int64
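
To see why the delimiter matters, here is a minimal sketch that rebuilds only the fruits column from the question and compares the two splits. The variable names (fruits, whitespace_tokens, comma_tokens, counts) are made up for illustration and are not part of the original code.

```python
import pandas as pd

# Same 'fruits' values as in the question, as a standalone Series.
fruits = pd.Series(
    [
        "apples, black currant",
        "oranges",
        "peaches, oranges",
        "apples",
        "apples, peaches, oranges",
        "black berries, peaches, oranges, apples",
    ],
    name="fruits",
)

# No delimiter: split on whitespace. Multi-word fruits fall apart and
# the commas stay attached to the preceding word.
whitespace_tokens = fruits.str.split().explode()
print(whitespace_tokens.tolist()[:3])  # ['apples,', 'black', 'currant']

# Delimiter ',': multi-word fruits survive, but every item after the
# first one in a row keeps its leading space.
comma_tokens = fruits.str.split(",").explode()
print(comma_tokens.tolist()[:2])  # ['apples', ' black currant']

# Stripping the whitespace before counting gives one entry per fruit.
counts = comma_tokens.str.strip().value_counts()
print(counts)
```

Stripping after the split is the more forgiving option if the spacing after the commas is not consistent across rows.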

CodePudding user response：

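Split on a comma followed by an optional whitespace character, written as a regular expression (a raw string keeps Python from treating \s as a string escape):

df['fruits'].str.split(r',\s?', expand=True).stack().value_counts()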

Output:

apples           4
oranges          4
peaches          3
black currant    1
black berries    1
dtype: int64
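
If you want to check what the pattern actually matches, here is a rough sketch using only Python's standard re module and collections.Counter instead of pandas. It is a cross-check under the assumption that the rows look like the ones in the question, not a replacement for the one-liner above; rows, pattern, tokens and counts are illustrative names.

```python
import re
from collections import Counter

# The 'fruits' values from the question as plain strings.
rows = [
    "apples, black currant",
    "oranges",
    "peaches, oranges",
    "apples",
    "apples, peaches, oranges",
    "black berries, peaches, oranges, apples",
]

# A comma followed by at most one whitespace character.
pattern = re.compile(r",\s?")

# Split every row and flatten the pieces into one list of fruit names.
tokens = [token for row in rows for token in pattern.split(row)]
counts = Counter(tokens)
print(counts["apples"], counts["black currant"], counts["black berries"])  # 4 1 1

# Caveat: \s? consumes only one whitespace character, so a double space
# after a comma leaves a leading space behind; \s* would consume all of it.
print(pattern.split("a,  b"))         # ['a', ' b']
print(re.split(r",\s*", "a,  b"))     # ['a', 'b']
```

For data with irregular spacing, `\s*` or a trailing str.strip is the safer choice.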