How can I count instances of a string in a dataframe column of lists that matches the string of a co-CodePudding

I have a dataframe containing a column of produce and a column of a list of colors the produce comes in:

import pandas as pd

data = {'produce':['zucchini','apple','citrus','banana','pear'],
      'colors':['green, yellow','green, red, yellow','orange, yellow ,green','yellow','green, yellow, brown']}
df = pd.DataFrame(data)
print(df)

Dataframe looks like:

    produce                 colors
0  zucchini          green, yellow
1     apple     green, red, yellow
2    citrus  orange, yellow, green
3    banana                 yellow
4      pear   green, yellow, brown

I am trying to create a second dataframe with each color, and count the number of columns in the first dataframe that have that color. I am able to get the unique list of colors into a dataframe:

#Create Dataframe with a column of unique values
unique_colors = df['colors'].str.split(",").explode().unique()
df2 = pd.DataFrame()
df2['Color'] = unique_colors
print(df2)

But some of the colors repeat some of the time:

     Color
0    green
1   yellow
2      red
3   orange
4    green
5   yellow
6    brown

and I am unable to find a way to add a column that counts the instances in the other dataframe. I have tried:

#df['Count'] = data['colors'] == df2['Color']
df['Count'] = ()
for i in df2['Color']:
      count=0
      if df["colors"].str.contains(i):
            count 1
      df['Count']=count

but I get the error "ValueError: Length of values (0) does not match length of index (5)"

How can I

make sure values aren't repeated in the list, and
count the instances of the color in the other dataframe

(This is a simplification of a much larger dataframe, so I can't just edit values in the first dataframe to fix the unique color issue).

CodePudding user response：

You need consider the space around , when split. To count the occurrence of color, you can use Series.value_counts().

out = (df['colors'].str.split(' *, *')
       .explode().value_counts()
       .to_frame('Count')
       .rename_axis('Color')
       .reset_index())

print(out)

    Color  Count
0  yellow      5
1   green      4
2     red      1
3   brown      1
4  orange      1

CodePudding user response：

Proposed script

import operator

y_c = (df['colors'].agg(lambda x: [e.strip() for e in x.split(',')])
                   .explode()
       )

clrs = pd.DataFrame.from_dict({c: [operator.countOf(y_c, c)] for c in y_c.unique()})

Two presentations for the result

1 - Horizontal :

print(clrs.rename(index={0:'count'}))

#        green  yellow  red  orange  brown
# count      4       5    1       1      1

2- Vertical :

print(clrs.T.rename(columns={0:'count'}))

#         count
# green       4
# yellow      5
# red         1
# orange      1
# brown       1