I have a dataframe containing a column of produce and a column of a list of colors the produce comes in:
import pandas as pd
data = {'produce':['zucchini','apple','citrus','banana','pear'],
'colors':['green, yellow','green, red, yellow','orange, yellow ,green','yellow','green, yellow, brown']}
df = pd.DataFrame(data)
print(df)
Dataframe looks like:
produce colors
0 zucchini green, yellow
1 apple green, red, yellow
2 citrus orange, yellow, green
3 banana yellow
4 pear green, yellow, brown
I am trying to create a second dataframe with each color, and count the number of columns in the first dataframe that have that color. I am able to get the unique list of colors into a dataframe:
#Create Dataframe with a column of unique values
unique_colors = df['colors'].str.split(",").explode().unique()
df2 = pd.DataFrame()
df2['Color'] = unique_colors
print(df2)
But some of the colors repeat some of the time:
Color
0 green
1 yellow
2 red
3 orange
4 green
5 yellow
6 brown
and I am unable to find a way to add a column that counts the instances in the other dataframe. I have tried:
#df['Count'] = data['colors'] == df2['Color']
df['Count'] = ()
for i in df2['Color']:
count=0
if df["colors"].str.contains(i):
count 1
df['Count']=count
but I get the error "ValueError: Length of values (0) does not match length of index (5)"
How can I
- make sure values aren't repeated in the list, and
- count the instances of the color in the other dataframe
(This is a simplification of a much larger dataframe, so I can't just edit values in the first dataframe to fix the unique color issue).
CodePudding user response:
You need consider the space around ,
when split. To count the occurrence of color, you can use Series.value_counts()
.
out = (df['colors'].str.split(' *, *')
.explode().value_counts()
.to_frame('Count')
.rename_axis('Color')
.reset_index())
print(out)
Color Count
0 yellow 5
1 green 4
2 red 1
3 brown 1
4 orange 1
CodePudding user response:
Proposed script
import operator
y_c = (df['colors'].agg(lambda x: [e.strip() for e in x.split(',')])
.explode()
)
clrs = pd.DataFrame.from_dict({c: [operator.countOf(y_c, c)] for c in y_c.unique()})
Two presentations for the result
1 - Horizontal :
print(clrs.rename(index={0:'count'}))
# green yellow red orange brown
# count 4 5 1 1 1
2- Vertical :
print(clrs.T.rename(columns={0:'count'}))
# count
# green 4
# yellow 5
# red 1
# orange 1
# brown 1