I have this dataframe
df = pd.DataFrame({
"col1": ["Kev", "Kev", "Fr"],
"col2": ["Red; Purple", "Yellow; Purple; Red", "Red; Yellow"], }, index=["1", "2", "3"])
It'll look like this
col1 col2
1 Kev Red; Purple
2 Kev Yellow; Purple; Red
3 Fr Red; Yellow
I want to count all the items in col2 according to col1. In this case the final df will be like this:
col1 col2 count
1 Kev Red 2
2 Kev Purple 2
3 Kev Yellow 1
4 Fr Red 1
5 Fr Yellow 1
I tried using explode:
df2 = (df.explode(df.columns.tolist())
.apply(lambda col: col.str.split(';'))
.explode('col1')
.explode('col2'))
but that only gives me col1 and col2 of my desired dataframe, not the count. If I use crosstab
on df2, I'll get a very different result.
I managed to get the desired output with 2 nested for loops, but my dataframe is so big that it takes almost a minute loading the function. I want to avoid this solution.
CodePudding user response:
According to your example, you just need to explode col2
after spliting the strings.
Here is a simpler way using DataFrame.value_counts
import pandas as pd
df = pd.DataFrame({
"col1": ["Kev", "Kev", "Fr"],
"col2": ["Red; Purple", "Yellow; Purple; Red", "Red; Yellow"], }, index=["1", "2", "3"])
df2 = (
df.assign(col2=df['col2'].str.split('; '))
.explode('col2')
.value_counts()
.rename('count')
.reset_index()
)
Output:
>>> df2
col1 col2 count
0 Kev Purple 2
1 Kev Red 2
2 Fr Red 1
3 Fr Yellow 1
4 Kev Yellow 1
CodePudding user response:
After pd.crosstab
, you can try melt
df2 = (df.explode(df.columns.tolist())
.apply(lambda col: col.str.split('; ')) # <-- space here
.explode('col1')
.explode('col2'))
out = (pd.crosstab(df2['col1'], df2['col2'])
.melt(value_name='count', ignore_index=False)
.reset_index())
print(out)
col1 col2 count
0 Fr Purple 0
1 Kev Purple 2
2 Fr Red 1
3 Kev Red 2
4 Fr Yellow 1
5 Kev Yellow 1