Home > Software engineering >  Count columns with multiple values
Count columns with multiple values

Time:05-30

I have this dataframe

df = pd.DataFrame({
    "col1": ["Kev", "Kev", "Fr"],
    "col2": ["Red; Purple", "Yellow; Purple; Red", "Red; Yellow"], }, index=["1", "2", "3"])

It'll look like this

    col1 col2
1   Kev  Red; Purple
2   Kev  Yellow; Purple; Red
3   Fr   Red; Yellow

I want to count all the items in col2 according to col1. In this case the final df will be like this:

    col1 col2   count
1   Kev  Red    2
2   Kev  Purple 2
3   Kev  Yellow 1
4   Fr   Red    1
5   Fr   Yellow 1

I tried using explode:

df2 = (df.explode(df.columns.tolist())
      .apply(lambda col: col.str.split(';'))
      .explode('col1')
      .explode('col2'))

but that only gives me col1 and col2 of my desired dataframe, not the count. If I use crosstab on df2, I'll get a very different result.

I managed to get the desired output with 2 nested for loops, but my dataframe is so big that it takes almost a minute loading the function. I want to avoid this solution.

CodePudding user response:

According to your example, you just need to explode col2 after spliting the strings.

Here is a simpler way using DataFrame.value_counts

import pandas as pd

df = pd.DataFrame({
    "col1": ["Kev", "Kev", "Fr"],
    "col2": ["Red; Purple", "Yellow; Purple; Red", "Red; Yellow"], }, index=["1", "2", "3"])


df2 = (
    df.assign(col2=df['col2'].str.split('; '))
      .explode('col2')
      .value_counts()
      .rename('count')
      .reset_index()
)

Output:

>>> df2 

  col1    col2  count
0  Kev  Purple      2
1  Kev     Red      2
2   Fr     Red      1
3   Fr  Yellow      1
4  Kev  Yellow      1

CodePudding user response:

After pd.crosstab, you can try melt

df2 = (df.explode(df.columns.tolist())
      .apply(lambda col: col.str.split('; ')) # <-- space here
      .explode('col1')
      .explode('col2'))


out = (pd.crosstab(df2['col1'], df2['col2'])
       .melt(value_name='count', ignore_index=False)
       .reset_index())
print(out)

  col1    col2  count
0   Fr  Purple      0
1  Kev  Purple      2
2   Fr     Red      1
3  Kev     Red      2
4   Fr  Yellow      1
5  Kev  Yellow      1
  • Related