Home > Net >  pandas - split column with arrays into multiple columns and count values
pandas - split column with arrays into multiple columns and count values

Time:04-26

i have a pandas dataframe with columns that, themselves, contain np.array. Imagine having something like this:

import random
df = pd.DataFrame(data=[[[random.randint(1,7) for _ in range(10)] for _ in range(5)]], index=["col1"])
df = df.transpose()

which will result in a dataframe like this:

                              col1
0   [7, 7, 6, 7, 6, 5, 5, 1, 7, 4]
1   [4, 7, 5, 5, 6, 6, 5, 4, 7, 5]
2   [7, 2, 7, 7, 2, 7, 6, 7, 1, 2]
3   [5, 7, 1, 2, 6, 5, 4, 3, 5, 2]
4   [2, 3, 2, 6, 3, 3, 1, 1, 7, 7]

I want to expand the dataframe to a dataframe with columns ["col1",...."col7"] and count for each row the number of occurances.

The desired result should be an extended dataframe, containing integer values only.

    col1 col2 col3 col4 col5 col6 col7                         
0     1   0     0    1    2    2    4   
1     0   0     0    2    3    2    2
2     1   3     0    0    0    1    5 

My approach so far is pretty hard coded. I created col1,...col7 with 0 and after that I'm using iterrows() to count the occurances. This works well, but it's quite a lot of code and I'm sure there is a more elegant way to do this. Maybe something with .value_counts() for each array in a row?

Maybe someone can help me find it. Thanks

CodePudding user response:

np.random.seed(2022)

from collections import Counter
import numpy as np

df = pd.DataFrame(data=[[[np.random.randint(1,7) for _ in range(10)] for _ in range(5)]], 
                  index=["col1"])
df = df.transpose()

You can use Series.explode with SeriesGroupBy.value_counts and reshape by Series.unstack:

df1 = (df['col1'].explode()
                 .groupby(level=0)
                 .value_counts()
                 .unstack(fill_value=0)
                 .add_prefix('col')
                 .rename_axis(None, axis=1))
print (df1)
   col1  col2  col3  col4  col5  col6
0     4     2     1     0     1     2
1     3     2     0     4     0     1
2     3     1     3     2     0     1
3     1     1     3     0     1     4
4     1     1     1     1     3     3

Or use list comprehension with Counter and DataFrame constructor:

df1 = (pd.DataFrame([Counter(x) for x in df['col1']])
         .sort_index(axis=1)
         .fillna(0)
         .astype(int)
         .add_prefix('col'))
print (df1)
   col1  col2  col3  col4  col5  col6
0     4     2     1     0     1     2
1     3     2     0     4     0     1
2     3     1     3     2     0     1
3     1     1     3     0     1     4
4     1     1     1     1     3     3
  • Related