i have a pandas dataframe with columns that, themselves, contain np.array. Imagine having something like this:
import random
df = pd.DataFrame(data=[[[random.randint(1,7) for _ in range(10)] for _ in range(5)]], index=["col1"])
df = df.transpose()
which will result in a dataframe like this:
col1
0 [7, 7, 6, 7, 6, 5, 5, 1, 7, 4]
1 [4, 7, 5, 5, 6, 6, 5, 4, 7, 5]
2 [7, 2, 7, 7, 2, 7, 6, 7, 1, 2]
3 [5, 7, 1, 2, 6, 5, 4, 3, 5, 2]
4 [2, 3, 2, 6, 3, 3, 1, 1, 7, 7]
I want to expand the dataframe to a dataframe with columns ["col1",...."col7"] and count for each row the number of occurances.
The desired result should be an extended dataframe, containing integer values only.
col1 col2 col3 col4 col5 col6 col7
0 1 0 0 1 2 2 4
1 0 0 0 2 3 2 2
2 1 3 0 0 0 1 5
My approach so far is pretty hard coded. I created col1,...col7 with 0 and after that I'm using iterrows() to count the occurances. This works well, but it's quite a lot of code and I'm sure there is a more elegant way to do this. Maybe something with .value_counts() for each array in a row?
Maybe someone can help me find it. Thanks
CodePudding user response:
np.random.seed(2022)
from collections import Counter
import numpy as np
df = pd.DataFrame(data=[[[np.random.randint(1,7) for _ in range(10)] for _ in range(5)]],
index=["col1"])
df = df.transpose()
You can use Series.explode
with SeriesGroupBy.value_counts
and reshape by Series.unstack
:
df1 = (df['col1'].explode()
.groupby(level=0)
.value_counts()
.unstack(fill_value=0)
.add_prefix('col')
.rename_axis(None, axis=1))
print (df1)
col1 col2 col3 col4 col5 col6
0 4 2 1 0 1 2
1 3 2 0 4 0 1
2 3 1 3 2 0 1
3 1 1 3 0 1 4
4 1 1 1 1 3 3
Or use list comprehension with Counter
and DataFrame
constructor:
df1 = (pd.DataFrame([Counter(x) for x in df['col1']])
.sort_index(axis=1)
.fillna(0)
.astype(int)
.add_prefix('col'))
print (df1)
col1 col2 col3 col4 col5 col6
0 4 2 1 0 1 2
1 3 2 0 4 0 1
2 3 1 3 2 0 1
3 1 1 3 0 1 4
4 1 1 1 1 3 3