How to transform dataframe column containing list of values in to its own individual column with cou-CodePudding

I have a frame like this

presence_data = pd.DataFrame({
    "id": ["id1", "id2"],
    "presence": [
        ["A", "B", "C", "A"],
        ["G", "A", "B", "I", "B"],
    ]
})

id	presence
id1	[A, B, C, A]
id2	[G, A, B, I, B]

I want to transform above into something like this...

id	A	B	C	G	I
id1	2	1	1	0	0
id2	1	2	0	1	1

Currently, I have a approach where I iterate over rows and iterate over values in presence column and then create/update new columns with count based on the values encountered. I want to see if there is a better way.

CodePudding user response：

edited based on feedback from Henry Ecker in comments, might as well have the better answer here:

You can use pd.explode() to get everything within the lists to become separate rows, and then use pd.crosstab() to count the occurrences.

df = presence_data.explode('presence')
pd.crosstab(index=df['id'],columns=df['presence'])

This gave me the following:

presence  A  B  C  G  I
id
id1       2  1  1  0  0
id2       1  2  0  1  1

CodePudding user response：

from collections import Counter
(presence_data
 .set_index('id')
 .presence
 .map(Counter)
 .apply(pd.Series)
 .fillna(0, downcast='infer')
 .reset_index()
)

    id  A  B  C  G  I
0  id1  2  1  1  0  0
1  id2  1  2  0  1  1

Speedwise it is hard to say; it is usually more efficient to deal with python native data structures within python, yet this solution has a lot of method calls, which in itself are relatively expensive

Alternatively, you can create a new dataframe ( and reduce the number of method calls):

(pd.DataFrame(map(Counter, presence_data.presence), 
              index = presence_data.id)
   .fillna(0, downcast='infer')
   .reset_index()
)

    id  A  B  C  G  I
0  id1  2  1  1  0  0
1  id2  1  2  0  1  1

CodePudding user response：

You can use apply and value_counts. First we use the lists in your presence column to create new columns. We can then use axis=1 to get the row value counts.

df = pd.DataFrame(presence_data['presence'].tolist(), index=presence_data['id']).apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)

print(df)

    A   B   C   G   I
id                  
id1 2   1   1   0   0
id2 1   2   0   1   1

You can use this after if you want to have id as a column, rather than the index.

df.reset_index(inplace=True)

print(df)

    id  A   B   C   G   I
0   id1 2   1   1   0   0
1   id2 1   2   0   1   1