Home > Software design >  How to transform dataframe column containing list of values in to its own individual column with cou
How to transform dataframe column containing list of values in to its own individual column with cou

Time:11-28

I have a frame like this

presence_data = pd.DataFrame({
    "id": ["id1", "id2"],
    "presence": [
        ["A", "B", "C", "A"],
        ["G", "A", "B", "I", "B"],
    ]
})
id presence
id1 [A, B, C, A]
id2 [G, A, B, I, B]

I want to transform above into something like this...

id A B C G I
id1 2 1 1 0 0
id2 1 2 0 1 1

Currently, I have a approach where I iterate over rows and iterate over values in presence column and then create/update new columns with count based on the values encountered. I want to see if there is a better way.

CodePudding user response:

edited based on feedback from Henry Ecker in comments, might as well have the better answer here:

You can use pd.explode() to get everything within the lists to become separate rows, and then use pd.crosstab() to count the occurrences.

df = presence_data.explode('presence')
pd.crosstab(index=df['id'],columns=df['presence'])

This gave me the following:

presence  A  B  C  G  I
id
id1       2  1  1  0  0
id2       1  2  0  1  1

CodePudding user response:

from collections import Counter
(presence_data
 .set_index('id')
 .presence
 .map(Counter)
 .apply(pd.Series)
 .fillna(0, downcast='infer')
 .reset_index()
)

    id  A  B  C  G  I
0  id1  2  1  1  0  0
1  id2  1  2  0  1  1

Speedwise it is hard to say; it is usually more efficient to deal with python native data structures within python, yet this solution has a lot of method calls, which in itself are relatively expensive

Alternatively, you can create a new dataframe ( and reduce the number of method calls):

(pd.DataFrame(map(Counter, presence_data.presence), 
              index = presence_data.id)
   .fillna(0, downcast='infer')
   .reset_index()
)

    id  A  B  C  G  I
0  id1  2  1  1  0  0
1  id2  1  2  0  1  1

CodePudding user response:

You can use apply and value_counts. First we use the lists in your presence column to create new columns. We can then use axis=1 to get the row value counts.

df = pd.DataFrame(presence_data['presence'].tolist(), index=presence_data['id']).apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)

print(df)

    A   B   C   G   I
id                  
id1 2   1   1   0   0
id2 1   2   0   1   1

You can use this after if you want to have id as a column, rather than the index.

df.reset_index(inplace=True)

print(df)

    id  A   B   C   G   I
0   id1 2   1   1   0   0
1   id2 1   2   0   1   1
  • Related