I have a frame like this
presence_data = pd.DataFrame({
"id": ["id1", "id2"],
"presence": [
["A", "B", "C", "A"],
["G", "A", "B", "I", "B"],
]
})
id | presence |
---|---|
id1 | [A, B, C, A] |
id2 | [G, A, B, I, B] |
I want to transform above into something like this...
id | A | B | C | G | I |
---|---|---|---|---|---|
id1 | 2 | 1 | 1 | 0 | 0 |
id2 | 1 | 2 | 0 | 1 | 1 |
Currently, I have a approach where I iterate over rows and iterate over values in presence column and then create/update new columns with count based on the values encountered. I want to see if there is a better way.
CodePudding user response:
edited based on feedback from Henry Ecker in comments, might as well have the better answer here:
You can use pd.explode()
to get everything within the lists to become separate rows, and then use pd.crosstab()
to count the occurrences.
df = presence_data.explode('presence')
pd.crosstab(index=df['id'],columns=df['presence'])
This gave me the following:
presence A B C G I
id
id1 2 1 1 0 0
id2 1 2 0 1 1
CodePudding user response:
from collections import Counter
(presence_data
.set_index('id')
.presence
.map(Counter)
.apply(pd.Series)
.fillna(0, downcast='infer')
.reset_index()
)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1
Speedwise it is hard to say; it is usually more efficient to deal with python native data structures within python, yet this solution has a lot of method calls, which in itself are relatively expensive
Alternatively, you can create a new dataframe ( and reduce the number of method calls):
(pd.DataFrame(map(Counter, presence_data.presence),
index = presence_data.id)
.fillna(0, downcast='infer')
.reset_index()
)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1
CodePudding user response:
You can use apply and value_counts. First we use the lists in your presence column to create new columns. We can then use axis=1 to get the row value counts.
df = pd.DataFrame(presence_data['presence'].tolist(), index=presence_data['id']).apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)
print(df)
A B C G I
id
id1 2 1 1 0 0
id2 1 2 0 1 1
You can use this after if you want to have id as a column, rather than the index.
df.reset_index(inplace=True)
print(df)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1