I'm having an issue building a stack from the following df:
import pandas as pd
df = pd.DataFrame(
{
"Student": ['Daphne','Scooby','Daphne','Shaggy','Fred','Daphne'],
"window": [['Fred'],['Daphne'], [''], ['Daphne'],['Velma','Scrappy'],['Velma','Fred']]
}
)
which gives you:
Student window
0 Daphne [Fred]
1 Scooby [Daphne]
2 Daphne []
3 Shaggy [Daphne]
4 Fred [Velma, Scrappy]
5 Daphne [Velma, Fred]
I wanted to create a stack table that would show how many times each student has seen the other students in the window. I wanted the end result to be like this:
Daphne Fred Scrappy Velma
Daphne 0 2 0 1
Fred 0 0 1 1
Scrappy 0 0 0 0
Velma 0 0 0 0
I was thinking I could do something like this:
dfd = pd.get_dummies(df["window"].dropna().apply(pd.Series).stack())
which gives you:
Daphne Fred Scrappy Velma
1 0 1 0 0 0
2 0 0 0 0 0
3 0 1 0 0 0
4 0 0 0 0 1
1 0 0 1 0
5 0 0 0 0 1
1 0 1 0 0
and that is not going to give me the results I want once I implement:
xrf_df = dfd.T.dot(dfd)
which gives you
Daphne Fred Scrappy Velma
1 0 0 0 0
Daphne 0 2 0 0 0
Fred 0 0 2 0 1
Scrappy 0 0 0 1 1
Velma 0 0 1 1 2
I was thinking of using this to remove the empty spaces:
dfd = dfd.drop(0)
del dfd['']
but even that won't give me the right numbers in the right columns.
Could someone tell me what I'm doing wrong?
CodePudding user response:
one way of solving i this,
df.explode('window').pivot_table(index='Student', columns='window', aggfunc='size', fill_value=0)
O/P;
window Daphne Fred Scrappy Velma
Student
Daphne 1 0 2 0 1
Fred 0 0 0 1 1
Scooby 0 1 0 0 0
Shaggy 0 1 0 0 0
- Explode window column to remove list
- perform pivot with column as window, index as student and count as aggrigate function and finally fill nan with 0
CodePudding user response:
If I understand correctly you need .explode
with .value_counts()
and .unstack()
new = df.explode('window')\
.groupby('Student')['window'].value_counts().unstack(-1,fill_value=0)
print(new)
window Daphne Fred Scrappy Velma
Student
Daphne 1 0 2 0 1
Fred 0 0 0 1 1
Scooby 0 1 0 0 0
Shaggy 0 1 0 0 0
CodePudding user response:
Here is a way using str.get_dummies()
df.set_index('Student')['window'].str.join('|').str.get_dummies().groupby(level=0).sum()
Output:
Daphne Fred Scrappy Velma
Student
Daphne 0 2 0 1
Fred 0 0 1 1
Scooby 1 0 0 0
Shaggy 1 0 0 0