I have a data frame in pandas and it looks like this
Terrain Distance Rank
77 Dirt 100 1
15 Grass 120 1
82 Road 180 1
4 Rock 100 1
107 Rock 120 1
70 Rock 200 1
115 Rock 200 1
37 Snow 160 1
57 Snow 160 1
95 Snow 160 1
193 Track 100 1
32 Dirt 100 2
97 Grass 140 2
51 Road 160 2
125 Road 180 2
90 Rock 140 2
60 Snow 120 2
78 Track 100 2
205 Track 120 2
33 Dirt 100 3
17 Dirt 140 3
53 Grass 100 3
161 Grass 100 3
43 Grass 160 3
81 Grass 160 3
103 Road 120 3
208 Road 160 3
58 Road 180 3
44 Rock 120 3
66 Rock 140 3
101 Rock 140 3
88 Rock 180 3
122 Rock 180 3
119 Sand 140 3
5 Sand 160 3
84 Snow 140 3
21 Snow 160 3
111 Snow 180 3
140 Track 140 3
29 Track 180 3
39 Track 200 3
2 Dirt 100 4
31 Dirt 140 4
102 Grass 140 4
134 Grass 160 4
108 Road 120 4
118 Road 120 4
I am able to create the crosstab with this code
### frequency table using crosstab()function
my_crosstab = pd.crosstab(index=df["Terrain"],
columns=df["Distance"],
margins=True) # Include row and column totals
my_crosstab
My cross table would then looks like this
Distance 100 120 140 160 180 200 All
Terrain
Dirt 12 5 9 5 4 5 40
Grass 4 5 8 8 2 6 33
Road 6 5 4 7 6 5 33
Rock 8 4 6 2 10 6 36
Sand 4 4 4 7 5 2 26
Snow 5 10 11 11 5 4 46
Track 9 6 4 6 6 4 35
All 48 39 46 46 38 32 249
Basically, I have 7 terrains and 6 distances. I would like to fill the crosstab table with the number of times that I came in first place for each cell in the table
Distance 100 120 140 160 180 200
Terrain
Dirt 1
Grass 1
Road 1
Rock 1 1 2
Sand
Snow 3
Track 1
1st place frequency based on Specific Terrain & Distance
CodePudding user response:
One approach would be to filter the columns based on where Rank
is (eq) 1
:
mask = df['Rank'].eq(1) # Reuseable boolean index
my_crosstab = pd.crosstab(index=df.loc[mask, "Terrain"],
columns=df.loc[mask, "Distance"],
margins=True)
my_crosstab
:
Distance 100 120 160 180 200 All
Terrain
Dirt 1 0 0 0 0 1
Grass 0 1 0 0 0 1
Road 0 0 0 1 0 1
Rock 1 1 0 0 2 4
Snow 0 0 3 0 0 3
Track 1 0 0 0 0 1
All 3 2 3 1 2 11
If wanting to restore all possible Distance and Terrain values we can reindex both index and columns:
mask = df['Rank'].eq(1)
my_crosstab = pd.crosstab(
index=df.loc[mask, "Terrain"],
columns=df.loc[mask, "Distance"],
margins=True
).reindex(
index=[*df['Terrain'].unique(), 'All'],
columns=[*np.sort(df['Distance'].unique()), 'All'], # Sorting just for aesthetics
fill_value=0
)
my_crosstab
:
Distance 100 120 140 160 180 200 All
Terrain
Dirt 1 0 0 0 0 0 1
Grass 0 1 0 0 0 0 1
Road 0 0 0 0 1 0 1
Rock 1 1 0 0 0 2 4
Snow 0 0 0 3 0 0 3
Track 1 0 0 0 0 0 1
Sand 0 0 0 0 0 0 0
All 3 2 0 3 1 2 11
If wanting spaces instead of zeros can follow up with a mask call:
mask = df['Rank'].eq(1)
my_crosstab = pd.crosstab(
index=df.loc[mask, "Terrain"],
columns=df.loc[mask, "Distance"],
margins=True
).reindex(
index=[*df['Terrain'].unique(), 'All'],
columns=[*np.sort(df['Distance'].unique()), 'All'],
fill_value=0
).mask(lambda df_: df_.eq(0), '')
my_crosstab
:
Distance 100 120 140 160 180 200 All
Terrain
Dirt 1 1
Grass 1 1
Road 1 1
Rock 1 1 2 4
Snow 3 3
Track 1 1
Sand
All 3 2 3 1 2 11
Setup and versions:
import numpy as np # version 1.23.1
import pandas as pd # version 1.4.3
df = pd.DataFrame({
'Terrain': ['Dirt', 'Grass', 'Road', 'Rock', 'Rock', 'Rock', 'Rock', 'Snow',
'Snow', 'Snow', 'Track', 'Dirt', 'Grass', 'Road', 'Road',
'Rock', 'Snow', 'Track', 'Track', 'Dirt', 'Dirt', 'Grass',
'Grass', 'Grass', 'Grass', 'Road', 'Road', 'Road', 'Rock',
'Rock', 'Rock', 'Rock', 'Rock', 'Sand', 'Sand', 'Snow', 'Snow',
'Snow', 'Track', 'Track', 'Track', 'Dirt', 'Dirt', 'Grass',
'Grass', 'Road', 'Road'],
'Distance': [100, 120, 180, 100, 120, 200, 200, 160, 160, 160, 100, 100,
140, 160, 180, 140, 120, 100, 120, 100, 140, 100, 100, 160,
160, 120, 160, 180, 120, 140, 140, 180, 180, 140, 160, 140,
160, 180, 140, 180, 200, 100, 140, 140, 160, 120, 120],
'Rank': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4,
4, 4, 4]
})
CodePudding user response:
df.Terrain = df.Terrain.astype('category')
df.Distance = df.Distance.astype('category')
out = (df[df.Rank.eq(1)]
.pivot_table(index='Terrain',
columns='Distance',
aggfunc='value_counts')
.reset_index(-1, drop=True))
print(out)
Output:
Distance 100 120 140 160 180 200
Terrain
Dirt 1 0 0 0 0 0
Grass 0 1 0 0 0 0
Road 0 0 0 0 1 0
Rock 1 1 0 0 0 2
Sand 0 0 0 0 0 0
Snow 0 0 0 3 0 0
Track 1 0 0 0 0 0
>>> out.replace(0, '')
Distance 100 120 140 160 180 200
Terrain
Dirt 1
Grass 1
Road 1
Rock 1 1 2
Sand
Snow 3
Track 1