Home > OS >  pandas data frame Podium crosstab frequency
pandas data frame Podium crosstab frequency

Time:07-24

I have a data frame in pandas and it looks like this

    Terrain  Distance  Rank
77     Dirt       100     1
15    Grass       120     1
82     Road       180     1
4      Rock       100     1
107    Rock       120     1
70     Rock       200     1
115    Rock       200     1
37     Snow       160     1
57     Snow       160     1
95     Snow       160     1
193   Track       100     1
32     Dirt       100     2
97    Grass       140     2
51     Road       160     2
125    Road       180     2
90     Rock       140     2
60     Snow       120     2
78    Track       100     2
205   Track       120     2
33     Dirt       100     3
17     Dirt       140     3
53    Grass       100     3
161   Grass       100     3
43    Grass       160     3
81    Grass       160     3
103    Road       120     3
208    Road       160     3
58     Road       180     3
44     Rock       120     3
66     Rock       140     3
101    Rock       140     3
88     Rock       180     3
122    Rock       180     3
119    Sand       140     3
5      Sand       160     3
84     Snow       140     3
21     Snow       160     3
111    Snow       180     3
140   Track       140     3
29    Track       180     3
39    Track       200     3
2      Dirt       100     4
31     Dirt       140     4
102   Grass       140     4
134   Grass       160     4
108    Road       120     4
118    Road       120     4

I am able to create the crosstab with this code

### frequency table using crosstab()function
my_crosstab = pd.crosstab(index=df["Terrain"], 
                            columns=df["Distance"],
                             margins=True)   # Include row and column totals
my_crosstab

My cross table would then looks like this

Distance    100 120 140 160 180 200 All
Terrain                         
Dirt    12  5   9   5   4   5   40
Grass   4   5   8   8   2   6   33
Road    6   5   4   7   6   5   33
Rock    8   4   6   2   10  6   36
Sand    4   4   4   7   5   2   26
Snow    5   10  11  11  5   4   46
Track   9   6   4   6   6   4   35
All    48   39  46  46  38  32  249

Basically, I have 7 terrains and 6 distances. I would like to fill the crosstab table with the number of times that I came in first place for each cell in the table

Distance    100 120 140 160 180 200 
Terrain                         
Dirt        1  
Grass            1
Road                         1
Rock        1    1               2                     
Sand    
Snow                    3 
Track       1

1st place frequency based on Specific Terrain & Distance

CodePudding user response:

One approach would be to filter the columns based on where Rank is (eq) 1:

mask = df['Rank'].eq(1)  # Reuseable boolean index
my_crosstab = pd.crosstab(index=df.loc[mask, "Terrain"],
                          columns=df.loc[mask, "Distance"],
                          margins=True)

my_crosstab:

Distance  100  120  160  180  200  All
Terrain                               
Dirt        1    0    0    0    0    1
Grass       0    1    0    0    0    1
Road        0    0    0    1    0    1
Rock        1    1    0    0    2    4
Snow        0    0    3    0    0    3
Track       1    0    0    0    0    1
All         3    2    3    1    2   11

If wanting to restore all possible Distance and Terrain values we can reindex both index and columns:

mask = df['Rank'].eq(1)
my_crosstab = pd.crosstab(
    index=df.loc[mask, "Terrain"],
    columns=df.loc[mask, "Distance"],
    margins=True
).reindex(
    index=[*df['Terrain'].unique(), 'All'],
    columns=[*np.sort(df['Distance'].unique()), 'All'],  # Sorting just for aesthetics
    fill_value=0
)

my_crosstab:

Distance  100  120  140  160  180  200  All
Terrain                                    
Dirt        1    0    0    0    0    0    1
Grass       0    1    0    0    0    0    1
Road        0    0    0    0    1    0    1
Rock        1    1    0    0    0    2    4
Snow        0    0    0    3    0    0    3
Track       1    0    0    0    0    0    1
Sand        0    0    0    0    0    0    0
All         3    2    0    3    1    2   11

If wanting spaces instead of zeros can follow up with a mask call:

mask = df['Rank'].eq(1)
my_crosstab = pd.crosstab(
    index=df.loc[mask, "Terrain"],
    columns=df.loc[mask, "Distance"],
    margins=True
).reindex(
    index=[*df['Terrain'].unique(), 'All'],
    columns=[*np.sort(df['Distance'].unique()), 'All'],
    fill_value=0
).mask(lambda df_: df_.eq(0), '')

my_crosstab:

Distance 100 120 140 160 180 200 All
Terrain                             
Dirt       1                       1
Grass          1                   1
Road                       1       1
Rock       1   1               2   4
Snow                   3           3
Track      1                       1
Sand                                
All        3   2       3   1   2  11

Setup and versions:

import numpy as np  # version 1.23.1
import pandas as pd  # version 1.4.3

df = pd.DataFrame({
    'Terrain': ['Dirt', 'Grass', 'Road', 'Rock', 'Rock', 'Rock', 'Rock', 'Snow',
                'Snow', 'Snow', 'Track', 'Dirt', 'Grass', 'Road', 'Road',
                'Rock', 'Snow', 'Track', 'Track', 'Dirt', 'Dirt', 'Grass',
                'Grass', 'Grass', 'Grass', 'Road', 'Road', 'Road', 'Rock',
                'Rock', 'Rock', 'Rock', 'Rock', 'Sand', 'Sand', 'Snow', 'Snow',
                'Snow', 'Track', 'Track', 'Track', 'Dirt', 'Dirt', 'Grass',
                'Grass', 'Road', 'Road'],
    'Distance': [100, 120, 180, 100, 120, 200, 200, 160, 160, 160, 100, 100,
                 140, 160, 180, 140, 120, 100, 120, 100, 140, 100, 100, 160,
                 160, 120, 160, 180, 120, 140, 140, 180, 180, 140, 160, 140,
                 160, 180, 140, 180, 200, 100, 140, 140, 160, 120, 120],
    'Rank': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4,
             4, 4, 4]
})

CodePudding user response:

df.Terrain = df.Terrain.astype('category')
df.Distance = df.Distance.astype('category')
out = (df[df.Rank.eq(1)]
         .pivot_table(index='Terrain', 
                      columns='Distance', 
                      aggfunc='value_counts')
         .reset_index(-1, drop=True))
print(out)

Output:

Distance  100  120  140  160  180  200
Terrain
Dirt        1    0    0    0    0    0
Grass       0    1    0    0    0    0
Road        0    0    0    0    1    0
Rock        1    1    0    0    0    2
Sand        0    0    0    0    0    0
Snow        0    0    0    3    0    0
Track       1    0    0    0    0    0
>>> out.replace(0, '')
Distance 100 120 140 160 180 200
Terrain
Dirt       1
Grass          1
Road                       1
Rock       1   1               2
Sand
Snow                   3
Track      1
  • Related