Pandas value_counts for a column based on value of different column-CodePudding

I have a dataframe 'students' that looks like the following:

         Cumulative.GPA  Athlete
     0   3.9             Yes
     1   3.3             Yes
     2   4.0             No
     3   3.6             Yes

I'm trying to get a value_counts table of GPAs separated into two columns: 1 for athletes and 1 for non-athletes. Output should look like this (with leftmost column obviously continuing down to 0.0)

               Cumulative.GPA.athlete   Cumulative.GPA.nonathlete
        4.0    0                        1
        3.9    1                        0
        3.8    0                        0
        3.7    0                        0
        3.6    1                        0
        3.5    0                        0
        3.4    0                        0
        3.3    1                        0

CodePudding user response：

We can use crosstab to count the frequencies of Athlete vs non-athlete GPA then reindex to add missing GPA increments then rename and rename_axis to make the output match the expected:

res = (
    pd.crosstab(df['Cumulative.GPA'], df['Athlete'])
        .reindex(index=np.arange(40, 33 - 1, -1) / 10,
                 columns=['Yes', 'No'],
                 fill_value=0)
        .rename(columns={'Yes': 'Cumulative.GPA.athlete',
                         'No': 'Cumulative.GPA.nonathlete'})
        .rename_axis(index=None, columns=None)
)

*Using int steps to avoid any precision issues

res:

     Cumulative.GPA.athlete  Cumulative.GPA.nonathlete
4.0                       0                          1
3.9                       1                          0
3.8                       0                          0
3.7                       0                          0
3.6                       1                          0
3.5                       0                          0
3.4                       0                          0
3.3                       1                          0

If wanting the shown GPA range to be dynamic, instead of hardcoded bounds, Series.max and Series.min can be used to set the reindexing bounds as recommended by U12-Forward:

res = (
    pd.crosstab(df['Cumulative.GPA'], df['Athlete'])
        .reindex(index=np.arange(int(df['Cumulative.GPA'].max() * 10),
                                 int(df['Cumulative.GPA'].min() * 10) - 1,
                                 -1) / 10,
                 columns=['Yes', 'No'],
                 fill_value=0)
        .rename(columns={'Yes': 'Cumulative.GPA.athlete',
                         'No': 'Cumulative.GPA.nonathlete'})
        .rename_axis(index=None, columns=None)
)

Setup Used:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Cumulative.GPA': [3.9, 3.3, 4.0, 3.6],
    'Athlete': ['Yes', 'Yes', 'No', 'Yes']
})

CodePudding user response：

Try using get_dummies with set_index and reindex:

>>> pd.get_dummies(df, 'Athlete').set_index('Cumulative.GPA').reindex(np.round(np.arange(df['Cumulative.GPA'].min(), df['Cumulative.GPA'].max(), 0.1), 1)[::-1], fill_value=0).reset_index()
   Cumulative.GPA  Athlete_No  Athlete_Yes
0             4.0           1            0
1             3.9           0            1
2             3.8           0            0
3             3.7           0            0
4             3.6           0            1
5             3.5           0            0
6             3.4           0            0
7             3.3           0            1
>>>