Home > OS >  Divide mixed-type dataframe columns- retain string value if either column value is a string
Divide mixed-type dataframe columns- retain string value if either column value is a string

Time:07-30

I have a dataframe that looks like this (this is a sample, the real dataframe has hundreds of columns):

0 Year Grade3MathPass Grade3MathTest Grade4MathPass Grade4MathTest
1 2019                                       
2 2020 ***            15              5              15
3 2021 ***            ***                            12
4 2022 3              10              4              10

What I want to do is divide all of the Pass columns by the corresponding Test columns (Grade3, Grade4, etc.) and store the result in a new column for all rows. If both columns have numbers, the result should be a decimal. If either column has a ‘***’ (meaning insufficient n-size), the result should be ‘NA’ (or '***'). If the Pass column is blank (it means '0' passing), the result should be ‘0’. If both columns are blank (meaning no students tested, no students passed), the result should be ‘None.’

The resulting dataframe should look like this:

0 Year Grade3MathPass Grade3MathTest Grade3Average Grade4MathPass Grade4MathTest Grade4Average
1 2019                                 
2 2020 ***            15             NA            5              15             .33333
3 2021 ***            ***            NA                           12             0
4 2022 3              10             .30000        4              10             .40000

Among things I have tried:

  • making a copy of the original dataframe with just the ‘***’ strings, coercing the original dataframe to numbers, performing the calculation and then using fillna to add the '***' back in;
  • using a conditional loop to test each cell for a string before the calculation;
  • using try/except to test for TypeError

and other things I can't even remember.

None of them work. The primary issue is that I need to retain the '***' value, because it is distinct from '0' or None. Any suggestions?

CodePudding user response:

Althought not so readability friendly, because of long chained operations, I think it's the most straightforward solution. The key to working with the mixed types is to use pd.to_numeric() on the conditions where both values are expected to be numeric, therefore we can generate numerical output but at the same type keep the other ones when a numeric output is not expected:

test['Grade3Average'] = np.where(
    (test['Grade3MathPass'] == "***") |(test['Grade3MathTest'] == "***"),"***",
                        np.where((test['Grade3MathPass'].isna()) & (test['Grade3MathPass'].isna()),None,
                                 np.where(test['Grade3MathPass'].isna(),0,
                                          pd.to_numeric(test['Grade3MathPass'],errors='coerce')/pd.to_numeric(test['Grade3MathTest'],errors='coerce'))))

I called your dataframe test and only created one column. I suggest creating a function and passing it to the specific columns you need:

   0  Year Grade3MathPass  ... Grade4MathTest  Test  Grade3Average
0  1  2019           None  ...            NaN  None           None
1  2  2020            ***  ...           15.0   ***            ***
2  3  2021            ***  ...            NaN   ***            ***
3  4  2022              3  ...           10.0   0.3            0.3

CodePudding user response:

Assuming your all columns are in the same format of Grade{number}MathPass, Grade{number}MathTest, etc. This may work:

end = 4
for i in range(3, end  1): # your end may change here 
    if type(df[f'Grade{i}MathPass']) == int:
        df[f'Grade{i}Average'] = df[f'Grade{i}MathPass'] / df[f'Grade{i}MathTest']
    else:
        df[f'Grade{i}Average'] = 'NA'
  • Related