I have a dataframe that looks like this (this is a sample, the real dataframe has hundreds of columns):
0 Year Grade3MathPass Grade3MathTest Grade4MathPass Grade4MathTest
1 2019
2 2020 *** 15 5 15
3 2021 *** *** 12
4 2022 3 10 4 10
What I want to do is divide all of the Pass columns by the corresponding Test columns (Grade3, Grade4, etc.) and store the result in a new column for all rows. If both columns have numbers, the result should be a decimal. If either column has a ‘***’ (meaning insufficient n-size), the result should be ‘NA’ (or '***'). If the Pass column is blank (it means '0' passing), the result should be ‘0’. If both columns are blank (meaning no students tested, no students passed), the result should be ‘None.’
The resulting dataframe should look like this:
0 Year Grade3MathPass Grade3MathTest Grade3Average Grade4MathPass Grade4MathTest Grade4Average
1 2019
2 2020 *** 15 NA 5 15 .33333
3 2021 *** *** NA 12 0
4 2022 3 10 .30000 4 10 .40000
Among things I have tried:
- making a copy of the original dataframe with just the ‘***’ strings, coercing the original dataframe to numbers, performing the calculation and then using fillna to add the '***' back in;
- using a conditional loop to test each cell for a string before the calculation;
- using try/except to test for TypeError
and other things I can't even remember.
None of them work. The primary issue is that I need to retain the '***' value, because it is distinct from '0' or None. Any suggestions?
CodePudding user response:
Althought not so readability friendly, because of long chained operations, I think it's the most straightforward solution. The key to working with the mixed types is to use pd.to_numeric() on the conditions where both values are expected to be numeric, therefore we can generate numerical output but at the same type keep the other ones when a numeric output is not expected:
test['Grade3Average'] = np.where(
(test['Grade3MathPass'] == "***") |(test['Grade3MathTest'] == "***"),"***",
np.where((test['Grade3MathPass'].isna()) & (test['Grade3MathPass'].isna()),None,
np.where(test['Grade3MathPass'].isna(),0,
pd.to_numeric(test['Grade3MathPass'],errors='coerce')/pd.to_numeric(test['Grade3MathTest'],errors='coerce'))))
I called your dataframe test
and only created one column. I suggest creating a function and passing it to the specific columns you need:
0 Year Grade3MathPass ... Grade4MathTest Test Grade3Average
0 1 2019 None ... NaN None None
1 2 2020 *** ... 15.0 *** ***
2 3 2021 *** ... NaN *** ***
3 4 2022 3 ... 10.0 0.3 0.3
CodePudding user response:
Assuming your all columns are in the same format of Grade{number}MathPass, Grade{number}MathTest, etc. This may work:
end = 4
for i in range(3, end 1): # your end may change here
if type(df[f'Grade{i}MathPass']) == int:
df[f'Grade{i}Average'] = df[f'Grade{i}MathPass'] / df[f'Grade{i}MathTest']
else:
df[f'Grade{i}Average'] = 'NA'