Home > Back-end >  How to create a new column in a pandas df based on multiple conditions?
How to create a new column in a pandas df based on multiple conditions?

Time:05-04

I am working on a programming assignment and am seeking some help out here as the course openly encourages us to.

In the course of solving one of the problems, I need to find out from a dataset of children having received various vaccinations and their medical history the subset of children who both have (a) gotten chickenpox (HAD_CPOX, yes or no) AND (b) were vaccinated against chickenpox (P_NUMVRC, number of doses of chickenpox vaccine received)

The question is asking us to generate ratios by sex. So far I have:

  • Extracted the csv and turned it into a pandas dataframe
  • Created a subset dataframe titled 'subset_chickenpox' from the previous one containing only the relevant information (HAD_CPOX, SEX, P_NUMVRC)
  • Further split each of these subsets between male and females resulting in the dataframes subset_COP_male and subset_COP_female and dropped NAs

I am now seeking to create a new column 'failedvax' in each of these subsets based on the following logic, containing two conditions:

  • If the column HAD_CPOX is equal to 1 AND P_NUMVRC is not 0, then failedvax is True.
  • Else failedvax is False.

I have tried various approaches, first using np.where which generated an error, and now I tried a different approach using a function, which also yields an error.

 #Chickenpox question
#Relevant variable in original df is HAD_CPOX, sex is SEX, having been vacced is P_NUMVRC 
#So first we subset for all these selecting all rows where df columns are these.
subset_chickenpox=df.loc[:,['HAD_CPOX', 'SEX', 'P_NUMVRC']]
#We are asked to calculate the ratio of children who were vaccinated but did get it 
#versus the ratio of children who were vaccinated and did not get it
#In other words we want the ratio of rows where HAD_CPOX=1 and P_NUMVRC is 0 
#vs the ratio of columns where HAD_CPOX=2 and P_NUMVRC != 0
subset_chickenpox.dropna()
#Because question by sex we do sex subsets
subset_COP_male=subset_chickenpox[(subset_chickenpox['SEX']==1)]
subset_COP_female=subset_chickenpox[(subset_chickenpox['SEX']==2)]
subset_COP_male.dropna()
subset_COP_female.dropna()
#I will now calculate a new column: failed_vax
#First we need to create a function to hold our conditional logic
#If HAD_CPOX=1 AND P_NUMVRC != 0 then failed_vax is true
def failed_vax(x):
    if subset_COP_male['HAD_CPOX']==1 & subset_COP_male['P_NUMVRC']!=0:
        return True
    else:
        return False
    
subset_COP_male['failed_vax'] = subset_COP_male.apply(failed_vax)

From this I get

``` TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool] ``` 

An example of the data (from subset_COP_female after dropping NAS) would be

``` 
INDEX HAD_CPOX  SEX P_NUMVRC
3   2   2   1.0
4   2   2   0.0
6   2   2   1.0
7   2   2   0.0
20  2   2   1.0
... ... ... ...
28437   2   2   1.0
28440   2   2   1.0
28446   2   2   1.0
28452   2   2   1.0
28454   2   2   1.0
``` 

Thank you very much in advance! Would appreciate any pointers towards creating a new column based on the logic explained above. It's also not the first time I run into such a problem so I'd love to understand a bit more about why my code isn't working.

Best regards to all

CodePudding user response:

It's as simple as:

subset_COP_male['failed_vax'] = subset_COP_male['HAD_CPOX'].eq(1) & subset_COP_male['P_NUMVRC'].ne(0)

CodePudding user response:

You need to add your new column before you create your subsets.
This should work:
First add a new column 'failedvax' filled with 'False'.
Then apply the condition your asking, assign 'True' to the new column when the condition is met.

df['failedvax'] = False
df.loc[(df['HAD_CPOX']==1)&(df['P_NUMVRC']>0),'failedvax'] = True
  • Related