I am working on a programming assignment and am seeking some help out here as the course openly encourages us to.
In the course of solving one of the problems, I need to find out from a dataset of children having received various vaccinations and their medical history the subset of children who both have (a) gotten chickenpox (HAD_CPOX, yes or no) AND (b) were vaccinated against chickenpox (P_NUMVRC, number of doses of chickenpox vaccine received)
The question is asking us to generate ratios by sex. So far I have:
- Extracted the csv and turned it into a pandas dataframe
- Created a subset dataframe titled 'subset_chickenpox' from the previous one containing only the relevant information (HAD_CPOX, SEX, P_NUMVRC)
- Further split each of these subsets between male and females resulting in the dataframes subset_COP_male and subset_COP_female and dropped NAs
I am now seeking to create a new column 'failedvax' in each of these subsets based on the following logic, containing two conditions:
- If the column HAD_CPOX is equal to 1 AND P_NUMVRC is not 0, then failedvax is True.
- Else failedvax is False.
I have tried various approaches, first using np.where which generated an error, and now I tried a different approach using a function, which also yields an error.
#Chickenpox question
#Relevant variable in original df is HAD_CPOX, sex is SEX, having been vacced is P_NUMVRC
#So first we subset for all these selecting all rows where df columns are these.
subset_chickenpox=df.loc[:,['HAD_CPOX', 'SEX', 'P_NUMVRC']]
#We are asked to calculate the ratio of children who were vaccinated but did get it
#versus the ratio of children who were vaccinated and did not get it
#In other words we want the ratio of rows where HAD_CPOX=1 and P_NUMVRC is 0
#vs the ratio of columns where HAD_CPOX=2 and P_NUMVRC != 0
subset_chickenpox.dropna()
#Because question by sex we do sex subsets
subset_COP_male=subset_chickenpox[(subset_chickenpox['SEX']==1)]
subset_COP_female=subset_chickenpox[(subset_chickenpox['SEX']==2)]
subset_COP_male.dropna()
subset_COP_female.dropna()
#I will now calculate a new column: failed_vax
#First we need to create a function to hold our conditional logic
#If HAD_CPOX=1 AND P_NUMVRC != 0 then failed_vax is true
def failed_vax(x):
if subset_COP_male['HAD_CPOX']==1 & subset_COP_male['P_NUMVRC']!=0:
return True
else:
return False
subset_COP_male['failed_vax'] = subset_COP_male.apply(failed_vax)
From this I get
``` TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool] ```
An example of the data (from subset_COP_female after dropping NAS) would be
```
INDEX HAD_CPOX SEX P_NUMVRC
3 2 2 1.0
4 2 2 0.0
6 2 2 1.0
7 2 2 0.0
20 2 2 1.0
... ... ... ...
28437 2 2 1.0
28440 2 2 1.0
28446 2 2 1.0
28452 2 2 1.0
28454 2 2 1.0
```
Thank you very much in advance! Would appreciate any pointers towards creating a new column based on the logic explained above. It's also not the first time I run into such a problem so I'd love to understand a bit more about why my code isn't working.
Best regards to all
CodePudding user response:
It's as simple as:
subset_COP_male['failed_vax'] = subset_COP_male['HAD_CPOX'].eq(1) & subset_COP_male['P_NUMVRC'].ne(0)
CodePudding user response:
You need to add your new column before you create your subsets.
This should work:
First add a new column 'failedvax' filled with 'False'.
Then apply the condition your asking, assign 'True' to the new column when the condition is met.
df['failedvax'] = False
df.loc[(df['HAD_CPOX']==1)&(df['P_NUMVRC']>0),'failedvax'] = True