Why is My Multiple Selection Not Working in R?

I have a dataset called PimaDiabetes.

library(tidyverse)

PimaDiabetes <- read.csv("PimaDiabetes.csv")
# Treat zeros in columns 2 to 8 as missing
PimaDiabetes[2:8][PimaDiabetes[2:8] == 0] <- NA
mean_1 <- 40.5
mean_0 <- 30.7
p.tib <- PimaDiabetes %>%
  as_tibble()

Here is a snapshot of the data:

And the dataset can be pulled from here.

I'm trying to group the dataset by Outcome (0 or 1) and impute a different value (the median of the respective group) into a column depending on the outcome.

For instance, the fifth column, Insulin, has some NA values where the Outcome is 1 and some where the Outcome is 0. I would like to fill in mean_1 (40.5) when the value is NA and the Outcome is 1, and mean_0 (30.7) when the value is NA and the Outcome is 0.

I've gotten advice prior to this and tried:

p.tib %>% 
  mutate(
    p.tib$Insulin = case_when((p.tib$Outcome == 0) & (is.na(p.tib$Insulin)) ~ IN_0,
                                           (p.tib$Outcome == 1) & (is.na(p.tib$Insulin) ~ IN_1,
                                            TRUE ~ p.tib$Insulin))

However, it always yields the following error:

Error: unexpected '=' in "p.tib %>% mutate(p.tib$Insulin ="

Can I know where things are going wrong, please?

Answer:

Setup

This dataset also appears to ship with the pdp package in R, under the name pima. The main differences from your data are that the column names are lowercase (insulin rather than Insulin) and that the Outcome variable is called diabetes and is labeled "neg" and "pos" instead of 0 and 1. I have loaded that package and the tidyverse to help.

#### Load Libraries ####
library(pdp)
library(tidyverse)

First I transformed the data into a tibble so it was easier for me to read.

#### Reformat Data ####
p.tib <- pima %>% 
  as_tibble()

Printing p.tib, we can see that the insulin variable has many NA values in the first rows, so the effect of the imputation will be easier to see there than in the other variables that have missing data. I therefore use insulin for the demonstration, but the idea is the same for any other column.

# A tibble: 768 × 9
   pregnant glucose press…¹ triceps insulin  mass pedig…²   age diabe…³
      <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl> <fct>  
 1        6     148      72      35      NA  33.6   0.627    50 pos    
 2        1      85      66      29      NA  26.6   0.351    31 neg    
 3        8     183      64      NA      NA  23.3   0.672    32 pos    
 4        1      89      66      23      94  28.1   0.167    21 neg    
 5        0     137      40      35     168  43.1   2.29     33 pos    
 6        5     116      74      NA      NA  25.6   0.201    30 neg    
 7        3      78      50      32      88  31     0.248    26 pos    
 8       10     115      NA      NA      NA  35.3   0.134    29 neg    
 9        2     197      70      45     543  30.5   0.158    53 pos    
10        8     125      96      NA      NA  NA     0.232    54 pos    
# … with 758 more rows, and abbreviated variable names ¹pressure,
#   ²pedigree, ³diabetes
# ℹ Use `print(n = ...)` to see more rows
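
To see which columns contain missing data without scrolling through the printout, the NA values can be counted per column. The following is a minimal sketch on a small made-up tibble (the name toy and its values are hypothetical stand-ins for p.tib, not the real data):

```r
library(dplyr)

# Hypothetical stand-in for p.tib; the values are invented for illustration
toy <- tibble(
  glucose  = c(148, 85, NA, 89),
  insulin  = c(NA, NA, NA, 94),
  diabetes = factor(c("pos", "neg", "pos", "neg"))
)

# Count the missing values in every column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Running the same summarise call on p.tib should give one count per column, which makes it easy to pick the columns that need imputation.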

Finding the Mean

After glimpsing the data, I checked the mean insulin for the groups with and without diabetes. group_by groups the rows by diabetes, and summarise collapses each group into a single row that holds its mean in a new mean_insulin column. The na.rm argument drops the NA values before the mean is computed:

#### Check Mean by Group ####
p.tib %>% 
  group_by(diabetes) %>% 
  summarise(mean_insulin = mean(insulin, na.rm = TRUE))

These are the values to impute. The groups are labeled "neg" (0 in your data) and "pos" (1 in your data). You can convert the labels to those numbers if you want, but I left them as they are because they are easier to read:

# A tibble: 2 × 2
  diabetes mean_insulin
  <fct>           <dbl>
1 neg              130.
2 pos              207.
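
If you do want the 0/1 coding from your own data, one possible way to convert the labels is sketched below. The tibble toy and the new column name Outcome are assumptions made for illustration; only the "neg"/"pos" labels come from the pima data:

```r
library(dplyr)

# Hypothetical stand-in holding only the diabetes factor
toy <- tibble(diabetes = factor(c("pos", "neg", "pos", "neg")))

# Recode "pos" as 1 and everything else ("neg") as 0
recoded <- toy %>%
  mutate(Outcome = if_else(diabetes == "pos", 1L, 0L))

recoded
```

The comparison diabetes == "pos" works directly on the factor, so no conversion to character is needed first.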

Mean Imputation

From there, we use case_when, which works like a vectorized chain of if/else statements, inside mutate to transform insulin. Note that inside mutate the columns are referred to by their bare names: the left-hand side of = must be a plain column name, and writing p.tib$Insulin = ... there is what triggers the unexpected '=' error. case_when is set up with three tests. First, if the group is negative and the value is NA, we replace it with the mean of 130. If the group is positive and the value is NA, we use 207. For all other values (the TRUE part), we keep the existing value of insulin. The & operator means that a replacement only takes place when both tests are true. What follows the ~ is the value to use.

#### Impute Mean ####
p.tib %>% 
  mutate(
    insulin = case_when(
      (diabetes == "neg") & (is.na(insulin)) ~ 130,
      (diabetes == "pos") & (is.na(insulin)) ~ 207,
      TRUE ~ insulin
    )
  )

You will now see that the missing values in the first rows of insulin have been replaced with the group means, while the observed values are left alone:

# A tibble: 768 × 9
   pregnant glucose press…¹ triceps insulin  mass pedig…²   age diabe…³
      <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl> <fct>  
 1        6     148      72      35     207  33.6   0.627    50 pos    
 2        1      85      66      29     130  26.6   0.351    31 neg    
 3        8     183      64      NA     207  23.3   0.672    32 pos    
 4        1      89      66      23      94  28.1   0.167    21 neg    
 5        0     137      40      35     168  43.1   2.29     33 pos    
 6        5     116      74      NA     130  25.6   0.201    30 neg    
 7        3      78      50      32      88  31     0.248    26 pos    
 8       10     115      NA      NA     130  35.3   0.134    29 neg    
 9        2     197      70      45     543  30.5   0.158    53 pos    
10        8     125      96      NA     207  NA     0.232    54 pos    
# … with 758 more rows, and abbreviated variable names ¹pressure,
#   ²pedigree, ³diabetes
# ℹ Use `print(n = ...)` to see more rows
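
Applied to the column names in your own data (Insulin and Outcome coded 0/1), the same pattern could look like the sketch below. The tibble toy and its values are invented for illustration; mean_0 and mean_1 are the values from your question. The second variant is an alternative I am suggesting, not something from your original code: it computes each group's mean from the data instead of hard-coding it.

```r
library(dplyr)

# Hypothetical rows that mimic the question's columns
toy <- tibble(
  Insulin = c(NA, NA, 94, 168, NA),
  Outcome = c(1, 0, 0, 1, 1)
)

mean_1 <- 40.5  # value for Outcome 1, taken from the question
mean_0 <- 30.7  # value for Outcome 0, taken from the question

# Variant 1: fixed values, columns referenced by bare name inside mutate()
fixed <- toy %>%
  mutate(
    Insulin = case_when(
      Outcome == 0 & is.na(Insulin) ~ mean_0,
      Outcome == 1 & is.na(Insulin) ~ mean_1,
      TRUE ~ Insulin
    )
  )

# Variant 2: fill each NA with the mean of its own Outcome group
by_group <- toy %>%
  group_by(Outcome) %>%
  mutate(Insulin = if_else(is.na(Insulin),
                           mean(Insulin, na.rm = TRUE),
                           Insulin)) %>%
  ungroup()

fixed
by_group
```

Neither variant changes toy itself; assign the result back (for example p.tib <- p.tib %>% mutate(...)) if you want to keep the imputed column.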