create derived column and sample from distribution when missing

Time:09-25

I am trying to impute missing values in one column by sampling from a given discrete distribution. I chose gender as a simple example. This is my attempt:

library(dplyr)

set.seed(42)

df <- data.frame(
    gender = c("f", "m", "m", NA, NA, NA, NA, NA, NA, NA, NA)
)
df

df <- df %>%
    mutate(
        derived_gender_of_casualty = ifelse(
                            is.na(gender)
                            , sample(x = c("m", "f"), prob = c(0.9, 0.1))
                            , as.character(gender))
    )
df

It creates:

[Image: the resulting data frame]

However, it does not look right: the imputed genders are split evenly between m and f. I would expect more males given the probability of 0.9 (this is not meant to be controversial; in hindsight I should have picked another example, haha!).

Any ideas? Thanks!

CodePudding user response:

Let's see what sample(x = c("m", "f"), prob = c(0.9, 0.1)) returns

sample(x = c("m", "f"), prob = c(0.9, 0.1))
#[1] "m" "f"

Try this multiple times. Do you see any change? The order may change, but do you see any effect of 0.9 or 0.1? Let's repeat this 100 times and count how often 'm' and 'f' occur.

table(replicate(100, sample(x = c("m", "f"), prob = c(0.9, 0.1))))

#  f   m 
#100 100 

The ratio is still the same. By default, sample has replace = FALSE, and when you don't specify size, it defaults to the length of the vector passed. Each call therefore returns a permutation of c("m", "f"), and the probabilities only influence the order. You get the output you saw because sample(x = c("m", "f"), prob = c(0.9, 0.1)) returns two values, one m and one f, and ifelse recycles that length-2 vector across all rows.
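The recycling can be seen in isolation with base R alone. This is a minimal sketch; the names `yes`, `test` and `imputed` are introduced here for illustration and do not come from the question.

```r
set.seed(42)

# Without size and replace, sample() returns a permutation of both values
yes <- sample(x = c("m", "f"), prob = c(0.9, 0.1))
length(yes)  # 2

# Pretend there are six missing rows, so the test is TRUE six times
test <- rep(TRUE, 6)

# ifelse() recycles the length-2 `yes` vector to the length of `test`,
# so the result alternates between the two values
imputed <- ifelse(test, yes, "unused")
imputed
table(imputed)
```

Whatever the seed, each value appears in exactly half of the rows, which matches the even split seen in the question.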


Solution

library(dplyr)

set.seed(123)
df %>%
  mutate(
    derived_gender_of_casualty = ifelse(
      is.na(gender)
      , sample(x = c("m", "f"), size = n(), prob = c(0.9, 0.1), replace = TRUE)
      , as.character(gender))
  )

#   gender derived_gender_of_casualty
#1       f                          f
#2       m                          m
#3       m                          m
#4    <NA>                          m
#5    <NA>                          f
#6    <NA>                          m
#7    <NA>                          m
#8    <NA>                          m
#9    <NA>                          m
#10   <NA>                          m
#11   <NA>                          f
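To check that the probabilities take effect once size and replace = TRUE are given, one can draw a larger sample and look at the proportions. A rough sketch in base R; the sample size of 10000 and the name `draws` are arbitrary choices, not from the question.

```r
set.seed(123)

# With replace = TRUE each draw is independent, so prob applies to every draw
draws <- sample(x = c("m", "f"), size = 10000, prob = c(0.9, 0.1), replace = TRUE)

# Share of each value; the share of "m" should come out close to 0.9
prop.table(table(draws))
```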

CodePudding user response:

A base R option:

set.seed(42)

df <- data.frame(
  gender = c("f", "m", "m", NA, NA, NA, NA, NA, NA, NA, NA)
)

df$gender[is.na(df$gender)] <- sample(x = c("m", "f"),
                                      size = sum(is.na(df$gender)),
                                      prob = c(0.9, 0.1),
                                      replace = TRUE)

df
#>    gender
#> 1       f
#> 2       m
#> 3       m
#> 4       f
#> 5       f
#> 6       m
#> 7       m
#> 8       m
#> 9       m
#> 10      m
#> 11      m

Created on 2021-09-24 by the reprex package (v2.0.0)
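If the same imputation is needed for several columns, the assignment above could be wrapped in a small function. This is only a sketch: `impute_from_distribution` and its arguments are hypothetical names, not part of base R or dplyr.

```r
# Hypothetical helper: fill NAs in `x` with draws from `values`, weighted by `prob`
impute_from_distribution <- function(x, values, prob) {
  stopifnot(length(values) == length(prob))
  x <- as.character(x)  # also handles factor input
  missing <- is.na(x)
  # One independent draw per missing entry
  x[missing] <- sample(x = values, size = sum(missing), prob = prob, replace = TRUE)
  x
}

set.seed(42)
gender <- c("f", "m", "m", NA, NA, NA, NA, NA, NA, NA, NA)
imputed <- impute_from_distribution(gender, values = c("m", "f"), prob = c(0.9, 0.1))
imputed
```

Observed values are left untouched, and only as many values are drawn as there are missing entries.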

CodePudding user response:

Adding rowwise in your chain should do it:

df <- df %>%
  rowwise() %>%
  mutate(
    derived_gender_of_casualty = ifelse(
      is.na(gender)
      , sample(x = c("m", "f"), size = 1, prob = c(0.9, 0.1))
      , as.character(gender))
  )
df

Here's a little experiment to show that it works. I create a df with 1000 rows, all missing gender. In the imputed data, the proportion of "m" is approximately 0.9.

set.seed(42)

df <- data.frame(
  gender = rep(NA, 1000)
)
df

df <- df %>%
  rowwise %>% 
  mutate(
    derived_gender_of_casualty = ifelse(
      is.na(gender)
      , sample(x = c("m", "f"), size = 1, prob = c(0.9, 0.1))
      , as.character(gender))
  )
mean(df$derived_gender_of_casualty == "m")
#[1] 0.899
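For comparison, what rowwise does here can be mimicked in base R: one independent call to sample per row. A sketch under that reading, with `one_draw` and `per_row` as illustrative names.

```r
set.seed(42)

# One draw of size 1; prob matters because only a single value is taken
one_draw <- function() sample(x = c("m", "f"), size = 1, prob = c(0.9, 0.1))

# Call it once per row, mirroring a per-row evaluation over 1000 rows
per_row <- vapply(seq_len(1000), function(i) one_draw(), character(1))

# Proportion of "m" among the draws
mean(per_row == "m")
```

A single call with size = 1000 and replace = TRUE draws from the same distribution and avoids the per-row overhead.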
