I have a question about conditional random assignment. A simplified version of the dataset looks like this:
COMPANY  BOARDROLE             INSIDER
A        Acting Director       Yes
B        CEO                   Yes
C        Independent Director  No
D        Chairman              Unknown
E        Chairman              Unknown
F        Member                Unknown
G        Independent Director  Outsider
H        Member                Unknown
I        Member                Unknown
J        Member                Unknown
Now I want to create a fourth column, Insider Presence, that has a value of either 1 or 0. Obviously, if the third column says No or Outsider, there is no insider, so Insider Presence should be 0. I know I can achieve that with the following code:
library(dplyr)
library(stringr)
pattern <- paste(c("No", "Outsider"), collapse = "|")
df <- df %>%
  mutate(InsiderPresence = ifelse(str_detect(INSIDER, pattern), 0, 1))
But now I also want a random 50% of the 'Unknown' rows to be labelled as 1, so that you get, for example, the following output:
COMPANY  BOARDROLE             INSIDER   INSIDER PRESENCE
A        Acting Director       Yes       1
B        CEO                   Yes       1
C        Independent Director  No        0
D        Chairman              Unknown   1
E        Chairman              Unknown   0
F        Member                Unknown   0
G        Independent Director  Outsider  0
H        Member                Unknown   0
I        Member                Unknown   1
J        Member                Unknown   1
I hope someone can help me.
CodePudding user response:
Here is an option
library(tidyverse)
df %>%
    mutate(`Insider Presence` = case_when(
        str_detect(INSIDER, "Yes") ~ 1L,
        str_detect(INSIDER, "No|Outsider") ~ 0L,
        str_detect(INSIDER, "Unknown") ~ sample(c(0L, 1L), n(), replace = TRUE),
        TRUE ~ NA_integer_))
#    COMPANY            BOARDROLE  INSIDER Insider Presence
# 1        A      Acting Director      Yes                1
# 2        B                  CEO      Yes                1
# 3        C Independent Director       No                0
# 4        D             Chairman  Unknown                1
# 5        E             Chairman  Unknown                1
# 6        F               Member  Unknown                0
# 7        G Independent Director Outsider                0
# 8        H               Member  Unknown                1
# 9        I               Member  Unknown                1
# 10       J               Member  Unknown                1
We use case_when to cover all cases; the last line, TRUE ~ NA_integer_, should never be reached, but it is good practice to include a fall-through for debugging. We use sample to draw values with replacement from the set {0, 1}, so each draw is 0 or 1 with 50% probability.
Note that we draw as many samples here as there are total rows N_tot (and not just rows with INSIDER == "Unknown"), because case_when evaluates each right-hand side for every row and then keeps only the values where the condition holds. Since every draw is independent with probability 0.5, any subset of the rows is expected to have a 50% split, but only on average (asymptotically, for large enough sample sizes). With only a handful of Unknown rows, the realised share of 1s can be anywhere from 0% to 100%.
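To get a feel for how far independent draws can stray from an even split, here is a minimal simulation sketch in base R. It assumes six Unknown rows, as in the sample data; the names n_unknown and ones and the seed value are only illustrative.

```r
# Number of rows with INSIDER == "Unknown" in the sample data
n_unknown <- 6

# Arbitrary seed, used only so the simulation can be repeated
set.seed(42)

# Repeat the independent 0/1 draw many times and count the 1s each time
ones <- replicate(10000, sum(sample(c(0L, 1L), n_unknown, replace = TRUE)))

# Share of repetitions that land on exactly 3 of 6;
# the theoretical value is choose(6, 3) / 2^6 = 0.3125
mean(ones == 3)

# Relative frequency of each possible count of 1s (0 through 6)
table(ones) / length(ones)
```

So with six Unknown rows, an exact 3/3 split is expected in only about a third of runs. If an exact split matters, the draws need to be made without replacement from a fixed pool of positions instead.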
Sample data
df <- read.table(text = "COMPANY BOARDROLE INSIDER
A 'Acting Director' Yes
B CEO Yes
C 'Independent Director' No
D Chairman Unknown
E Chairman Unknown
F Member Unknown
G 'Independent Director' Outsider
H Member Unknown
I Member Unknown
J Member Unknown", header = TRUE)
CodePudding user response:
This is nearly the same as MauritsEvers' already-accepted answer. The behavior in this answer is slightly different: it guarantees that exactly 50% of the Unknown rows (rounded up if their number is odd) will be set to 1, instead of a random ratio (which could be anywhere from 0% to 100%).
library(dplyr)
library(stringr)
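df %>%
  # group the Unknown rows together so n() counts only them
  group_by(grp = grepl("Unknown", INSIDER)) %>%
  mutate(Presence = case_when(
    str_detect(INSIDER, "Yes") ~ 1L,
    str_detect(INSIDER, "No|Outsider") ~ 0L,
    # as.integer() keeps every branch the same type (older dplyr versions require this)
    str_detect(INSIDER, "Unknown") ~ as.integer(row_number() %in% sample(n(), size = ceiling(n()/2))),
    TRUE ~ NA_integer_)) %>%
  ungroup() %>%
  select(-grp)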
# # A tibble: 10 x 4
#    COMPANY BOARDROLE            INSIDER  Presence
#    <chr>   <chr>                <chr>       <int>
#  1 A       Acting Director      Yes             1
#  2 B       CEO                  Yes             1
#  3 C       Independent Director No              0
#  4 D       Chairman             Unknown         0
#  5 E       Chairman             Unknown         0
#  6 F       Member               Unknown         1
#  7 G       Independent Director Outsider        0
#  8 H       Member               Unknown         1
#  9 I       Member               Unknown         0
# 10 J       Member               Unknown         1
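For comparison, the same exact-half idea can be sketched in base R without any packages. This is only an illustration under assumptions: the data frame is rebuilt inline with the COMPANY and INSIDER columns from the question, the helper names (presence, unknown, chosen) are made up, and the seed is arbitrary.

```r
# Rebuild the relevant columns of the sample data
df <- data.frame(
  COMPANY = LETTERS[1:10],
  INSIDER = c("Yes", "Yes", "No", "Unknown", "Unknown", "Unknown",
              "Outsider", "Unknown", "Unknown", "Unknown"),
  stringsAsFactors = FALSE
)

# Arbitrary seed, used only so the result can be repeated
set.seed(1)

# Deterministic part: Yes -> 1, No/Outsider -> 0, anything else NA for now
presence <- ifelse(df$INSIDER == "Yes", 1L,
            ifelse(df$INSIDER %in% c("No", "Outsider"), 0L, NA_integer_))

# Row positions of the Unknown rows
unknown <- which(df$INSIDER == "Unknown")

# Pick ceiling(n / 2) of those positions without replacement;
# sample.int() on the count avoids the sample(x) surprise when x has length 1
chosen <- unknown[sample.int(length(unknown), size = ceiling(length(unknown) / 2))]

# Unknown rows default to 0, the chosen half is set to 1
presence[unknown] <- 0L
presence[chosen] <- 1L

df$Presence <- presence
df
```

Which Unknown rows receive the 1 depends on the seed, but the number of 1s among them is always ceiling(n / 2), here 3 out of 6.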