I have a column that lists the race/ethnicity of individuals. I am trying to make it so that if the cell contains an 'H' then I only want H. Similarly, if the cell contains an 'N' then I want an N. Finally, if the cell has multiple races, not including H or N, then I want it to be M. Below is how it is listed currently and the desired output.
Current output
People | Race/Ethnicity
PersonA| HAB
PersonB| NHB
PersonC| AB
PersonD| ABW
PersonE| A
Desired output
PersonA| H
PersonB| N
PersonC| M
PersonD| M
PersonE| A
CodePudding user response:
You can try the following dplyr approach, which combines grepl with dplyr::case_when. It first searches for N values; then, among the rows without an N, it searches for H values; finally, among the rows with neither an H nor an N, it assigns M to those with more than one race and keeps the original letter for those with only one race (assuming each race is represented by a single character).
A base R approach is below as well: it needs no dependencies but is less elegant.
Data
df <- read.table(text = "person ethnicity
PersonA HAB
PersonB NHB
PersonC AB
PersonD ABW
PersonE A", header = TRUE)
dplyr (note order matters given your priority)
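library(dplyr)
df %>% mutate(eth2 = case_when(
  grepl("N", ethnicity) ~ "N",   # N takes priority over everything else
  grepl("H", ethnicity) ~ "H",   # then H
  !grepl("H|N", ethnicity) & nchar(ethnicity) > 1 ~ "M",   # several races, no H or N
  TRUE ~ ethnicity               # a single race is kept as is
))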
You could also do it "manually" in base R by indexing (note order matters here too: N is assigned after H so that it overwrites H where both are present):
df[grepl("H", df$ethnicity), "eth2"] <- "H"
df[grepl("N", df$ethnicity), "eth2"] <- "N"
df[!grepl("H|N", df$ethnicity) & nchar(df$ethnicity) > 1, "eth2"] <- "M"
df[nchar(df$ethnicity) %in% 1, "eth2"] <- df$ethnicity[nchar(df$ethnicity) %in% 1]
In both cases the output is:
# person ethnicity eth2
# 1 PersonA HAB H
# 2 PersonB NHB N
# 3 PersonC AB M
# 4 PersonD ABW M
# 5 PersonE A A
Note this is based on your comment about assigning priority (an N anywhere in the cell supersedes H, even when the cell contains both N and H, etc.).
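If the priority list might grow beyond N and H, the same rule could be wrapped in a small base R function. This is only a sketch: `collapse_race` and its `priority` and `multi` arguments are hypothetical names, not part of any package, and it assumes each race is a single character, as in the example data.

```r
# Hypothetical helper: collapse multi-letter codes according to a priority order.
# Assumes each race is encoded as a single character.
collapse_race <- function(x, priority = c("N", "H"), multi = "M") {
  x <- as.character(x)
  # Default: single-letter codes are kept, anything longer becomes `multi`
  out <- ifelse(nchar(x) > 1, multi, x)
  # Apply priorities from lowest to highest so the highest one is written last
  for (p in rev(priority)) {
    out[grepl(p, x, fixed = TRUE)] <- p
  }
  out
}

collapse_race(c("HAB", "NHB", "AB", "ABW", "A"))
# [1] "H" "N" "M" "M" "A"
```

With the data above it could be used as df$eth2 <- collapse_race(df$ethnicity).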
CodePudding user response:
We could use str_extract. When the number of characters in the column is greater than 1, extract the 'N' and the 'H' separately, then coalesce the extracted elements along with 'M' (thus, if there is no match, we get 'M'; otherwise the result follows the order in which the inputs are passed to coalesce). For the other case, i.e. when the number of characters is 1, return the column value. Thus, 'N' supersedes 'H' no matter the position in the string.
library(dplyr)
library(stringr)
df1 %>%
  mutate(output = case_when(
    nchar(`Race/Ethnicity`) > 1 ~ coalesce(str_extract(`Race/Ethnicity`, 'N'),
                                           str_extract(`Race/Ethnicity`, 'H'), "M"),
    TRUE ~ `Race/Ethnicity`))
-output
People Race/Ethnicity output
1 PersonA HAB H
2 PersonB NHB N
3 PersonC AB M
4 PersonD ABW M
5 PersonE A A
data
df1 <- structure(list(People = c("PersonA", "PersonB", "PersonC", "PersonD",
"PersonE"), `Race/Ethnicity` = c("HAB", "NHB", "AB", "ABW", "A"
)), class = "data.frame", row.names = c(NA, -5L))
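To see why the nchar() check is needed, it may help to look at the intermediate values. The following is a minimal sketch on the same five codes; the names `x`, `n_hit`, `h_hit` and `collapsed` are only illustrative.

```r
library(dplyr)
library(stringr)

x <- c("HAB", "NHB", "AB", "ABW", "A")

n_hit <- str_extract(x, "N")  # "N" where present, NA otherwise
h_hit <- str_extract(x, "H")  # "H" where present, NA otherwise

# coalesce() takes the first non-NA value per position; "M" is the fallback
collapsed <- coalesce(n_hit, h_hit, "M")
collapsed
# [1] "H" "N" "M" "M" "M"
```

Without the check, the single-race value "A" would also fall through to "M", which is why the case_when above applies coalesce only when there is more than one character.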