Change cell contents if it contains a certain letter

I have a column that lists the race/ethnicity of individuals. If a cell contains an 'H', I want it to be just H. Similarly, if a cell contains an 'N', I want it to be just N. Finally, if the cell lists multiple races, none of them H or N, I want it to be M. Below are the current values and the desired output.

Current output

People | Race/Ethnicity

PersonA| HAB
PersonB| NHB
PersonC| AB
PersonD| ABW
PersonE| A

Desired output
PersonA| H
PersonB| N
PersonC| M
PersonD| M
PersonE| A

CodePudding user response：

You can try the following dplyr approach, which combines grepl with dplyr::case_when. It first checks for N; among the rows without an N, it checks for H; and among the rows with neither, it assigns M to those with more than one race and keeps the original letter for those with only one (assuming each race is represented by a single character).

A base R approach is below as well: it needs no dependencies but is less elegant.

Data

df <- read.table(text = "person ethnicity 
PersonA HAB
PersonB NHB
PersonC AB
PersonD ABW
PersonE A", header = TRUE)

dplyr (note order matters given your priority)

library(dplyr)

# case_when() uses the first condition that is TRUE for each row
df %>% mutate(eth2 = case_when(
  grepl("N", ethnicity) ~ "N",
  grepl("H", ethnicity) ~ "H",
  !grepl("H|N", ethnicity) & nchar(ethnicity) > 1 ~ "M",
  TRUE ~ ethnicity
))

You could also do it "manually" in base R by indexing (note order matters given your priority, since later assignments overwrite earlier ones):

df[grepl("H", df$ethnicity), "eth2"] <- "H"
df[grepl("N", df$ethnicity), "eth2"] <- "N"
df[!grepl("H|N", df$ethnicity) & nchar(df$ethnicity) > 1, "eth2"] <- "M"
df[nchar(df$ethnicity) %in% 1, "eth2"] <- df$ethnicity[nchar(df$ethnicity) %in% 1]

In both cases the output is:

#    person ethnicity eth2
# 1 PersonA       HAB    H
# 2 PersonB       NHB    N
# 3 PersonC        AB    M
# 4 PersonD       ABW    M
# 5 PersonE         A    A

Note this is based on your comment about priority (an N anywhere in the string wins, even when an H is also present).
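
If it helps to check that priority in isolation, here is a minimal base R sketch. The helper name collapse_race and the extra test value "HNB" are made up for illustration and are not part of the question's data; it also assumes there are no missing values in the column.

```r
# Hypothetical helper (not from the original answer): collapse a race code
# with the priority N > H > "M" for multi-letter codes > the single letter.
collapse_race <- function(x) {
  ifelse(grepl("N", x, fixed = TRUE), "N",
  ifelse(grepl("H", x, fixed = TRUE), "H",
  ifelse(nchar(x) > 1, "M", x)))
}

# "HNB" is an invented value to show that the position of N does not matter
codes <- c("HAB", "NHB", "HNB", "AB", "ABW", "A")
result <- collapse_race(codes)
print(data.frame(codes, result))
```

Both "NHB" and "HNB" come out as N here, because the N check runs before the H check.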

CodePudding user response：

We could use str_extract. When the number of characters in the column is greater than 1, extract 'N' and 'H' separately, then coalesce the extracted elements along with 'M'. If there is no match we get 'M'; otherwise the result follows the order in which the inputs were passed to coalesce. In the other case, i.e. when the number of characters is 1, return the column value. Thus 'N' supersedes 'H' no matter its position in the string.

library(dplyr)
library(stringr)
df1 %>%
   mutate(output = case_when(nchar(`Race/Ethnicity`) > 1 
   ~ coalesce(str_extract(`Race/Ethnicity`, 'N'), 
              str_extract(`Race/Ethnicity`, 'H'), "M"), 
    TRUE ~ `Race/Ethnicity`))

-output

   People Race/Ethnicity output
1 PersonA            HAB      H
2 PersonB            NHB      N
3 PersonC             AB      M
4 PersonD            ABW      M
5 PersonE              A      A
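
To see why the order of arguments to coalesce decides the priority, the intermediate vectors can be inspected on their own. This is only a small sketch on three of the values from the question; the names n_hit, h_hit and picked are made up for illustration.

```r
library(dplyr)
library(stringr)

codes <- c("HAB", "NHB", "AB")

# str_extract() returns NA where the pattern does not match
n_hit <- str_extract(codes, "N")
h_hit <- str_extract(codes, "H")

# coalesce() keeps, element by element, the first non-missing value,
# so an 'N' match wins over an 'H' match, and "M" is the fallback
picked <- coalesce(n_hit, h_hit, "M")
print(data.frame(codes, n_hit, h_hit, picked))
```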

data

df1 <- structure(list(People = c("PersonA", "PersonB", "PersonC", "PersonD", 
"PersonE"), `Race/Ethnicity` = c("HAB", "NHB", "AB", "ABW", "A"
)), class = "data.frame", row.names = c(NA, -5L))