r string data cleaning-CodePudding

I have dataset with some weird mix of strings like this

   ID   State
   1    NA
   2    IL
   3    IL,IL,IL
   4    OH,IL
   5    NM,NM,AL,AL
   6    FL,FL,FL

I like to

replace State values with NA if they are two different states and
replace State values with unique values if they are same but repeated.

Expected dataset

   ID   State
   1    NA
   2    IL
   3    IL
   4    NA
   5    NA
   6    FL

I tried paste(unique(df$State), collapse=",") but that did not work. Any suggestions regarding this is much appreciated. Thanks.

CodePudding user response：

Split the State values on comma (,) and replace it with NA values if there are two different values in them. Use distinct to keep only the unique rows for each ID.

library(dplyr)
library(tidyr)

df %>%
  separate_rows(State, sep = ',\\s*') %>%
  group_by(ID) %>%
  mutate(State = replace(State, n_distinct(State) > 1, NA)) %>%
  distinct() %>%
  ungroup()

#     ID State
#  <int> <chr>
#1     1 NA   
#2     2 IL   
#3     3 IL   
#4     4 NA   
#5     5 NA   
#6     6 FL

CodePudding user response：

Base R solution:

with(
  df,
  vapply(
    State,
    function(x){
      y <- toString(
        unique(
          unlist(
            strsplit(
              x,
              ","
              )
            )
          )
        )
      ifelse(
        grepl(
          ",|NA",
          y),
        NA_character_,
        y
      )
    },
    character(1),
    USE.NAMES = FALSE
  )
)

Tidyverse solution:

library(tidyverse)
str_split(df$State, ",") %>% 
  map(function(x) str_c(unique(x), collapse = ", ")) %>% 
  map_chr(function(y) if_else(str_detect(y, ","), NA_character_, y))

Data:

df <- structure(list(ID = 1:6, State = c(NA, "IL", "IL,IL,IL", "OH,IL", 
"NM,NM,AL,AL", "FL,FL,FL")), class = "data.frame", row.names = c(NA, 
-6L))