Home > Mobile >  r string data cleaning
r string data cleaning

Time:10-11

I have dataset with some weird mix of strings like this

   ID   State
   1    NA
   2    IL
   3    IL,IL,IL
   4    OH,IL
   5    NM,NM,AL,AL
   6    FL,FL,FL

I like to

  • replace State values with NA if they are two different states and
  • replace State values with unique values if they are same but repeated.

Expected dataset

   ID   State
   1    NA
   2    IL
   3    IL
   4    NA
   5    NA
   6    FL

I tried paste(unique(df$State), collapse=",") but that did not work. Any suggestions regarding this is much appreciated. Thanks.

CodePudding user response:

Split the State values on comma (,) and replace it with NA values if there are two different values in them. Use distinct to keep only the unique rows for each ID.

library(dplyr)
library(tidyr)

df %>%
  separate_rows(State, sep = ',\\s*') %>%
  group_by(ID) %>%
  mutate(State = replace(State, n_distinct(State) > 1, NA)) %>%
  distinct() %>%
  ungroup()

#     ID State
#  <int> <chr>
#1     1 NA   
#2     2 IL   
#3     3 IL   
#4     4 NA   
#5     5 NA   
#6     6 FL   

CodePudding user response:

Base R solution:

with(
  df,
  vapply(
    State,
    function(x){
      y <- toString(
        unique(
          unlist(
            strsplit(
              x,
              ","
              )
            )
          )
        )
      ifelse(
        grepl(
          ",|NA",
          y),
        NA_character_,
        y
      )
    },
    character(1),
    USE.NAMES = FALSE
  )
)

Tidyverse solution:

library(tidyverse)
str_split(df$State, ",") %>% 
  map(function(x) str_c(unique(x), collapse = ", ")) %>% 
  map_chr(function(y) if_else(str_detect(y, ","), NA_character_, y))

Data:

df <- structure(list(ID = 1:6, State = c(NA, "IL", "IL,IL,IL", "OH,IL", 
"NM,NM,AL,AL", "FL,FL,FL")), class = "data.frame", row.names = c(NA, 
-6L))
  • Related