I have dataset with some weird mix of strings like this
ID State
1 NA
2 IL
3 IL,IL,IL
4 OH,IL
5 NM,NM,AL,AL
6 FL,FL,FL
I like to
- replace State values with NA if they are two different states and
- replace State values with unique values if they are same but repeated.
Expected dataset
ID State
1 NA
2 IL
3 IL
4 NA
5 NA
6 FL
I tried paste(unique(df$State), collapse=",")
but that did not work. Any suggestions regarding this is much appreciated. Thanks.
CodePudding user response:
Split the State
values on comma (,
) and replace it with NA
values if there are two different values in them. Use distinct
to keep only the unique rows for each ID
.
library(dplyr)
library(tidyr)
df %>%
separate_rows(State, sep = ',\\s*') %>%
group_by(ID) %>%
mutate(State = replace(State, n_distinct(State) > 1, NA)) %>%
distinct() %>%
ungroup()
# ID State
# <int> <chr>
#1 1 NA
#2 2 IL
#3 3 IL
#4 4 NA
#5 5 NA
#6 6 FL
CodePudding user response:
Base R solution:
with(
df,
vapply(
State,
function(x){
y <- toString(
unique(
unlist(
strsplit(
x,
","
)
)
)
)
ifelse(
grepl(
",|NA",
y),
NA_character_,
y
)
},
character(1),
USE.NAMES = FALSE
)
)
Tidyverse solution:
library(tidyverse)
str_split(df$State, ",") %>%
map(function(x) str_c(unique(x), collapse = ", ")) %>%
map_chr(function(y) if_else(str_detect(y, ","), NA_character_, y))
Data:
df <- structure(list(ID = 1:6, State = c(NA, "IL", "IL,IL,IL", "OH,IL",
"NM,NM,AL,AL", "FL,FL,FL")), class = "data.frame", row.names = c(NA,
-6L))