When concatenating strings using dplyr
, group_by()
and mutate()
or summarize ()
with paste()
and collapse
, NA
values are coerced to a string "NA"
.
When using str_c()
instead of paste()
, strings that are concatenated with NA
are removed (?str_c
: whenever a missing value is combined with another string the result will always be missing). When having such combinations of NA
& non-NA
values, how can I remove the NA
instead of the non-NA
in the concatenation?
See my example below:
library(dplyr)
library(stringr)
ID <- c(1,1,2,2,3,4)
string <- c(' asfdas ', 'sdf', NA,'sadf', 'NA', NA)
df <- data.frame(ID, string)
# ID string
# 1 1 asfdas
# 2 1 sdf
# 3 2 <NA> # ID 2 has both NA and non-NA values
# 4 2 sadf #
# 5 3 NA
# 6 4 <NA>
Both,
df%>%
group_by(ID)%>%
summarize(string = paste(string, collapse = "; "))%>%
distinct_all()
and
df_conca <-df%>%
group_by(ID)%>%
dplyr::mutate(string = paste(string, collapse = "; "))%>%
distinct_all()
result in
ID string
1 1 " asfdas ; sdf"
2 2 "NA; sadf"
3 3 "NA"
4 4 "NA" # NA coerced to "NA"
I.e. NA
becomes "NA":
while
df %>%
group_by(ID)%>%
summarize(string = str_c(string, collapse = "; "))
results in:
ID string
1 1 " asfdas ; sdf"
2 2 NA
3 3 "NA"
4 4 NA
I.e. "sadf" is removed according to the str_c
rule: NA
combined with string, results in NA
.
However, I would like to keep the true NA
values (in e.g. 'ID' 4) and the strings only (in e.g. 'ID' 2), as such:
ID string
1 1 " asfdas ; sdf"
2 2 "sadf"
3 3 "NA"
4 4 NA
Ideally, I would like to stay within the dplyr
workflow.
This question is an extension of Concatenating strings / rows using dplyr, group_by & collapse or summarize, but maintain NA values
CodePudding user response:
Using pivot_wider
and unite
library(dplyr)
library(tidyr)
library(data.table)
df %>%
mutate(rn = rowid(ID)) %>%
pivot_wider(names_from = rn, values_from = string) %>%
unite(string, `1`, `2`, na.rm = TRUE, sep = " ; ")%>%
mutate(string = na_if(string, ""))
-output
# A tibble: 4 x 2
ID string
<dbl> <chr>
1 1 " asfdas ; sdf"
2 2 "sadf"
3 3 "NA"
4 4 <NA>
Or may also use coalesce
df %>%
group_by(ID) %>%
summarise(string = na_if(coalesce(str_c(string, collapse = " ; "),
str_c(string[complete.cases(string)], collapse = " ; ")), ""))
-output
# A tibble: 4 x 2
ID string
<dbl> <chr>
1 1 " asfdas ; sdf"
2 2 "sadf"
3 3 "NA"
4 4 <NA>
CodePudding user response:
Here's a solution in the dplyr framework. This removes all the 'NA' values using filter() - which initially loses ID 4 - then replaces the missing ID using a join.
df_IDs <- data.frame(ID = unique(df$ID))
df%>%
group_by(ID)%>%
filter(!is.na(string)) %>%
summarize(string = paste(string, collapse = "; ")) %>%
full_join(df_IDs, by = "ID")
results in
ID string
1 1 " asfdas ; sdf"
2 2 "sadf"
3 3 "NA"
4 4 NA