I want to merge rows together that have the same word in another column. The solution should be in R Base. The table entries are all comma separated strings (characters), not lists. So as shown below the color shades for the same color should be added in a string in one row, instead of spanning over several rows. Also there should be no duplicates in the color shades column.
I have already tried with:
aggregate(df["Color shades"], df["Color"], paste, collapse=", ")
As well as with:
aggregate(Color shades ~ Color ,df ,toString)
But that didn't lead to the desired result.
Dataframe:
df <- data.frame(colorshades = c("turquoise, babyblue", "royal blue, true blue",
"navy blue, true blue"), colors = c("blue", "blue", "blue"))
Currently:
Color shades | Color |
---|---|
turquoise, babyblue | blue |
royal blue, true blue | blue |
navy blue, true blue | blue |
Desired Output:
Color shades | Color |
---|---|
turquoise, babyblue, royal blue, true blue, navy blue | blue |
CodePudding user response:
Convert
"Color shades"
to a list-column:lapply(strsplit(df[["Color shades"]], ","), trimws) # [[1]] # [1] "turquoise" "babyblue" # [[2]] # [1] "royal blue" "true blue" # [[3]] # [1] "navy blue" "true blue" df[["Color shades"]] <- lapply(strsplit(df[["Color shades"]], ","), trimws) df # Color shades Color # 1 turquoise, babyblue blue # 2 royal blue, true blue blue # 3 navy blue, true blue blue
Aggregate with
unique
:aggregate(df["Color shades"], df["Color"], function(z) paste(unique(unlist(z)), collapse=", ")) # Color Color shades # 1 blue turquoise, babyblue, royal blue, true blue, navy blue
or, keeping with the list-column method,
aggregate(df["Color shades"], df["Color"], function(z) list(unique(unlist(z)))) # Color Color shades # 1 blue turquoise, babyblue, royal blue, true blue, navy blue str(aggregate(df["Color shades"], df["Color"], function(z) list(unique(unlist(z))))) # 'data.frame': 1 obs. of 2 variables: # $ Color : chr "blue" # $ Color shades:List of 1 # ..$ : chr "turquoise" "babyblue" "royal blue" "true blue" ...
There are often (but not always) advantages to dealing with list-columns vice comma-separated values. If your use-case is such that you frequently want to look at individual elements within one of these fields, you'll find yourself dealing deeply with regexes and/or repeatedly using strsplit
to separate. With list-columns, one can use tools like unique
and %in%
with abandon (though admittedly one should really become more comfortable with lapply
/sapply
, and many base-R tools for aggregation do not always work consistently with list-columns).
Data
df <- structure(list(`Color shades` = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"), Color = c("blue", "blue", "blue")), class = "data.frame", row.names = c(NA, -3L))
CodePudding user response:
if you can use the library "dplyr" you can also do it this way :
library(dplyr)
df <- data.frame("Colorshade" = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"),
"Color" = c(rep("blue", 3)),
stringsAsFactors = FALSE)
my_df <- df %>% group_by(Color) %>% mutate(Colorshade = paste(unique(sort(str_split(string = paste(df$Colorshade, collapse = ", "), pattern = ", ", simplify = TRUE))), collapse = ", ")) %>% first()
CodePudding user response:
data.table
solution
library(data.table)
setDT(df)[, .(Color_shades = paste0(unique(unlist(strsplit(colorshades, ", "))),
collapse = ", ")),
by = .(colors)]
# colors Color_shades
# 1: blue turquoise, babyblue, royal blue, true blue, navy blue
CodePudding user response:
One can also use tidytext
to with unnest
library(dplyr)
library(tidytext)
color_df <- tibble(color= rep("blue", times = 3),
color_shades = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"))
color_shades_agg <- color_df %>%
unnest_tokens(word, color_shades, token = 'regex', pattern=", ") %>%
group_by(color) %>%
distinct() %>%
summarise(color_shades = paste0(sort(word), collapse = ", "))