Merge rows when same word appears in another column-CodePudding

I want to merge rows together that have the same word in another column. The solution should be in R Base. The table entries are all comma separated strings (characters), not lists. So as shown below the color shades for the same color should be added in a string in one row, instead of spanning over several rows. Also there should be no duplicates in the color shades column.

I have already tried with:

   aggregate(df["Color shades"], df["Color"], paste, collapse=", ")

As well as with:

   aggregate(Color shades ~ Color ,df ,toString)

But that didn't lead to the desired result.

Dataframe:

    df <- data.frame(colorshades = c("turquoise, babyblue", "royal blue, true blue", 
                         "navy blue, true blue"), colors = c("blue", "blue", "blue"))

Currently:

Color shades	Color
turquoise, babyblue	blue
royal blue, true blue	blue
navy blue, true blue	blue

Desired Output:

Color shades	Color
turquoise, babyblue, royal blue, true blue, navy blue	blue

CodePudding user response：

Convert "Color shades" to a list-column:

lapply(strsplit(df[["Color shades"]], ","), trimws)
# [[1]]
# [1] "turquoise" "babyblue" 
# [[2]]
# [1] "royal blue" "true blue" 
# [[3]]
# [1] "navy blue" "true blue"
df[["Color shades"]] <- lapply(strsplit(df[["Color shades"]], ","), trimws)
df
#            Color shades Color
# 1   turquoise, babyblue  blue
# 2 royal blue, true blue  blue
# 3  navy blue, true blue  blue

Aggregate with unique:

aggregate(df["Color shades"], df["Color"], function(z) paste(unique(unlist(z)), collapse=", "))
#   Color                                          Color shades
# 1  blue turquoise, babyblue, royal blue, true blue, navy blue

or, keeping with the list-column method,

aggregate(df["Color shades"], df["Color"], function(z) list(unique(unlist(z))))
#   Color                                          Color shades
# 1  blue turquoise, babyblue, royal blue, true blue, navy blue
str(aggregate(df["Color shades"], df["Color"], function(z) list(unique(unlist(z)))))
# 'data.frame': 1 obs. of  2 variables:
#  $ Color       : chr "blue"
#  $ Color shades:List of 1
#   ..$ : chr  "turquoise" "babyblue" "royal blue" "true blue" ...

There are often (but not always) advantages to dealing with list-columns vice comma-separated values. If your use-case is such that you frequently want to look at individual elements within one of these fields, you'll find yourself dealing deeply with regexes and/or repeatedly using strsplit to separate. With list-columns, one can use tools like unique and %in% with abandon (though admittedly one should really become more comfortable with lapply/sapply, and many base-R tools for aggregation do not always work consistently with list-columns).

Data

df <- structure(list(`Color shades` = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"), Color = c("blue", "blue", "blue")), class = "data.frame", row.names = c(NA, -3L))

CodePudding user response：

if you can use the library "dplyr" you can also do it this way :

library(dplyr)

df <- data.frame("Colorshade" = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"),
             "Color" = c(rep("blue", 3)),
             stringsAsFactors = FALSE)

my_df <- df %>% group_by(Color) %>% mutate(Colorshade = paste(unique(sort(str_split(string = paste(df$Colorshade, collapse = ", "), pattern = ", ", simplify = TRUE))), collapse = ", ")) %>% first()

CodePudding user response：

data.table solution

library(data.table)
setDT(df)[, .(Color_shades = paste0(unique(unlist(strsplit(colorshades, ", "))), 
                                    collapse = ", ")), 
          by = .(colors)]
#    colors                                          Color_shades
# 1:   blue turquoise, babyblue, royal blue, true blue, navy blue

CodePudding user response：

One can also use tidytext to with unnest

library(dplyr)
library(tidytext)

color_df <- tibble(color= rep("blue", times = 3),
                       color_shades = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"))

color_shades_agg <- color_df %>% 
  unnest_tokens(word, color_shades, token = 'regex', pattern=", ") %>% 
  group_by(color) %>%
  distinct() %>% 
  summarise(color_shades = paste0(sort(word), collapse = ", "))