Remove duplicates within the string

I want to remove duplicates within each string. For example, "Predictive Modeling" appears twice in the first row. After the duplicates are removed, the string should not be left with any extra commas.

mydf <- data.frame(Keyword = c("Predictive Modeling, R, Python, Predictive Modeling, SQL, Tableau, data analysis", "SQL, Tableau, data analysis, data analysis", "Predictive Modeling, Python, SQL, visualization, Spark, Tableau"))

Desired Output

mydf <- data.frame(Keyword = c("Predictive Modeling, R, Python, SQL, Tableau, data analysis", "SQL, Tableau, data analysis", "Predictive Modeling, Python, SQL, visualization, Spark, Tableau"))

CodePudding user response：

Here is a base R approach using strsplit:


mydf <- data.frame(Keyword = c("Predictive Modeling, R, Python, Predictive Modeling, SQL, Tableau, data analysis", "SQL, Tableau, data analysis, data analysis"))

mydf$Keyword <- unlist(
  lapply(strsplit(mydf$Keyword, ", "),
         function(x) paste(unique(x),
                           collapse = ", "))
  )

mydf
#>                                                       Keyword
#> 1 Predictive Modeling, R, Python, SQL, Tableau, data analysis
#> 2                                 SQL, Tableau, data analysis

Created on 2022-03-27 by the reprex package (v0.3.0)
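
The split pattern ", " assumes exactly one space after every comma. If the real data has irregular spacing around the commas, a more tolerant variant might look like the sketch below. The helper name dedupe_terms is hypothetical and not part of the answer above.

```r
# Hypothetical helper: remove repeated terms from comma-separated strings,
# keeping the first occurrence of each term.
dedupe_terms <- function(x) {
  # Drop outer whitespace, then split on commas with any surrounding whitespace
  parts <- strsplit(trimws(x), "\\s*,\\s*")
  # Rebuild each string from its unique terms, joined with ", "
  vapply(parts, function(p) paste(unique(p), collapse = ", "), character(1))
}

dedupe_terms(c("SQL,Tableau ,  data analysis, data analysis", "R, R, Python"))
#> [1] "SQL, Tableau, data analysis" "R, Python"
```

vapply is used here instead of unlist(lapply(...)) so that the result is always a character vector with one element per input string.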

CodePudding user response：

Here is a one-liner using toString.

transform(mydf, Keyword=sapply(strsplit(Keyword, ', '), \(x) toString(unique(x))))
#                                                           Keyword
# 1     Predictive Modeling, R, Python, SQL, Tableau, data analysis
# 2                                     SQL, Tableau, data analysis
# 3 Predictive Modeling, Python, SQL, visualization, Spark, Tableau
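
For a character vector, toString(x) gives the same result as paste(x, collapse = ", "), which is why it can stand in for the paste call used in the first answer. The \(x) shorthand needs R 4.1.0 or later; on older versions, function(x) does the same job. A quick check on a made-up vector:

```r
x <- c("SQL", "Tableau", "data analysis", "data analysis")

# toString() collapses a vector using ", " as the separator
toString(unique(x))
#> [1] "SQL, Tableau, data analysis"

# Same result as the explicit paste() call
identical(toString(unique(x)), paste(unique(x), collapse = ", "))
#> [1] TRUE
```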

CodePudding user response：

Here is a regex-based approach. We can replace any CSV term that also appears later in the string with an empty string.

mydf <- data.frame(Keyword = c("Predictive Modeling, R, Python, Predictive Modeling, SQL, Tableau, data analysis", "SQL, Tableau, data analysis, data analysis"))
mydf$Keyword <- gsub("\\s*([^,]+),?(?=.*\\1(?:,|$))", "",
                     mydf$Keyword, perl=TRUE)
mydf

1  R, Python, Predictive Modeling, SQL, Tableau, data analyis
2                                  SQL, Tableau, data analyis

Note that this approach retains the last instance of a CSV term, but maybe this is acceptable for your requirements.
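
One more caveat: the pattern is not anchored to term boundaries, so it can also match a fragment of a term. That is why the output above reads "data analyis": the first "s" of "analysis" is removed because another "s" ends the string. A possible variant that only matches whole terms is sketched below. It assumes the separator is always exactly ", " (a comma followed by one space) and, like the pattern above, it keeps the last instance of each term.

```r
x <- c("Predictive Modeling, R, Python, Predictive Modeling, SQL, Tableau, data analysis",
       "SQL, Tableau, data analysis, data analysis")

# (?:^|(?<=, ))             the match must begin at the start of a term
# ([^,]+),<space>           a whole term plus its separator
# (?=(?:.*, )?\\1(?:,|$))   the same whole term occurs again later
pattern <- "(?:^|(?<=, ))([^,]+), (?=(?:.*, )?\\1(?:,|$))"

gsub(pattern, "", x, perl = TRUE)
#> [1] "R, Python, Predictive Modeling, SQL, Tableau, data analysis"
#> [2] "SQL, Tableau, data analysis"
```

Because the separator is removed together with the term, no leading space or stray comma is left behind.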

CodePudding user response：

Using tidyverse

library(dplyr)
library(tidyr)
mydf %>% 
  mutate(rn = row_number()) %>% 
  separate_rows(Keyword, sep = ",\\s+") %>% 
  distinct() %>% 
  group_by(rn) %>% 
  summarise(Keyword = toString(Keyword), .groups = "drop") %>%
  select(-rn)

Output:

# A tibble: 3 × 1
  Keyword                                                        
  <chr>                                                          
1 Predictive Modeling, R, Python, SQL, Tableau, data analysis    
2 SQL, Tableau, data analysis                                    
3 Predictive Modeling, Python, SQL, visualization, Spark, Tableau