I want to remove duplicates within each string. For example, Predictive Modeling is duplicated in the first row. After the duplicates are removed, the string should not contain any extra commas.
mydf <- data.frame(Keyword = c("Predictive Modeling, R, Python, Predictive Modeling, SQL, Tableau, data analysis", "SQL, Tableau, data analysis, data analysis", "Predictive Modeling, Python, SQL, visualization, Spark, Tableau"))
Desired output:
mydf <- data.frame(Keyword = c("Predictive Modeling, R, Python, SQL, Tableau, data analysis", "SQL, Tableau, data analysis", "Predictive Modeling, Python, SQL, visualization, Spark, Tableau"))
CodePudding user response:
Here is a base R approach using strsplit:
mydf <- data.frame(Keyword = c("Predictive Modeling, R, Python, Predictive Modeling, SQL, Tableau, data analysis", "SQL, Tableau, data analysis, data analysis"))
mydf$Keyword <- unlist(
  lapply(strsplit(mydf$Keyword, ", "),
         function(x) paste(unique(x),
                           collapse = ", "))
)
mydf
#> Keyword
#> 1 Predictive Modeling, R, Python, SQL, Tableau, data analysis
#> 2 SQL, Tableau, data analysis
Created on 2022-03-27 by the reprex package (v0.3.0)
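Since the question asks that no extra commas remain, here is a minimal sketch of the same strsplit/unique idea wrapped in a helper that also tolerates irregular spacing and empty terms. The name dedupe_terms and the messy sample strings are hypothetical and not part of the answer above.

```r
# Hypothetical helper (not from the answer above): de-duplicate the
# comma-separated terms in each element of a character vector.
dedupe_terms <- function(x) {
  vapply(strsplit(as.character(x), ",", fixed = TRUE), function(terms) {
    terms <- trimws(terms)         # remove whitespace around each term
    terms <- terms[nzchar(terms)]  # drop empty terms left by stray commas
    paste(unique(terms), collapse = ", ")
  }, character(1))
}

# Made-up input with uneven spacing, a trailing comma and an empty term
messy <- c("SQL,Tableau ,  data analysis, data analysis,", "R, , R")
dedupe_terms(messy)
#> [1] "SQL, Tableau, data analysis" "R"
```

Splitting on a bare comma and trimming afterwards means the result does not depend on exactly one space following each comma. Missing values are not handled specially in this sketch.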
CodePudding user response:
Here is a one-liner using toString.
transform(mydf, Keyword=sapply(strsplit(Keyword, ', '), \(x) toString(unique(x))))
# Keyword
# 1 Predictive Modeling, R, Python, SQL, Tableau, data analysis
# 2 SQL, Tableau, data analysis
# 3 Predictive Modeling, Python, SQL, visualization, Spark, Tableau
CodePudding user response:
Here is a regex-based approach. We can replace any CSV term that also appears later in the string with an empty string.
mydf <- data.frame(Keyword = c("Predictive Modeling, R, Python, Predictive Modeling, SQL, Tableau, data analysis", "SQL, Tableau, data analysis, data analysis"))
mydf$Keyword <- gsub("\\s*([^,]+),?(?=.*\\1(?:,|$))", "",
                     mydf$Keyword, perl=TRUE)
mydf
1 R, Python, Predictive Modeling, SQL, Tableau, data analyis
2 SQL, Tableau, data analyis
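Note that this approach retains the last instance of a CSV term rather than the first, but maybe this is acceptable for your requirements. Also note that the pattern is not anchored to whole terms, so it can match part of a term: in the output above, the first "s" of "analysis" has been removed, apparently because "s" also ends the string.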
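If keeping the last occurrence is acceptable but terms should only be compared as whole terms, one possible non-regex sketch uses duplicated() with fromLast = TRUE. The helper name keep_last is hypothetical, and the sketch assumes the terms are separated by exactly ", ".

```r
# Hypothetical helper: keep only the last occurrence of each term.
keep_last <- function(x) {
  vapply(strsplit(x, ", ", fixed = TRUE), function(terms) {
    # duplicated(..., fromLast = TRUE) flags every occurrence except the last
    toString(terms[!duplicated(terms, fromLast = TRUE)])
  }, character(1))
}

keep_last(c(
  "Predictive Modeling, R, Python, Predictive Modeling, SQL, Tableau, data analysis",
  "SQL, Tableau, data analysis, data analysis"
))
#> [1] "R, Python, Predictive Modeling, SQL, Tableau, data analysis"
#> [2] "SQL, Tableau, data analysis"
```

Because each term is compared as a complete string, a partial overlap between terms does not trigger a removal.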
CodePudding user response:
Using tidyverse
library(dplyr)
library(tidyr)
mydf %>%
  mutate(rn = row_number()) %>%
  separate_rows(Keyword, sep = ",\\s+") %>%
  distinct() %>%
  group_by(rn) %>%
  summarise(Keyword = toString(Keyword), .groups = "drop") %>%
  select(-rn)
-output
# A tibble: 3 × 1
Keyword
<chr>
1 Predictive Modeling, R, Python, SQL, Tableau, data analysis
2 SQL, Tableau, data analysis
3 Predictive Modeling, Python, SQL, visualization, Spark, Tableau