How to remove certain characters that don't belong to any characters of another character vector

Time:02-20

I have an uncleaned character vector, and I want to remove every item in it that does not appear in another character vector. So basically I know what I want to keep, but I don't know exactly what to remove, which makes gsub() and str_replace_all() hard to use.

The character string I want to clean is issue_uncleaned, and it looks like this (not the complete version):

[1] "Facebook Fact-checks; Coronavirus; TikTok posts "                                            
[2] "Facebook Fact-checks; Facebook posts "                                                       
[3] "Facebook Fact-checks; Coronavirus; Bloggers "                                                
[4] "Facebook Fact-checks; Facebook posts "                                                       
[5] "National; Criminal Justice; Crime; Facebook Fact-checks; Facebook posts "    

The character string I want to use as a filter to remove unwanted characters is 151_issues, and it looks like this (not the complete version):

[1] "Facebook Fact-checks"         "Coronavirus"       "Crime"

My desired results (if there is also a way to remove the ; at the beginning or at the end, that would be even better):

[1] "Facebook Fact-checks; Coronavirus;  "                                            
[2] "Facebook Fact-checks;  "                                                       
[3] "Facebook Fact-checks; Coronavirus;  "                                                
[4] "Facebook Fact-checks;  "                                                       
[5] "; ; Crime; Facebook Fact-checks;  "  

Many thanks for your help!

CodePudding user response:

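Use strsplit() to split each string into its items, intersect() to keep only the items found in the filter vector, and paste() to join them again.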

sapply(lapply(strsplit(v, '; '), intersect, issues), paste, collapse='; ')
# [1] "Facebook Fact-checks; Coronavirus" "Facebook Fact-checks"             
# [3] "Facebook Fact-checks; Coronavirus" "Facebook Fact-checks"             
# [5] "Facebook Fact-checks"      

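Data (this issues vector does not contain "Crime", which is why element 5 above keeps only "Facebook Fact-checks"):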

v <- c("Facebook Fact-checks; Coronavirus; TikTok posts", "Facebook Fact-checks; Facebook posts", 
"Facebook Fact-checks; Coronavirus; Bloggers", "Facebook Fact-checks; Facebook posts", 
"National; Criminal Justice; Crime; Facebook Fact-checks; Facebook posts"
)
issues <- c("Facebook Fact-checks", "After the Fact", "Animals", "Bankruptcy", 
"Border Security", "Ad Watch", "Agriculture", "Ask PolitiFact", 
"Baseball", "Bush Administration", "Afghanistan", "Alcohol", 
"Autism", "Bipartisanship", "Coronavirus")
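
The one-liner above can be unpacked into separate steps, which may make it easier to see what each call contributes. This is only a sketch: it assumes the input still has the trailing spaces shown in the question, and it uses the name issues_151 for the filter vector (an assumption, since 151_issues is not a syntactically valid R name) with just the three values the question lists.

```r
# Input as shown in the question, including the trailing spaces
issue_uncleaned <- c(
  "Facebook Fact-checks; Coronavirus; TikTok posts ",
  "National; Criminal Justice; Crime; Facebook Fact-checks; Facebook posts "
)
# Assumed filter vector: only the three values visible in the question
issues_151 <- c("Facebook Fact-checks", "Coronavirus", "Crime")

# Step 1: split each string on ";" into a list of character vectors
parts <- strsplit(issue_uncleaned, ";", fixed = TRUE)
# Step 2: strip leading and trailing whitespace from every piece
parts <- lapply(parts, trimws)
# Step 3: keep only the pieces that appear in the filter
# (intersect() also drops duplicate pieces)
kept <- lapply(parts, intersect, issues_151)
# Step 4: join the surviving pieces back into one string per element
issue_cleaned <- vapply(kept, paste, character(1), collapse = "; ")
print(issue_cleaned)
```

Because the unwanted pieces are dropped before pasting, no stray ";" is left at the beginning or end of the result.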

CodePudding user response:

issue_uncleaned <- c("Facebook Fact-checks; Coronavirus; TikTok posts ", "Facebook Fact-checks; Facebook posts ", "Facebook Fact-checks; Coronavirus; Bloggers ", "Facebook Fact-checks; Facebook posts ", "National; Criminal Justice; Crime; Facebook Fact-checks; Facebook posts ")
issues_151 <- c("Facebook Fact-checks", "Coronavirus", "Crime")
k <- strsplit(issue_uncleaned, "; ")
k <- lapply(k, trimws) # remove leading and trailing whitespace from each piece
# lapply() always returns a list; sapply() here could simplify to a matrix when every element keeps the same number of pieces
k2 <- lapply(k, function(x) x[x %in% issues_151])
issue_cleaned <- sapply(k2, paste, collapse = "; ")
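
If the same cleaning is needed in several places, the steps could be wrapped in a small function. The sketch below is one possible way to do that; keep_known_issues() is a hypothetical name, not part of base R or of the answers above. It uses %in% rather than intersect(), so repeated pieces are kept, and vapply() so the result is always a character vector, with "" for elements where nothing matches.

```r
# Hypothetical helper (name and arguments are illustrative assumptions)
keep_known_issues <- function(x, keep, sep = ";") {
  # Split on the separator and trim whitespace around every piece
  parts <- lapply(strsplit(x, sep, fixed = TRUE), trimws)
  # Keep matching pieces and rejoin; an element with no match becomes ""
  vapply(
    parts,
    function(p) paste(p[p %in% keep], collapse = paste0(sep, " ")),
    character(1)
  )
}

cleaned <- keep_known_issues(
  c("Facebook Fact-checks; Facebook posts ", "Bloggers; TikTok posts "),
  keep = c("Facebook Fact-checks", "Coronavirus", "Crime")
)
print(cleaned)
```

The second input has no item in the filter, so its result is an empty string rather than a dropped element, which keeps the output the same length as the input.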