I have a dataframe with multiple values per cell, and I want to find and return the values that are only in one column.
ID <- c("1","1","1","2","2","2","3","3","3")
locus <- c("A","B","C","A","B","C","A","B","C")
Acceptable <- c("K44 Z33 G49","K72 QR123","B92 N12 PQ16 G99","V3","L89 I203 UPA66 QF29"," ","K44 Z33 K72","B92 PQ14","J22 M43 VC78")
Unacceptable <- c("K44 Z33 G48","K72 QR123 B22","B92 N12 PQ16 G99","V3 N9 Q7","L89 I203 UPA66 QF29","B8","K44 Z33"," ","J22 M43 VC78")
df <- data.frame(ID,locus,Acceptable,Unacceptable)
I want to make another column, Unique_values, that returns all the unique values that are only present in Unacceptable, and that are not in Acceptable. So the output should be this.
I already have a poorly optimized method to find the duplicates between the two columns:
df$Duplicate_values <- do.call(paste, c(df[,c("Acceptable","Unacceptable")], sep=" "))
df$Duplicate_values = sapply(strsplit(df$Duplicate_values, ' '), function(i)paste(i[duplicated(i)]))
#this is for cleaning up the text field so that it looks like the other columns
df$Duplicate_values = gsub("[^0-9A-Za-z///' ]"," ",df$Duplicate_values)
df$Duplicate_values = gsub("character 0",NA,df$Duplicate_values)
df$Duplicate_values = gsub("^c ","",df$Duplicate_values)
df$Duplicate_values = gsub(" "," ",df$Duplicate_values)
df$Duplicate_values = trimws(df$Duplicate_values)
(if anyone knows a faster method to return these duplicates, please let me now!)
I cannot use this method to find the unique values however, because it would then also return the unique values of the Acceptable column, which I do not want.
Any suggestions?
CodePudding user response:
A similar approach using setdiff
:
lA <- strsplit(df$Acceptable, " ")
lU <- strsplit(df$Unacceptable, " ")
df$Unique_values <- lapply(1:nrow(df), function(i) paste0(setdiff(lU[[i]], lA[[i]]), collapse = " "))
df
#> ID locus Acceptable Unacceptable Unique_values
#> 1 1 A K44 Z33 G49 K44 Z33 G48 G48
#> 2 1 B K72 QR123 K72 QR123 B22 B22
#> 3 1 C B92 N12 PQ16 G99 B92 N12 PQ16 G99
#> 4 2 A V3 V3 N9 Q7 N9 Q7
#> 5 2 B L89 I203 UPA66 QF29 L89 I203 UPA66 QF29
#> 6 2 C B8 B8
#> 7 3 A K44 Z33 K72 K44 Z33
#> 8 3 B B92 PQ14
#> 9 3 C J22 M43 VC78 J22 M43 VC78