Determine which elements of a vector partially match a second vector, and which elements don't

Time:09-25

I have a vector A, which contains a list of genera, which I want to use to subset a second vector, B. I have successfully used grepl to extract anything from B that has a partial match to the genera in A. Below is a reproducible example of what I have done.

But now I would like to get a list of which genera in A matched something in B, and which genera did not. That is, the "matched" list would contain Cortinarius and Russula, and the "unmatched" list would contain Laccaria and Inocybe. Any ideas on how to do this? In reality my vectors are very long, and the genus names in B are not all in the same position among the other info.

# create some dummy vectors
A <- c("Cortinarius","Laccaria","Inocybe","Russula")
B <- c("fafsdf_Cortinarius_sdfsdf","sdfsdf_Russula_sdfsdf_fdf","Tomentella_sdfsdf","sdfas_Sebacina","sdfsf_Clavulina_sdfdsf")

# extract the elements of B that have a partial match to anything in A.
new.B <- B[grepl(paste(A,collapse="|"), B)]

# But now how do I tell which elements of A were present in B, and which ones were not?

CodePudding user response:

We could use lapply (or sapply) to loop over the patterns in A and collect the matching elements of B in a named list:

out <- setNames(lapply(A, function(x) grep(x, B, value = TRUE)), A)

Then it is easy to see which patterns returned matches and which returned empty elements:

> out[lengths(out) > 0]
$Cortinarius
[1] "fafsdf_Cortinarius_sdfsdf"

$Russula
[1] "sdfsdf_Russula_sdfsdf_fdf"

> out[lengths(out) == 0]
$Laccaria
character(0)

$Inocybe
character(0)

and then get the names of each subset:

> names(out[lengths(out) > 0])
[1] "Cortinarius" "Russula"    
> names(out[lengths(out) == 0])
[1] "Laccaria" "Inocybe" 
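
If only the two name lists are needed, and not the matching elements of B themselves, the same loop can be condensed into one logical vector. This is a minimal sketch rather than part of the answer above: `has_match`, `matched` and `unmatched` are names chosen here for illustration, and `fixed = TRUE` is an added assumption that each genus should be treated as a literal string rather than a regular expression.

```r
A <- c("Cortinarius", "Laccaria", "Inocybe", "Russula")
B <- c("fafsdf_Cortinarius_sdfsdf", "sdfsdf_Russula_sdfsdf_fdf",
       "Tomentella_sdfsdf", "sdfas_Sebacina", "sdfsf_Clavulina_sdfdsf")

# One TRUE/FALSE per genus in A: does it occur inside any element of B?
# fixed = TRUE matches the genus literally (assumption: no regex is intended).
has_match <- vapply(A, function(x) any(grepl(x, B, fixed = TRUE)), logical(1))

matched   <- A[has_match]   # genera found somewhere in B
unmatched <- A[!has_match]  # genera not found anywhere in B
```

`vapply` is used instead of `sapply` so that the result is guaranteed to be a logical vector with one entry per element of A, even if A is empty.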

CodePudding user response:

You can use sapply with grepl to check each value of A against every value of B. The result is a logical matrix with one column per element of A and one row per element of B.

sapply(A, grepl, B)

#     Cortinarius Laccaria Inocybe Russula
#[1,]        TRUE    FALSE   FALSE   FALSE
#[2,]       FALSE    FALSE   FALSE    TRUE
#[3,]       FALSE    FALSE   FALSE   FALSE
#[4,]       FALSE    FALSE   FALSE   FALSE
#[5,]       FALSE    FALSE   FALSE   FALSE

You can take the column-wise sum of these values to get the number of matches for each element of A.

result <- colSums(sapply(A, grepl, B))
result

#Cortinarius    Laccaria     Inocybe     Russula 
#          1           0           0           1 

#values with at least one match
names(Filter(function(x) x > 0, result))
#[1] "Cortinarius" "Russula" 

#values with no match
names(Filter(function(x) x == 0, result))
#[1] "Laccaria" "Inocybe" 
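
Because grepl does substring matching, a genus would also match a longer name that contains it. If the genus names in B are always separated from the other info by underscores, as in the dummy data (an assumption, since the real data may use other separators), an exact-token comparison is one possible alternative. The sketch below is not from the answers above, and `tokens`, `matched` and `unmatched` are illustrative names.

```r
A <- c("Cortinarius", "Laccaria", "Inocybe", "Russula")
B <- c("fafsdf_Cortinarius_sdfsdf", "sdfsdf_Russula_sdfsdf_fdf",
       "Tomentella_sdfsdf", "sdfas_Sebacina", "sdfsf_Clavulina_sdfdsf")

# Split every element of B on "_" (assumed delimiter) and pool the unique pieces.
tokens <- unique(unlist(strsplit(B, "_", fixed = TRUE)))

# Compare whole tokens, so the position of the genus within each string does not matter.
matched   <- A[A %in% tokens]
unmatched <- A[!(A %in% tokens)]
```

This splits B once instead of scanning it once per genus, which may help when both vectors are very long, but it only finds a genus that appears as a complete underscore-delimited piece.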