How can I write str_detect more efficiently?

I'm trying to extract specified strings from one column of my dataframe, along with the name in each matching row. The names matched by each string need to go into their own object so I can use the data in an upset plot to show how much overlap there is between the strings.

I have some working code, but it's very long and clunky. I want to compact it. Could a for loop work, or maybe case_when?

Hopefully someone can help!

Dummy data:

name <- rep(c("rs123", "rs124", "rs125", "rs126", "rs127", "rs128", "rs129", "rs130"), length.out = 100)
source <- rep(c("dbSNP", "dbSNP", "dbSNP", "dbSNP", "dbSNP", "dbSNP", "dbSNP", "dbSNP"), length.out = 100)
chromosome <- rep(c("17", "17", "17", "17", "17", "17", "17", "17"), length.out = 100)
evidence <- rep(c("TOPMed", "Frequency", "Cited", "Frequency, Cited", "Disease", "ESP", "TOPMed", "GWAS", "1000Genome"), length.out = 100)

Biomart <- data.frame(name, source, chromosome, evidence)

Current code:

library(stringr)

'1000Genome' <- Biomart[str_detect(Biomart$evidence,"1000Genomes"),]$name

'freq' <- Biomart[str_detect(Biomart$evidence,"Frequency"),]$name

'TOPMed' <-  Biomart[str_detect(Biomart$evidence,"TOPMed"),]$name

'gnomAD' <- Biomart[str_detect(Biomart$evidence, "gnomAD"),]$name

'Cited' <- Biomart[str_detect(Biomart$evidence, "Cited"),]$name

'ESP' <- Biomart[str_detect(Biomart$evidence, "ESP"),]$name

'ExAC' <- Biomart[str_detect(Biomart$evidence, "ExAC"),]$name

'Phenotype_or_Disease' <- Biomart[str_detect(Biomart$evidence, "Phenotype_or_Disease"),]$name

Answer:

You can simplify the code by putting the search strings in a vector and using sapply to run str_detect once per string. This creates a logical matrix with one row per record and one column per search string. Finally, use each column of that matrix to pick out the names you want.

I have left the matched names as apply returns them here, in a named list. Keeping them in one list is better than having many separate vectors in the .GlobalEnv. The matched vectors will generally not be the same length, so they cannot be combined into a data.frame without further processing.
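One caveat about the list: apply only returns a list when the matched vectors differ in length. If every search string happened to match the same number of rows, it would simplify the result to a vector or matrix. The sketch below is one way to always get a named list, under some assumptions: the `toy` data frame is hypothetical, and base R's `grepl(fixed = TRUE)` stands in for `str_detect` so the block runs without extra packages.

```r
# Hypothetical toy data: every pattern matches exactly one row,
# so all matched vectors have the same length.
toy <- data.frame(
  name     = c("rs1", "rs2", "rs3"),
  evidence = c("TOPMed", "Cited", "ESP")
)
srch <- c("TOPMed", "Cited", "ESP")

# Logical matrix: one row per record, one column per search string.
# fixed = TRUE treats each search string literally, not as a regex.
inx <- sapply(srch, function(p) grepl(p, toy$evidence, fixed = TRUE))

# apply() simplifies equal-length results, so this is a named
# character vector here, not a list.
simplified <- apply(inx, 2, function(i) toy$name[i])

# lapply() never simplifies; setNames() labels each element by its pattern.
matches <- lapply(setNames(srch, srch), function(p) {
  toy$name[grepl(p, toy$evidence, fixed = TRUE)]
})
```

With stringr loaded, `str_detect(toy$evidence, fixed(p))` should be a drop-in replacement for the `grepl` call.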

name <- rep(c("rs123", "rs124", "rs125", "rs126", "rs127", "rs128", "rs129", "rs130"), 100)[1:100]
source <- rep(c("dbSNP", "dbSNP", "dbSNP", "dbSNP", "dbSNP", "dbSNP", "dbSNP", "dbSNP"), 100)[1:100]
chromosome <- rep(c("17", "17", "17", "17", "17", "17", "17", "17"), 100)[1:100]
evidence <- rep(c("TOPMed", "Frequency", "Cited", "Frequency, Cited", "Disease", "ESP", "TOPMed", "GWAS", "1000Genome"), 100)[1:100]

Biomart <- data.frame(name, source, chromosome, evidence)

srch <- c("1000Genome", "Frequency", "TOPMed", "gnomAD", "Cited", "ESP", "ExAC")

inx <- sapply(srch, \(x) stringr::str_detect(Biomart$evidence, x))
apply(inx, 2, \(i) Biomart$name[i])
#> $`1000Genome`
#>  [1] "rs123" "rs124" "rs125" "rs126" "rs127" "rs128" "rs129" "rs130" "rs123"
#> [10] "rs124" "rs125"
#> 
#> $Frequency
#>  [1] "rs124" "rs126" "rs125" "rs127" "rs126" "rs128" "rs127" "rs129" "rs128"
#> [10] "rs130" "rs129" "rs123" "rs130" "rs124" "rs123" "rs125" "rs124" "rs126"
#> [19] "rs125" "rs127" "rs126" "rs128"
#> 
#> $TOPMed
#>  [1] "rs123" "rs129" "rs124" "rs130" "rs125" "rs123" "rs126" "rs124" "rs127"
#> [10] "rs125" "rs128" "rs126" "rs129" "rs127" "rs130" "rs128" "rs123" "rs129"
#> [19] "rs124" "rs130" "rs125" "rs123" "rs126"
#> 
#> $gnomAD
#> character(0)
#> 
#> $Cited
#>  [1] "rs125" "rs126" "rs126" "rs127" "rs127" "rs128" "rs128" "rs129" "rs129"
#> [10] "rs130" "rs130" "rs123" "rs123" "rs124" "rs124" "rs125" "rs125" "rs126"
#> [19] "rs126" "rs127" "rs127" "rs128"
#> 
#> $ESP
#>  [1] "rs128" "rs129" "rs130" "rs123" "rs124" "rs125" "rs126" "rs127" "rs128"
#> [10] "rs129" "rs130"
#> 
#> $ExAC
#> character(0)

Created on 2022-11-12 with reprex v2.0.2
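As for the "further processing" needed to get a data.frame: one possible route is to turn the logical matrix into a 0/1 membership table with one row per unique name and one column per search string. The sketch below assumes that a name counts as a member of a set if any of its rows matched. It uses base R's `grepl(fixed = TRUE)` in place of `str_detect`, and the names `counts` and `membership` are made up for this illustration.

```r
# Dummy data with the same name and evidence columns as the reprex above.
name <- rep(c("rs123", "rs124", "rs125", "rs126", "rs127", "rs128", "rs129", "rs130"),
            length.out = 100)
evidence <- rep(c("TOPMed", "Frequency", "Cited", "Frequency, Cited", "Disease",
                  "ESP", "TOPMed", "GWAS", "1000Genome"), length.out = 100)
Biomart <- data.frame(name, evidence)

srch <- c("1000Genome", "Frequency", "TOPMed", "gnomAD", "Cited", "ESP", "ExAC")

# Logical matrix: rows are records, columns are search strings.
inx <- sapply(srch, function(p) grepl(p, Biomart$evidence, fixed = TRUE))

# Count matches per unique name (inx + 0L converts TRUE/FALSE to 1/0).
counts <- rowsum(inx + 0L, group = Biomart$name)

# 0/1 membership table; check.names = FALSE keeps "1000Genome" from
# being renamed to "X1000Genome".
membership <- data.frame(
  name = rownames(counts),
  (counts > 0) + 0L,
  check.names = FALSE
)
```

A table of this shape, with one binary column per set, is the kind of input that upset-plot packages such as UpSetR generally work from. The plotting call is left out here because it depends on your setup.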