I'm trying to extract specified strings from one column of my data frame, along with the corresponding row name. The names matched by each string need to go into their own data frame so I can then use the data in an UpSet plot to show how much overlap there is between the strings.
I have some working code, but the problem is that it's very long and clunky. I want to compact the code. I thought perhaps a for loop could work, or maybe case_when?
Hopefully someone can help!
Dummy data:
name <- rep(c("rs123", "rs124", "rs125", "rs126", "rs127", "rs128", "rs129", "rs130"), length.out = 100)
source <- rep("dbSNP", 100)
chromosome <- rep("17", 100)
evidence <- rep(c("TOPMed", "Frequency", "Cited", "Frequency, Cited", "Disease", "ESP", "TOPMed", "GWAS", "1000Genome"), length.out = 100)
Biomart <- data.frame(name, source, chromosome, evidence)
Current code:
library(stringr)
# one vector of matching names per evidence string
`1000Genome` <- Biomart[str_detect(Biomart$evidence, "1000Genome"),]$name
freq <- Biomart[str_detect(Biomart$evidence, "Frequency"),]$name
TOPMed <- Biomart[str_detect(Biomart$evidence, "TOPMed"),]$name
gnomAD <- Biomart[str_detect(Biomart$evidence, "gnomAD"),]$name
Cited <- Biomart[str_detect(Biomart$evidence, "Cited"),]$name
ESP <- Biomart[str_detect(Biomart$evidence, "ESP"),]$name
ExAC <- Biomart[str_detect(Biomart$evidence, "ExAC"),]$name
Phenotype_or_Disease <- Biomart[str_detect(Biomart$evidence, "Phenotype_or_Disease"),]$name
CodePudding user response:
You can simplify the code by putting the search strings in a vector and then using sapply to apply the matching function str_detect to each of them. This creates a logical matrix with one column per search string. Finally, use each column of the matrix to index the names you want.
I have left the matched strings as apply returns them, in a named list. Keeping them in one list is better than having many separate vectors in the .GlobalEnv. The vectors of matched strings will in general not have the same length, so they cannot be combined into a data.frame without further processing.
name <- rep(c("rs123", "rs124", "rs125", "rs126", "rs127", "rs128", "rs129", "rs130"), 100)[1:100]
source <- rep(c("dbSNP", "dbSNP", "dbSNP", "dbSNP", "dbSNP", "dbSNP", "dbSNP", "dbSNP"), 100)[1:100]
chromosome <- rep(c("17", "17", "17", "17", "17", "17", "17", "17"), 100)[1:100]
evidence <- rep(c("TOPMed", "Frequency", "Cited", "Frequency, Cited", "Disease", "ESP", "TOPMed", "GWAS", "1000Genome"), 100)[1:100]
Biomart <- data.frame(name, source, chromosome, evidence)
srch <- c("1000Genome", "Frequency", "TOPMed", "gnomAD", "Cited", "ESP", "ExAC")
inx <- sapply(srch, \(x) stringr::str_detect(Biomart$evidence, x))
apply(inx, 2, \(i) Biomart$name[i])
#> $`1000Genome`
#> [1] "rs123" "rs124" "rs125" "rs126" "rs127" "rs128" "rs129" "rs130" "rs123"
#> [10] "rs124" "rs125"
#>
#> $Frequency
#> [1] "rs124" "rs126" "rs125" "rs127" "rs126" "rs128" "rs127" "rs129" "rs128"
#> [10] "rs130" "rs129" "rs123" "rs130" "rs124" "rs123" "rs125" "rs124" "rs126"
#> [19] "rs125" "rs127" "rs126" "rs128"
#>
#> $TOPMed
#> [1] "rs123" "rs129" "rs124" "rs130" "rs125" "rs123" "rs126" "rs124" "rs127"
#> [10] "rs125" "rs128" "rs126" "rs129" "rs127" "rs130" "rs128" "rs123" "rs129"
#> [19] "rs124" "rs130" "rs125" "rs123" "rs126"
#>
#> $gnomAD
#> character(0)
#>
#> $Cited
#> [1] "rs125" "rs126" "rs126" "rs127" "rs127" "rs128" "rs128" "rs129" "rs129"
#> [10] "rs130" "rs130" "rs123" "rs123" "rs124" "rs124" "rs125" "rs125" "rs126"
#> [19] "rs126" "rs127" "rs127" "rs128"
#>
#> $ESP
#> [1] "rs128" "rs129" "rs130" "rs123" "rs124" "rs125" "rs126" "rs127" "rs128"
#> [10] "rs129" "rs130"
#>
#> $ExAC
#> character(0)
Created on 2022-11-12 with reprex v2.0.2
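As for the "further processing" mentioned above, here is a minimal sketch of one way to turn the named list into a single data.frame: a 0/1 membership table with one row per name and one column per search string. It is only an illustration under some assumptions of mine: I swapped in base R's grepl with fixed = TRUE in place of stringr::str_detect so the block has no package dependencies, I keep each name only once per search string, and the object names matched, all_names and membership are made up for this example.

```r
# Dummy data, same values as in the answer above
name <- rep(c("rs123", "rs124", "rs125", "rs126", "rs127", "rs128", "rs129", "rs130"), length.out = 100)
evidence <- rep(c("TOPMed", "Frequency", "Cited", "Frequency, Cited", "Disease", "ESP", "TOPMed", "GWAS", "1000Genome"), length.out = 100)
Biomart <- data.frame(name, evidence)

srch <- c("1000Genome", "Frequency", "TOPMed", "gnomAD", "Cited", "ESP", "ExAC")

# simplify = FALSE makes sapply always return a named list,
# even if all result vectors happen to have the same length
matched <- sapply(
  srch,
  function(p) unique(Biomart$name[grepl(p, Biomart$evidence, fixed = TRUE)]),
  simplify = FALSE
)

# Every name that matched at least one search string
all_names <- sort(unique(unlist(matched)))

# One 0/1 column per search string: 1 if the name matched it, 0 otherwise
membership <- data.frame(
  name = all_names,
  lapply(matched, function(v) as.integer(all_names %in% v)),
  check.names = FALSE  # keep non-syntactic column names such as "1000Genome"
)

membership
```

With the dummy data the gnomAD and ExAC columns are all zero, because those strings never occur in evidence. As far as I know, a table of 0/1 columns like this is the shape that UpSet plotting functions such as UpSetR::upset take as input, and UpSetR::fromList builds something similar directly from a named list, but please check the package documentation before relying on that.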