Morning StackOverflow,
I am creating a function that searches through a single column, ColumnOfDatasetToSearch, of a matrix Dataset for a number of search terms, SearchFeatures. It works well for matrices that have 10^4 rows, but it slows down badly when the row count goes above 10^6 or when SearchFeatures has more than 100 terms. I thought that vectorizing ColumnOfDatasetToSearch would improve the speed, but it gave only a modest performance improvement.
library(dplyr)  # provides pull()
ListSearcher <- function(SearchFeatures, Dataset, ColumnOfDatasetToSearch){
  RowNumber <- NA
  ColumnOfInterest <- pull(Dataset, ColumnOfDatasetToSearch)
  LengthOfSearchTerms <- length(SearchFeatures)
  # One full pass over the column per search term
  for (j in 1:LengthOfSearchTerms){
    if(length(i <- grep(SearchFeatures[j], ColumnOfInterest)))
      RowNumber <- append(RowNumber, i)
  }
  IdentifiersWithThoseSearchTerms <- unique(na.omit((Dataset$Identifiers[RowNumber])))
  return(IdentifiersWithThoseSearchTerms)
}
Thanks in advance for your suggestions.
NewToCoding
Answer:
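Imagine you are using the dataset iris and want to return the column Petal.Length instead of Identifiers.
Does this work? It should be considerably faster, because all search terms are joined into a single regular expression with "|" and the column is scanned once with grepl instead of once per term.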
ListSearcher <- function(SearchFeatures, Dataset, ColumnOfDatasetToSearch){
  # Build one alternation pattern, e.g. "ver|se"
  searchstring <- paste0(SearchFeatures, collapse = "|")
  # Logical vector: TRUE for rows where any term matches
  selection <- grepl(searchstring, Dataset[[ColumnOfDatasetToSearch]])
  Dataset[selection, ]$Petal.Length
}
# try with a subset of iris
iris[c(1,2,51,52), ]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 51 7.0 3.2 4.7 1.4 versicolor
#> 52 6.4 3.2 4.5 1.5 versicolor
ListSearcher(c("ver", "se") , iris[c(1,2,51,52), ], "Species")
#> [1] 1.4 1.4 4.7 4.5
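To get back to the behaviour of the original function (unique, non-missing values from a column such as Identifiers), the same idea could be sketched as below. ListSearcher2 and its ReturnColumn argument are hypothetical names added for illustration, not part of the question or the answer above. Note that the search terms are still treated as regular expressions, as they were with grep, so a term containing characters like "." or "(" would need escaping.

```r
# Sketch only: ListSearcher2 and ReturnColumn are assumed names.
ListSearcher2 <- function(SearchFeatures, Dataset, ColumnOfDatasetToSearch,
                          ReturnColumn = "Identifiers") {
  # Join all terms into one alternation so the column is scanned once.
  searchstring <- paste(SearchFeatures, collapse = "|")
  selection <- grepl(searchstring, Dataset[[ColumnOfDatasetToSearch]])
  # Pull the values to return, then drop missing values and duplicates,
  # mirroring unique(na.omit(...)) in the original function.
  out <- Dataset[[ReturnColumn]][selection]
  unique(out[!is.na(out)])
}

# Same iris subset as above, returning Petal.Length.
result <- ListSearcher2(c("ver", "se"), iris[c(1, 2, 51, 52), ],
                        "Species", "Petal.Length")
print(result)
```

With your own data you would leave ReturnColumn at its default, assuming Dataset really has a column called Identifiers.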