I've recently received a raft of data sets with multiple pieces of data in a single column and, as the title suggests, I'm trying to write a function that returns some of the later split elements. Hunting around, I've seen solutions for getting just the first element or just the last, but not for selecting which elements are returned. This looks like a persistent issue in these data sets, so a solution that I can abstract would be delightful.
Example: Ideally this function would return just the binomial names of these organisms, but I don't want it anchored to the back of the string, as sometimes there is more unneeded information after the names.
library(tidyverse)
foo <- data.frame(id = paste0("a", 1:6),
Organisms = c("EA - Enterobacter aerogenes", "EA - Enterobacter aerogenes",
"KP - Klebsiella pneumoniae", "ACBA - Acinetobacter baumannii",
"ENC - Enterobacter cloacae", "KP - Klebsiella pneumoniae"))
## just the first element (does not allow you to select 2 elements)
Orgsplit_abrev <- function(x){
sapply(str_split(x," "), getElement, 1)
}
foo %>%
summarise(Orgsplit_abrev(Organisms))
str_split(foo$Organisms, " ")[[1]][c(3,4)]
CodePudding user response:
We may use tail. As more than one element is returned per string, the result comes back as a list column:
Orgsplit_abrev <- function(x){
lapply(str_split(x," "), tail, 2)
}
Testing:
foo %>%
summarise(Orgsplit_abrev(Organisms))
Orgsplit_abrev(Organisms)
1 Enterobacter, aerogenes
2 Enterobacter, aerogenes
3 Klebsiella, pneumoniae
4 Acinetobacter, baumannii
5 Enterobacter, cloacae
6 Klebsiella, pneumoniae
Also, if we want to specify the index, create a lambda function
Orgsplit_abrev <- function(x){
lapply(str_split(x," "), function(x) x[c(3, 4)])
}
Or we may pass the extraction operator [ directly (see ?Extract):
Orgsplit_abrev <- function(x){
lapply(str_split(x," "),`[`, c(3, 4))
}
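Since the question asks for something that can be abstracted, the index can also be made an argument. The following is only a sketch: split_pick, idx and sep are hypothetical names that do not appear above, and the selected pieces are pasted back together so that each input string yields one output string rather than a list element.

```r
library(stringr)

# Hypothetical helper: split each string on `sep` and keep the pieces
# at the positions given in `idx`.
split_pick <- function(x, idx, sep = " ") {
  pieces <- str_split(x, fixed(sep))
  # Re-join the selected pieces so the result is a plain character vector.
  vapply(pieces, function(p) paste(p[idx], collapse = sep), character(1))
}

orgs <- c("EA - Enterobacter aerogenes", "ACBA - Acinetobacter baumannii")
result <- split_pick(orgs, c(3, 4))
```

Because the positions are counted from the front, trailing extras after the species name are ignored. Positions past the end of a string come back as NA, so very short entries would need checking first.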
CodePudding user response:
Why don't you split using the "-" delimiter?
> str_split(foo$Organisms, "-") %>% do.call('rbind', .)
[,1] [,2]
[1,] "EA " " Enterobacter aerogenes"
[2,] "EA " " Enterobacter aerogenes"
[3,] "KP " " Klebsiella pneumoniae"
[4,] "ACBA " " Acinetobacter baumannii"
[5,] "ENC " " Enterobacter cloacae"
[6,] "KP " " Klebsiella pneumoniae"
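The output above still carries the spaces around the hyphen. One possible way to avoid that, assuming the separator is always the three characters " - ", is to split on that whole string with str_split_fixed and cap the result at two pieces, so a hyphen later in the name is left alone. The variable names here are illustrative only.

```r
library(stringr)

orgs <- c("EA - Enterobacter aerogenes", "KP - Klebsiella pneumoniae")

# Split on the first " - " only; n = 2 limits the result to two columns.
parts <- str_split_fixed(orgs, fixed(" - "), n = 2)

# Column 1 holds the abbreviation, column 2 everything after the separator.
abbrev   <- parts[, 1]
binomial <- parts[, 2]
```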
tail is also a good idea, but I would use -2 instead of 2 to keep everything except the first two elements (so that longer, messier names are kept in full):
Orgsplit_abrev <- function(x){
lapply(str_split(x," "), tail, -2)
}
or with the lambda function
Orgsplit_abrev <- function(x){
lapply(str_split(x," "), function(x) x[-(1:2)])  # negative index drops the first two elements
}
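If a plain character vector is preferred over a list, stringr's word() may be an alternative worth trying. It extracts words by position, with a space as the default separator, and a negative end position counts from the end of the string. The sample strings below, including the trailing "(ESBL)", are made up for illustration.

```r
library(stringr)

orgs <- c("EA - Enterobacter aerogenes", "KP - Klebsiella pneumoniae (ESBL)")

# Words 3 to 4 only: anchored to the front, so trailing extras are dropped.
binomial <- word(orgs, 3, 4)

# Word 3 through the last word: the equivalent of tail(x, -2), pasted back together.
rest <- word(orgs, 3, -1)
```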