Let's pretend I have a data frame with species that display colors.
df<-data.frame(name=paste("spec",1:5),
ind=c("blue;green","red","green","red;green;blue",""))
Some of the colors (blue and red) actually mean something. I can just grepl()
them and get a TRUE/FALSE column:
df$isredorblue<-grepl("blue|red",df$ind)
But now I want a column showing which of the meaningful colors were displayed.
The desired result is:
> df
name ind isredorblue searchcolor
1 spec 1 blue;green TRUE blue
2 spec 2 red TRUE red
3 spec 3 green FALSE other
4 spec 4 red;green;blue TRUE red;blue
5 spec 5 FALSE other
I have tried gsub() with a negated character class [^...], but that doesn't really work because it matches individual letters ("r", "e" or "d"), not the word "red":
> gsub("[^red]","",df$ind)
[1] "eree" "red" "ree" "redreee" ""
And now I'm thinking of using strsplit... but can't seem to figure out my next step(s)
blabla<-strsplit(df$ind, split=";")
blabla<-blabla[-which(!blabla %in% c("red","blue"))]
> blabla
[[1]]
[1] "red"
Please keep in mind this is a reprex. My actual data frame is a lot larger, and there are different indicators ("colors") that matter for different things, so I need to be able to produce these columns in as few steps as possible.
CodePudding user response:
Here are two approaches.
- Using regex -
This creates a regex pattern from color to extract from the ind column in the data. If no value is extracted, we replace the blank with 'other'.
color <- c('red', 'blue')
pat <- paste0(color, collapse = '|')
df$is_color_present <- grepl(pat, df$ind)
df$searchcolor <- sapply(stringr::str_extract_all(df$ind, pat), paste0, collapse = ';')
df$searchcolor[df$searchcolor == ''] <- 'other'
df
# name ind is_color_present searchcolor
#1 spec 1 blue;green TRUE blue
#2 spec 2 red TRUE red
#3 spec 3 green FALSE other
#4 spec 4 red;green;blue TRUE red;blue
#5 spec 5 FALSE other
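Since the question mentions several sets of indicators, the regex approach can be wrapped in a small function and called once per set. This is only a sketch in base R: extract_targets() and the warmcolor column are hypothetical names made up for illustration, and regmatches()/gregexpr() stand in for stringr::str_extract_all().

```r
# Hypothetical helper (not from any package): returns the matched targets
# joined by ";" or a fallback label when nothing matches.
extract_targets <- function(x, targets, fallback = "other") {
  pat <- paste0(targets, collapse = "|")
  # gregexpr() locates every match; regmatches() returns them as a list
  hits <- regmatches(x, gregexpr(pat, x))
  out <- vapply(hits, paste0, character(1), collapse = ";")
  out[out == ""] <- fallback
  out
}

df <- data.frame(name = paste("spec", 1:5),
                 ind = c("blue;green", "red", "green", "red;green;blue", ""),
                 stringsAsFactors = FALSE)

# One call per set of meaningful indicators
df$searchcolor <- extract_targets(df$ind, c("red", "blue"))
df$warmcolor   <- extract_targets(df$ind, "red")  # assumed second set
```

Matches are returned in the order they appear in ind, not in the order of the targets vector.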
- Without regex, using tidyverse -
We reshape the data to long format by splitting on ; and keep only those values that are present in color.
library(dplyr)
library(tidyr)
df %>%
separate_rows(ind, sep = ';') %>%
group_by(name) %>%
summarise(is_color_present = any(ind %in% color),
searchcolor = paste0(ind[ind %in% color], collapse = ';'),
searchcolor = replace(searchcolor, searchcolor == '', 'other'))
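The same split-and-filter idea can also be written in base R, which finishes the strsplit() attempt from the question. That attempt applied %in% to the whole list, so each list element was compared as a unit. The filter has to run inside each element instead. A minimal sketch, assuming ind is a character column (hence stringsAsFactors = FALSE):

```r
df <- data.frame(name = paste("spec", 1:5),
                 ind = c("blue;green", "red", "green", "red;green;blue", ""),
                 stringsAsFactors = FALSE)
color <- c("red", "blue")

# Split each entry on ";" and keep only the pieces found in color
parts <- strsplit(df$ind, split = ";", fixed = TRUE)
kept  <- lapply(parts, function(p) p[p %in% color])

df$is_color_present <- lengths(kept) > 0
df$searchcolor <- vapply(kept, paste0, character(1), collapse = ";")
df$searchcolor[!df$is_color_present] <- "other"
```

Unlike the summarise() version, this keeps the original ind column and the original row order.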
CodePudding user response:
Here's a concise solution:
library(dplyr)
library(stringr)
First define all the target colors as a vector:
targets <- c('red', 'blue')
Now use the vector, transformed into a regex alternation pattern, to extract the wanted colors in a new column:
df %>%
mutate(colors = str_extract_all(ind, paste0(targets, collapse = "|")))
name ind colors
1 spec 1 blue;green blue
2 spec 2 red red
3 spec 3 green
4 spec 4 red;green;blue red, blue
5 spec 5
If you have many color names, some of which may share the same letters (such as "red" and "darkred"), you may want to wrap word boundaries around the color names:
df %>%
mutate(colors = str_extract_all(ind, paste0("\\b(",paste0(targets, collapse = "|"), ")\\b")))
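To see what the word boundaries change, here is a small base R comparison. The input vector x is made up for illustration and is not part of the question's data.

```r
targets <- c("red", "blue")
# Hypothetical inputs containing longer color names
x <- c("darkred;blue", "red;skyblue", "red;blue")

loose  <- paste0(targets, collapse = "|")   # "red|blue"
strict <- paste0("\\b(", loose, ")\\b")     # "\\b(red|blue)\\b"

# Without boundaries, "red" is found inside "darkred" and "blue" inside "skyblue"
loose_hits  <- regmatches(x, gregexpr(loose, x))
# With boundaries, only whole words are matched
strict_hits <- regmatches(x, gregexpr(strict, x, perl = TRUE))
```

Note that \\b only separates word characters (letters, digits, underscore) from everything else, so a hyphenated name such as "dark-red" would still yield a match for "red".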
Here's another dplyr solution (not the most concise though):
df %>%
mutate(
blue = ifelse(grepl("blue", ind), "blue","other"),
red = ifelse(grepl("red", ind), "red","other"),
target = ifelse(blue=="blue"|red=="red", paste(red, blue), "other"),
target = sub("^other\\s(?=blue|red)|(?<=blue|red)\\sother$", "", target, perl = TRUE)) %>%
select(name, ind, target)  # select by name; -c(3:5) would also drop target when df has only name and ind
name ind target
1 spec 1 blue;green blue
2 spec 2 red red
3 spec 3 green other
4 spec 4 red;green;blue red blue
5 spec 5 other
Data:
df<-data.frame(name=paste("spec",1:5),
ind=c("blue;green","red","green","red;green;blue",""))