Paste only part of a vector according to a T/F column

Let's pretend I have a data frame with species that display colors.

df<-data.frame(name=paste("spec",1:5),
         ind=c("blue;green","red","green","red;green;blue",""))

Some of the colors (blue and red) actually mean something. I can just grepl() them and get a T/F column:

df$isredorblue<-grepl("blue|red",df$ind)

But now I want a column that shows which of the meaningful colors were displayed.

The desired result is:

> df
    name            ind isredorblue searchcolor
1 spec 1     blue;green        TRUE        blue
2 spec 2            red        TRUE         red
3 spec 3          green       FALSE       other
4 spec 4 red;green;blue        TRUE    red;blue
5 spec 5                      FALSE       other

I have tried gsub with a negated character class ([^...]), but that doesn't work because the class matches single letters ("r", "e" or "d"), not the word "red":

> gsub("[^red]","",df$ind)
[1] "eree"    "red"     "ree"     "redreee" ""

And now I'm thinking of using strsplit... but can't seem to figure out my next step(s)

blabla<-strsplit(df$ind, split=";") 


blabla<-blabla[-which(!blabla %in% c("red","blue"))]

> blabla
[[1]]
[1] "red"

Please keep in mind this is a reprex. My actual data frame is a lot larger, and there are different indicators ("colors") that matter for different things, so I need to be able to produce these columns in as few steps as possible.

CodePudding user response：

Here are two approaches.

Using regex -

This builds a regex pattern from color and uses it to extract every matching value from the ind column. If nothing is extracted for a row, we replace the empty string with 'other'.

color <- c('red', 'blue')
pat <- paste0(color, collapse = '|')
df$is_color_present <- grepl(pat, df$ind)
df$searchcolor <- sapply(stringr::str_extract_all(df$ind, pat), paste0, collapse = ';')
df$searchcolor[df$searchcolor == ''] <- 'other'
df
#    name            ind is_color_present searchcolor
#1 spec 1     blue;green             TRUE        blue
#2 spec 2            red             TRUE         red
#3 spec 3          green            FALSE       other
#4 spec 4 red;green;blue             TRUE    red;blue
#5 spec 5                           FALSE       other
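
If you would rather not depend on stringr, the same extract-and-collapse idea can be sketched in base R with gregexpr() and regmatches(). This is only a sketch under the same assumptions as above (the color vector and the ';'-separated ind column); stringsAsFactors = FALSE is added so it also behaves on older R versions.

```r
df <- data.frame(name = paste("spec", 1:5),
                 ind = c("blue;green", "red", "green", "red;green;blue", ""),
                 stringsAsFactors = FALSE)

color <- c("red", "blue")
pat <- paste0(color, collapse = "|")

# gregexpr() locates every match of the pattern in each string;
# regmatches() returns the matched text, one character vector per row
# (character(0) when a row has no match)
hits <- regmatches(df$ind, gregexpr(pat, df$ind))

# Collapse each row's matches with ';' and fall back to 'other'
# when nothing matched
df$searchcolor <- vapply(hits, function(x) {
  if (length(x) == 0) "other" else paste(x, collapse = ";")
}, character(1))

df$searchcolor
```

The matches come back in the order they appear in ind, so row 4 gives "red;blue".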

Without regex using tidyverse -

We reshape the data to long format by splitting ind on ; and keep only the values that are present in color.

library(dplyr)
library(tidyr)

df %>%
  separate_rows(ind, sep = ';') %>%
  group_by(name) %>%
  summarise(is_color_present = any(ind %in% color), 
            searchcolor = paste0(ind[ind %in% color], collapse = ';'), 
            searchcolor = replace(searchcolor, searchcolor == '', 'other'))
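
The same split-and-filter idea can also be sketched in base R, which roughly finishes the strsplit attempt from the question. The helper name keep_targets and its arguments are hypothetical, chosen only for illustration; wrapping the logic in a function is one possible way to reuse it for several indicator sets.

```r
# Hypothetical helper: keep only the entries of a ';'-separated string
# that appear in `targets`; return `fallback` when none of them do
keep_targets <- function(x, targets, sep = ";", fallback = "other") {
  parts <- strsplit(as.character(x), split = sep, fixed = TRUE)
  vapply(parts, function(p) {
    kept <- p[p %in% targets]
    if (length(kept) == 0) fallback else paste(kept, collapse = sep)
  }, character(1))
}

df <- data.frame(name = paste("spec", 1:5),
                 ind = c("blue;green", "red", "green", "red;green;blue", ""),
                 stringsAsFactors = FALSE)

# One call per indicator set
df$searchcolor <- keep_targets(df$ind, c("red", "blue"))
df$greencolor  <- keep_targets(df$ind, "green")
df
```

Because the comparison uses %in% on whole entries rather than a regex, a value such as "darkred" would not be picked up when the target is "red".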

CodePudding user response：

Here's a concise solution:

library(dplyr)
library(stringr)

First define all the target colors as a vector:

targets <- c('red', 'blue')

Now use the vector, transformed into a regex alternation pattern, to extract the wanted colors in a new column:

df %>%
   mutate(colors = str_extract_all(ind, paste0(targets, collapse = "|")))
    name            ind    colors
1 spec 1     blue;green      blue
2 spec 2            red       red
3 spec 3          green          
4 spec 4 red;green;blue red, blue
5 spec 5

If you have many color names, some of which contain another color name as a substring (such as "red" in "darkred"), you may want to wrap the alternation in word boundaries so that only whole color names match:

df %>%
  mutate(colors = str_extract_all(ind, paste0("\\b(",paste0(targets, collapse = "|"), ")\\b")))
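
To see what the word boundaries change, here is a small base R sketch on made-up input (the vector x is hypothetical and not part of the question's data). It compares the plain alternation with the bounded one, using perl = TRUE so that \\b is handled by the PCRE engine.

```r
# Hypothetical test strings containing "darkred"
x <- c("darkred;green", "red;blue", "darkred;red")

targets <- c("red", "blue")
plain   <- paste0(targets, collapse = "|")
bounded <- paste0("\\b(", plain, ")\\b")

# Without boundaries, the "red" inside "darkred" also counts as a match
plain_hits <- regmatches(x, gregexpr(plain, x, perl = TRUE))

# With boundaries, only standalone "red" / "blue" entries match
bounded_hits <- regmatches(x, gregexpr(bounded, x, perl = TRUE))

lengths(plain_hits)    # matches per string without boundaries
lengths(bounded_hits)  # matches per string with boundaries
```

This works here because ';' is a non-word character, so every entry in the list starts and ends at a word boundary.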

Here's another dplyr solution (not the most concise though):

df %>%
  mutate(
    blue = ifelse(grepl("blue", ind), "blue","other"),
    red = ifelse(grepl("red", ind), "red","other"),
    target = ifelse(blue=="blue"|red=="red", paste(red, blue), "other"),
    target = sub("^other\\s(?=blue|red)|(?<=blue|red)\\sother$", "", target, perl = TRUE)) %>%
  select(-c(blue, red))
    name            ind   target
1 spec 1     blue;green     blue
2 spec 2            red      red
3 spec 3          green    other
4 spec 4 red;green;blue red blue
5 spec 5                   other

Data:

df<-data.frame(name=paste("spec",1:5),
               ind=c("blue;green","red","green","red;green;blue",""))