Detecting rows with multiple observations in R-CodePudding

I have a dataset like this. I want to recognize all of the observations that have multiple values in the "color" columns and replace them with "multicolor"

ID  color1   color2
23   red      NA
44   blue     purple
51   yellow   NA
59   green    orange

like this:

ID  color   
23   red      
44   multicolor     
51   yellow     
59   multicolor

Any ideas would be much appreciated, thank you!

CodePudding user response：

Here is a way to do it in the tidyverse.

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(cols = starts_with("color"), values_to = "color", values_drop_na  = TRUE) %>% 
  group_by(ID) %>% 
  summarize(n = n(),
            color = toString(color), .groups = "drop") %>% 
  mutate(color = if_else(n > 1, "multicolor", color)) %>% 
  select(-n)

# # A tibble: 4 x 2
#      ID color     
#   <int> <chr>     
# 1    23 red       
# 2    44 multicolor
# 3    51 yellow    
# 4    59 multicolor

I did it this way on purpose. Note that if you stopped after the summarize() line, you would get the actual colors.

# # A tibble: 4 x 3
#      ID     n color        
#   <int> <int> <chr>        
# 1    23     1 red          
# 2    44     2 blue, purple 
# 3    51     1 yellow       
# 4    59     2 green, orange

This would scale if you had many color columns, not just 2. Play around with it, there are a lot of ways to tweaks things like this.

Data

df <- read.table(textConnection("ID  color1   color2
23   red      NA
44   blue     purple
51   yellow   NA
59   green    orange"), header = TRUE)

CodePudding user response：

You can do this, assuming data is your dataset.

library(dplyr)

data <- data.frame(ID = c(23, 44, 51, 59),
                   color1 = c("red", "blue", "yellow", "green"),
                   color2 = c(NA, "purple", NA, "orange"))

data %>% 
  mutate(color = ifelse(is.na(color2), color1, "multicolor")) %>% 
  select(ID, color)

CodePudding user response：

Here's what seems like a straightforward solution:

library(dplyr)
library(stringr)
data %>%
  mutate(
    # step 1 - paste `color1` and `color2` together and remove " NA":
    color = gsub("\\sNA", "", paste(color1, color2)),
    # step 2 - count the number of white space characters:
    color = str_count(color, " "),
    # step 3 - label `color` as "multicolor" where `color` != 0:
    color = ifelse(color == 0, color1, "multicolor")) %>%
  # remove the obsolete color columns: 
  select(-matches("\\d$"))
  ID      color
1 23        red
2 44 multicolor
3 51     yellow
4 59 multicolor

Data:

data <- data.frame(ID = c(23, 44, 51, 59),
                   color1 = c("red", "blue", "yellow", "green"),
                   color2 = c(NA, "purple", NA, "orange"))

CodePudding user response：

A base R approach

# get colors from columns named color*
colo <- paste(names(table(unlist(df1[,grep("color",colnames(df1))]))), collapse="|")

colo
[1] "blue|green|red|yellow|orange|purple"

# match the colors and do the conversion
data.frame( 
  ID=df1$ID, 
  color=apply( df1, 1, function(x){ 
    y=x[grep(colo, x)];
    if(length(y)>1){y="multicolor"}; y } ) )
  ID      color
1 23        red
2 44 multicolor
3 51     yellow
4 59 multicolor

Data

df1 <- structure(list(ID = c(23L, 44L, 51L, 59L), color1 = c("red", 
"blue", "yellow", "green"), color2 = c(NA, "purple", NA, "orange"
)), class = "data.frame", row.names = c(NA, -4L))