Highly selective filtering based on row contents-CodePudding

I have a dataset (rather untidy - but not my work ... helping a colleague), which has rows of values with some rows duplicated wrt one column but the other columns differ by virtue of "*"'s added to some elements. Repex below:-

a <- c("2020", "Rose", "r","r","s","s","i","i","r")
b <- c("2020", "Rose","r*","r*","s*","s*","s*","s*","s*")
c <- c("2020", "Lily","r","r","s","s","i","i","r")
d <- c("2020", "Tulip","r*","r*","r*","r*","s*","r*","r*")
e <- c("2020", "Tulip","s","s","r","s","s","r","r")

data <- rbind(a,b,c,d,e)

so my data frame looks like this ...

  [,1]   [,2]    [,3] [,4] [,5] [,6] [,7] [,8] [,9]
a "2020" "Rose"  "r"  "r"  "s"  "s"  "i"  "i"  "r" 
b "2020" "Rose"  "r*" "r*" "s*" "s*" "s*" "s*" "s*"
c "2020" "Lily"  "r"  "r"  "s"  "s"  "i"  "i"  "r" 
d "2020" "Tulip" "r*" "r*" "r*" "r*" "s*" "r*" "r*"
e "2020" "Tulip" "s"  "s"  "r"  "s"  "s"  "r"  "r"

I need to remove the rows that are duplicates in column 2 ("Rose', "Lily" etc) and selectively keep the rows with the *'s so it looks like this ...

  [,1]   [,2]    [,3] [,4] [,5] [,6] [,7] [,8] [,9]
b "2020" "Rose"  "r*" "r*" "s*" "s*" "s*" "s*" "s*"
c "2020" "Lily"  "r"  "r"  "s"  "s"  "i"  "i"  "r" 
d "2020" "Tulip" "r*" "r*" "r*" "r*" "s*" "r*" "r*"

I have the feeling that a function bundled in with lapply might be the right approach but have no idea how to proceed with that !! - any thoughts

CodePudding user response：

You can try this. For the second condition (*s) it only checks the 3rd column since it seems they're all or none.

rmv <- names( table( data[,2] )[table( data[,2] ) > 1] )

data[ !( data[,2] %in% rmv & !grepl("\\*",data[,3])), ]
  [,1]   [,2]    [,3] [,4] [,5] [,6] [,7] [,8] [,9]
b "2020" "Rose"  "r*" "r*" "s*" "s*" "s*" "s*" "s*"
c "2020" "Lily"  "r"  "r"  "s"  "s"  "i"  "i"  "r"
d "2020" "Tulip" "r*" "r*" "r*" "r*" "s*" "r*" "r*"

In case it has to select based on ANY * (at least one) use this

data[ !( data[,2] %in% rmv & apply( data[,3:9], 1, function(x) 
  any(!grepl("\\*",x)) )), ]
  [,1]   [,2]    [,3] [,4] [,5] [,6] [,7] [,8] [,9]
b "2020" "Rose"  "r*" "r*" "s*" "s*" "s*" "s*" "s*"
c "2020" "Lily"  "r"  "r"  "s"  "s"  "i"  "i"  "r"
d "2020" "Tulip" "r*" "r*" "r*" "r*" "s*" "r*" "r*"

CodePudding user response：

First, you talk about data frames, but so far you use a matrix. So let's make a data frame first.

df <- as.data.frame(data)

Second, we could use by() which essentially works like lapply(split(x, g), FUN). As split factors we use the first two columns 1:2 and apply grepl on each slice. Finally rbind.

df <- by(df, df[1:2], \(x) {
  if (nrow(x) > 1) {
    x[grepl('\\*', x$V3), ]
  } else x}) |> (\(.) do.call(rbind, .))()

df
#     V1    V2 V3 V4 V5 V6 V7 V8 V9
# c 2020  Lily  r  r  s  s  i  i  r
# b 2020  Rose r* r* s* s* s* s* s*
# d 2020 Tulip r* r* r* r* s* r* r*

To clean the row names, add this:

|> `rownames<-`(NULL)

Note: R version 4.1.2 (2021-11-01).

Data:

data <- structure(c("2020", "2020", "2020", "2020", "2020", "Rose", "Rose", 
"Lily", "Tulip", "Tulip", "r", "r*", "r", "r*", "s", "r", "r*", 
"r", "r*", "s", "s", "s*", "s", "r*", "r", "s", "s*", "s", "r*", 
"s", "i", "s*", "i", "s*", "s", "i", "s*", "i", "r*", "r", "r", 
"s*", "r", "r*", "r"), .Dim = c(5L, 9L), .Dimnames = list(c("a", 
"b", "c", "d", "e"), NULL))