I have a dataset (rather untidy - but not my work ... helping a colleague), which has rows of values with some rows duplicated wrt one column but the other columns differ by virtue of "*"'s added to some elements. Repex below:-
a <- c("2020", "Rose", "r","r","s","s","i","i","r")
b <- c("2020", "Rose","r*","r*","s*","s*","s*","s*","s*")
c <- c("2020", "Lily","r","r","s","s","i","i","r")
d <- c("2020", "Tulip","r*","r*","r*","r*","s*","r*","r*")
e <- c("2020", "Tulip","s","s","r","s","s","r","r")
data <- rbind(a,b,c,d,e)
so my data frame looks like this ...
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
a "2020" "Rose" "r" "r" "s" "s" "i" "i" "r"
b "2020" "Rose" "r*" "r*" "s*" "s*" "s*" "s*" "s*"
c "2020" "Lily" "r" "r" "s" "s" "i" "i" "r"
d "2020" "Tulip" "r*" "r*" "r*" "r*" "s*" "r*" "r*"
e "2020" "Tulip" "s" "s" "r" "s" "s" "r" "r"
I need to remove the rows that are duplicates in column 2 ("Rose', "Lily" etc) and selectively keep the rows with the *'s so it looks like this ...
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
b "2020" "Rose" "r*" "r*" "s*" "s*" "s*" "s*" "s*"
c "2020" "Lily" "r" "r" "s" "s" "i" "i" "r"
d "2020" "Tulip" "r*" "r*" "r*" "r*" "s*" "r*" "r*"
I have the feeling that a function bundled in with lapply might be the right approach but have no idea how to proceed with that !! - any thoughts
CodePudding user response:
You can try this. For the second condition (*s) it only checks the 3rd column since it seems they're all or none.
rmv <- names( table( data[,2] )[table( data[,2] ) > 1] )
data[ !( data[,2] %in% rmv & !grepl("\\*",data[,3])), ]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
b "2020" "Rose" "r*" "r*" "s*" "s*" "s*" "s*" "s*"
c "2020" "Lily" "r" "r" "s" "s" "i" "i" "r"
d "2020" "Tulip" "r*" "r*" "r*" "r*" "s*" "r*" "r*"
In case it has to select based on ANY * (at least one) use this
data[ !( data[,2] %in% rmv & apply( data[,3:9], 1, function(x)
any(!grepl("\\*",x)) )), ]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
b "2020" "Rose" "r*" "r*" "s*" "s*" "s*" "s*" "s*"
c "2020" "Lily" "r" "r" "s" "s" "i" "i" "r"
d "2020" "Tulip" "r*" "r*" "r*" "r*" "s*" "r*" "r*"
CodePudding user response:
First, you talk about data frames, but so far you use a matrix. So let's make a data frame first.
df <- as.data.frame(data)
Second, we could use by()
which essentially works like lapply(split(x, g), FUN)
. As split factors we use the first two columns 1:2
and apply grepl
on each slice. Finally rbind
.
df <- by(df, df[1:2], \(x) {
if (nrow(x) > 1) {
x[grepl('\\*', x$V3), ]
} else x}) |> (\(.) do.call(rbind, .))()
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# c 2020 Lily r r s s i i r
# b 2020 Rose r* r* s* s* s* s* s*
# d 2020 Tulip r* r* r* r* s* r* r*
To clean the row names, add this:
|> `rownames<-`(NULL)
Note: R version 4.1.2 (2021-11-01).
Data:
data <- structure(c("2020", "2020", "2020", "2020", "2020", "Rose", "Rose",
"Lily", "Tulip", "Tulip", "r", "r*", "r", "r*", "s", "r", "r*",
"r", "r*", "s", "s", "s*", "s", "r*", "r", "s", "s*", "s", "r*",
"s", "i", "s*", "i", "s*", "s", "i", "s*", "i", "r*", "r", "r",
"s*", "r", "r*", "r"), .Dim = c(5L, 9L), .Dimnames = list(c("a",
"b", "c", "d", "e"), NULL))