In R data.table conditionally remove rows based on elements in other columns with multiple condition-CodePudding

How do I remove rows in a data.table based on rows in that data.table? Reproduceable Example:

library(data.table)
DT = data.table(
  Subject = rep(c("A", "B"), 4:3),
  Test = rep(c("TEST_A","TEST_B"), 4:3),
  Folder = rep(c("D1", "Screen"), 4:3),
  Date = as.Date(
    sprintf("10-d-d", c(22:25, 25:27), rep(1:2, 4:3)),
    '%m-%d-%y'
  )
)

DT[Test == "TEST_A", Date := "na"]

Here is my logic: For every SUBJECT Remove all rows where TEST_A & folder "D1" has no date but TEST_B & Folder "Screen" has a date. I know I can just write an if statement, but I am trying to keep this using only data.table for readability.

CodePudding user response：

If I am understanding your question correctly, you want to remove the rows that have a date value of NA. If that's the case then the below answer should work.

Please note that I am not that familiar with data.table, so I had to convert DT to a dataframe first.

library(dplyr)

# Convert to a dataframe 
DT_df = as.data.frame(DT)

# Drop rows with NAs
DT_df = DT_df %>%  drop_na()

# Convert back to a data.table
DT_New = as.data.table(DT_df, TRUE)

Output:

dput(DT_df)

structure(list(Subject = c("B", "B", "B"), Test = c("TEST_B", 
"TEST_B", "TEST_B"), Folder = c("Screen", "Screen", "Screen"), 
    Date = structure(c(11985, 11986, 11987), class = "Date")), index = structure(integer(0), "`__Test`" = integer(0)), class = "data.frame", row.names = c(NA, 
-3L))

CodePudding user response：

Try this:

DT[, .SD[ !(any(is.na(Date) & Test == "TEST_A" & Folder == "D1") && 
            any(!is.na(Date) & Test == "TEST_B" & Folder == "Screen")), ], by = Subject]

Your sample data does not contain the conditions you specified, so this does nothing here but should work with more representative data.