How can I remove all rows of a dataframe that have the same string value across a subset of columns-CodePudding

I have a dataframe that looks like this:

ID	Time	Q1	Q2	Q3	Q4
1	2 min	Agree	NA	Neutral	NA
2	5 min	NA	Disagree	Agree	NA
3	3 min	Agree	NA	Neutral	NA
4	5 min	Disagree	Disagree	NA	NA
5	6 min	NA	Agree	Agree	Agree
6	1 min	NA	NA	NA	NA

I want to retain only the rows for which responses to the questions are not equivalent across Question columns (Q1:Q4). In this example, I would retain rows for IDs 1-3 and remove rows 4-6, as they are all identical strings. I do want to keep the information from the first two columns, but I don't want to use it in the decision logic regarding whether or not to keep the row. All rows have NAs, but the NAs are in various places - so I want to remove rows for which all columns that have any value in them are the same, or rows for which all values across columns are missing.

I found another answer that did something similar and tried this:

keep <- apply(df[3:6], 1, function(x) length(unique(x[!is.na(x)])) != 1)
df[keep, ]

but that seemed to only remove rows that were all NAs.

CodePudding user response：

To keep all those with discordant responses (assumed ignoring NAs), you can try this slight change in your attempted code:

keeps <- apply(df[3:6], 1, function(x) !length(unique(x[!is.na(x)])) %in% 0:1)
df[keeps, ]

#    ID Time    Q1       Q2      Q3   Q4
# 1  1 2min Agree     <NA> Neutral <NA>
# 2  2 5min  <NA> Disagree   Agree <NA>
# 3  3 3min Agree     <NA> Neutral <NA>

If you want all those with identical responses (assumed ignoring NAs):

keeps <- apply(df[3:6], 1, function(x) length(unique(x[!is.na(x)])) == 1)
df[keeps, ]

#    ID Time       Q1       Q2    Q3    Q4
# 4  4 5min Disagree Disagree  <NA>  <NA>
# 5  5 6min     <NA>    Agree Agree Agree

Data

df <- read.table(text = "ID Time    Q1  Q2  Q3  Q4
1   2min    Agree   NA  Neutral NA
2   5min    NA  Disagree    Agree   NA
3   3min    Agree   NA  Neutral NA
4   5min    Disagree    Disagree    NA  NA
5   6min    NA  Agree   Agree   Agree
6   1min    NA  NA  NA  NA", header = TRUE)

CodePudding user response：

Here's a dplyr-based solution: group using rowwise(), then filter to rows with > 1 distinct value across columns.

library(dplyr)

df %>% 
  rowwise() %>% 
  filter(n_distinct(c_across(Q1:Q4), na.rm = TRUE) > 1) %>% 
  ungroup()

# A tibble: 3 × 6
     ID Time  Q1    Q2       Q3      Q4   
  <int> <chr> <chr> <chr>    <chr>   <chr>
1     1 2min  Agree NA       Neutral NA   
2     2 5min  NA    Disagree Agree   NA   
3     3 3min  Agree NA       Neutral NA

CodePudding user response：

data <- data.frame(ID = 1:6, Time = c("2 min", "5 min", "3 min", 
"5 min", "6 min", "1 min"), Q1 = c("Agree", NA, "Agree", "Disagree", 
NA, NA), Q2 = c(NA, "Disagree", NA, "Disagree", "Agree", NA), 
    Q3 = c("Neutral", "Agree", "Neutral", NA, "Agree", NA), Q4 = c(NA, 
    NA, NA, NA, "Agree", NA))

rows <- apply(data[3:6], 1, \(x) all(x[!is.na(x)][1] == x[!is.na(x)][-1]) )

data[!rows,]
#>   ID  Time    Q1       Q2      Q3   Q4
#> 1  1 2 min Agree     <NA> Neutral <NA>
#> 2  2 5 min  <NA> Disagree   Agree <NA>
#> 3  3 3 min Agree     <NA> Neutral <NA>