Home > OS >  How can I remove all rows of a dataframe that have the same string value across a subset of columns
How can I remove all rows of a dataframe that have the same string value across a subset of columns

Time:12-29

I have a dataframe that looks like this:

ID Time Q1 Q2 Q3 Q4
1 2 min Agree NA Neutral NA
2 5 min NA Disagree Agree NA
3 3 min Agree NA Neutral NA
4 5 min Disagree Disagree NA NA
5 6 min NA Agree Agree Agree
6 1 min NA NA NA NA

I want to retain only the rows for which responses to the questions are not equivalent across Question columns (Q1:Q4). In this example, I would retain rows for IDs 1-3 and remove rows 4-6, as they are all identical strings. I do want to keep the information from the first two columns, but I don't want to use it in the decision logic regarding whether or not to keep the row. All rows have NAs, but the NAs are in various places - so I want to remove rows for which all columns that have any value in them are the same, or rows for which all values across columns are missing.

I found another answer that did something similar and tried this:

keep <- apply(df[3:6], 1, function(x) length(unique(x[!is.na(x)])) != 1)
df[keep, ]

but that seemed to only remove rows that were all NAs.

CodePudding user response:

To keep all those with discordant responses (assumed ignoring NAs), you can try this slight change in your attempted code:

keeps <- apply(df[3:6], 1, function(x) !length(unique(x[!is.na(x)])) %in% 0:1)
df[keeps, ]

#    ID Time    Q1       Q2      Q3   Q4
# 1  1 2min Agree     <NA> Neutral <NA>
# 2  2 5min  <NA> Disagree   Agree <NA>
# 3  3 3min Agree     <NA> Neutral <NA>

If you want all those with identical responses (assumed ignoring NAs):

keeps <- apply(df[3:6], 1, function(x) length(unique(x[!is.na(x)])) == 1)
df[keeps, ]

#    ID Time       Q1       Q2    Q3    Q4
# 4  4 5min Disagree Disagree  <NA>  <NA>
# 5  5 6min     <NA>    Agree Agree Agree

Data

df <- read.table(text = "ID Time    Q1  Q2  Q3  Q4
1   2min    Agree   NA  Neutral NA
2   5min    NA  Disagree    Agree   NA
3   3min    Agree   NA  Neutral NA
4   5min    Disagree    Disagree    NA  NA
5   6min    NA  Agree   Agree   Agree
6   1min    NA  NA  NA  NA", header = TRUE)

CodePudding user response:

Here's a dplyr-based solution: group using rowwise(), then filter to rows with > 1 distinct value across columns.

library(dplyr)

df %>% 
  rowwise() %>% 
  filter(n_distinct(c_across(Q1:Q4), na.rm = TRUE) > 1) %>% 
  ungroup()
# A tibble: 3 × 6
     ID Time  Q1    Q2       Q3      Q4   
  <int> <chr> <chr> <chr>    <chr>   <chr>
1     1 2min  Agree NA       Neutral NA   
2     2 5min  NA    Disagree Agree   NA   
3     3 3min  Agree NA       Neutral NA   

CodePudding user response:

data <- data.frame(ID = 1:6, Time = c("2 min", "5 min", "3 min", 
"5 min", "6 min", "1 min"), Q1 = c("Agree", NA, "Agree", "Disagree", 
NA, NA), Q2 = c(NA, "Disagree", NA, "Disagree", "Agree", NA), 
    Q3 = c("Neutral", "Agree", "Neutral", NA, "Agree", NA), Q4 = c(NA, 
    NA, NA, NA, "Agree", NA))

rows <- apply(data[3:6], 1, \(x) all(x[!is.na(x)][1] == x[!is.na(x)][-1]) )

data[!rows,]
#>   ID  Time    Q1       Q2      Q3   Q4
#> 1  1 2 min Agree     <NA> Neutral <NA>
#> 2  2 5 min  <NA> Disagree   Agree <NA>
#> 3  3 3 min Agree     <NA> Neutral <NA>
  • Related