Home > Software engineering >  Subsetting rows which contain more than one entry across selection of columns
Subsetting rows which contain more than one entry across selection of columns

Time:10-16

I'm trying to subset rows which contain Y in multiple columns while excluding rows which only contain Y in one column.

A solution in R is desired.

This is what my data looks like:

enter image description here

My desired output in R would look like this: enter image description here

Thanks in advance

CodePudding user response:

library(dplyr)

df <- tibble(w = rbinom(100, 1, 0.5), x = rbinom(100, 1, 0.5), y = rbinom(100, 1, 0.5)) |> 
  mutate(across(everything(), ~ if_else(.x == 1, "y", NA_character_)))

count_if <- function(x) { if_else(x == "y", 1, 0)}

df |> 
  rowwise() |> 
  mutate(count = sum(count_if(c_across(everything())), na.rm = TRUE)) |> 
  ungroup() |>
  filter(count > 1)

# A tibble: 51 × 4
   w     x     y     count
   <chr> <chr> <chr> <dbl>
 1 y     NA    y         2
 2 y     y     NA        2
 3 NA    y     y         2
 4 y     y     y         3
 5 NA    y     y         2
 6 y     y     NA        2
 7 y     y     NA        2
 8 y     y     NA        2
 9 NA    y     y         2
10 NA    y     y         2

CodePudding user response:

The idea is to convert your matrix to a numeric matrix containing ones and zeroes, sum the rows, and take only those rows whose sum is 1. See below. I don't recommend defining your data this way; it was just the quickest way I could reproduce your data.

rowA=c(' ','Y',' ',' ','Y',' ',' ',' ',' ','Y')
rowB=c(' ',' ','Y',' ',' ',' ',' ',' ',' ',' ')
rowC=c(' ',' ',' ','Y',' ',' ',' ',' ',' ',' ')
rowD=c(' ',' ','Y',' ',' ','Y',' ','Y','Y',' ')
rowE=c(' ','Y','Y',' ',' ',' ',' ',' ',' ',' ')
rowF=c(' ',' ',' ','Y',' ',' ',' ',' ',' ',' ')
my.data=rbind(rowA,rowB,rowC,rowD,rowE,rowF)
rownames(my.data)=LETTERS[1:6]
step2=ifelse(my.data=='Y',1,0)
v=rowSums(step2)
my.data[v>1,]

Gives the following output:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
A " "  "Y"  " "  " "  "Y"  " "  " "  " "  " "  "Y"  
D " "  " "  "Y"  " "  " "  "Y"  " "  "Y"  "Y"  " "  
E " "  "Y"  "Y"  " "  " "  " "  " "  " "  " "  " "  

You could also avoid creating meaningless variables by combining into fewer lines.

my.data[rowSums(ifelse(my.data=='Y',1,0))>1,]
  • Related