How to select data based on two conditions?

Time:01-05

Using R, I want to prepare my data for analysis by selecting only cows that have been mated with both an X (= breed 1) bull and a Y (= breed 2) bull. My data currently looks as follows:

Cow Parity Bullbreed
1 1 X
1 2 X
1 3 Y
2 1 X
2 2 X
2 3 X
3 1 X
3 2 Y
3 3 Y
4 1 Y
4 2 Y
4 3 Y

Cows 1 and 3 have been pregnant by two different bull breeds, whereas cows 2 and 4 have only been pregnant by one bull breed. I therefore want to remove cows 2 and 4 (and all other animals that have been pregnant by only one bull breed) from my data, so that it looks like this:

Cow Parity Bullbreed
1 1 X
1 2 X
1 3 Y
3 1 X
3 2 Y
3 3 Y

My real dataset also has only two bull breeds, but the cow numbers are more specific identifiers rather than 1, 2, 3, 4, ..., N.

Is there an easy way to do this selection?

I tried checking 'by hand' which cows were pregnant by only one bull breed, but my data consists of over 600,000 rows. First checking which animals have only been pregnant by breed X or only by breed Y, and then deleting them from the data, therefore takes too long.

Answer:

Using dplyr::n_distinct, you could group by Cow and keep only the groups with more than one distinct Bullbreed:

library(dplyr)

dat |> 
  group_by(Cow) |> 
  filter(n_distinct(Bullbreed) > 1) |> 
  ungroup()
#> # A tibble: 6 × 3
#>     Cow Parity Bullbreed
#>   <int>  <int> <chr>    
#> 1     1      1 X        
#> 2     1      2 X        
#> 3     1      3 Y        
#> 4     3      1 X        
#> 5     3      2 Y        
#> 6     3      3 Y
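
The n_distinct filter keeps any cow with more than one breed, whatever those breeds are. If a third breed code or a typo could ever appear in the real data, a variant that names the two breeds explicitly may be safer. This is only a sketch, assuming the breed codes are exactly "X" and "Y" as in the sample; the sample data is repeated so the block runs on its own, and `both` is just an illustrative name:

```r
library(dplyr)

# Sample data copied from the question
dat <- data.frame(
  Cow = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  Parity = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
  Bullbreed = c("X", "X", "Y", "X", "X", "X",
                "X", "Y", "Y", "Y", "Y", "Y")
)

# Keep a cow only if both "X" and "Y" occur among its rows
# (assumption: these are the exact breed codes in the real data)
both <- dat |>
  group_by(Cow) |>
  filter(all(c("X", "Y") %in% Bullbreed)) |>
  ungroup()
```

On the sample data this should keep the same cows as the n_distinct version; the two only differ when a cow has a breed other than X or Y.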

DATA

dat <- data.frame(
               Cow = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
            Parity = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
         Bullbreed = c("X","X","Y","X","X","X",
                       "X","Y","Y","Y","Y","Y")
)
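
If you would rather not load dplyr, the same selection can probably be done in base R with vectorised operations, which should be fine for 600,000 rows. A minimal sketch, again assuming the breed codes are "X" and "Y"; `has_x`, `has_y`, `both_ids` and `res` are just illustrative names:

```r
# Sample data copied from above so the block runs on its own
dat <- data.frame(
  Cow = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  Parity = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
  Bullbreed = c("X", "X", "Y", "X", "X", "X",
                "X", "Y", "Y", "Y", "Y", "Y")
)

# Cows with at least one X mating, and cows with at least one Y mating
has_x <- unique(dat$Cow[dat$Bullbreed == "X"])
has_y <- unique(dat$Cow[dat$Bullbreed == "Y"])

# Cows that appear in both sets have been mated with both breeds
both_ids <- intersect(has_x, has_y)

# Keep all rows belonging to those cows
res <- dat[dat$Cow %in% both_ids, ]
```

The row names of `res` are carried over from `dat`; `rownames(res) <- NULL` resets them if that matters.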