Using R, I want to prepare my data for analysis by selecting only cows that have been mated with both an X (= breed 1) bull and a Y (= breed 2) bull. My data currently looks like this:
Cow | Parity | Bullbreed |
---|---|---|
1 | 1 | X |
1 | 2 | X |
1 | 3 | Y |
2 | 1 | X |
2 | 2 | X |
2 | 3 | X |
3 | 1 | X |
3 | 2 | Y |
3 | 3 | Y |
4 | 1 | Y |
4 | 2 | Y |
4 | 3 | Y |
Cows 1 and 3 have been pregnant by bulls of two different breeds, whereas cows 2 and 4 have only been pregnant by bulls of one breed. I therefore want to remove cows 2 and 4 (and all other animals that have been pregnant by only one bull breed) from my data, so that it looks like this:
Cow | Parity | Bullbreed |
---|---|---|
1 | 1 | X |
1 | 2 | X |
1 | 3 | Y |
3 | 1 | X |
3 | 2 | Y |
3 | 3 | Y |
My real dataset also contains only two bull breeds, but the cow numbers are more elaborate identifiers rather than 1, 2, 3, 4, ..., N.
Is there an easy way to do this selection?
I tried checking 'by hand' which cows were pregnant by only one bull breed, but my data consists of over 600,000 rows. First working out which animals have only been pregnant by breed X or only by breed Y, and then deleting them from the data, therefore takes too long.
CodePudding user response:
Using dplyr::n_distinct() you could group the rows by Cow and keep only the groups that contain more than one distinct Bullbreed:
library(dplyr)

dat |>
  group_by(Cow) |>
  filter(n_distinct(Bullbreed) > 1) |>
  ungroup()
#> # A tibble: 6 × 3
#>     Cow Parity Bullbreed
#>   <int>  <int> <chr>
#> 1     1      1 X
#> 2     1      2 X
#> 3     1      3 Y
#> 4     3      1 X
#> 5     3      2 Y
#> 6     3      3 Y
DATA
dat <- data.frame(
  Cow = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  Parity = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
  Bullbreed = c("X", "X", "Y", "X", "X", "X",
                "X", "Y", "Y", "Y", "Y", "Y")
)
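
If you would rather not load a package, the same selection can probably be done in base R. The sketch below is only one possible approach and relies on the statement in the question that there are exactly two breeds, coded "X" and "Y". The names cows_x, cows_y, both and result are made up for illustration, and the sample data is repeated so the snippet runs on its own.

```r
# Sample data from the question, repeated so this snippet is self-contained
dat <- data.frame(
  Cow = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  Parity = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
  Bullbreed = c("X", "X", "Y", "X", "X", "X",
                "X", "Y", "Y", "Y", "Y", "Y")
)

# Cows with at least one X mating and cows with at least one Y mating
# (assumes the breeds are coded exactly "X" and "Y")
cows_x <- unique(dat$Cow[dat$Bullbreed == "X"])
cows_y <- unique(dat$Cow[dat$Bullbreed == "Y"])

# Cows that appear in both sets have been mated with both breeds
both <- intersect(cows_x, cows_y)

# Keep every row belonging to one of those cows
result <- dat[dat$Cow %in% both, ]
result
```

Because it only uses vectorised comparisons, unique(), intersect() and %in%, this should cope with a few hundred thousand rows without trouble, and it does not depend on the cow identifiers being consecutive integers.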