I have a dataset (ds) with two columns. There are either one or two entries with the same number in "match". "status" is a binary variable. There are pairs, for example, the value 12 in match appears twice, one for where status is 1 and 0 for the other. Yet, there are also observations in match who do not have a partner, in this dataset it would be 3,8,33,17 who have no partner.
match status
12 1
3 1
5 0
8 1
33 0
5 1
12 0
17 0
What I want to do is to create a new dataset that only contains observations of pairs (thus if a value appears twice). In my example, it would be
match status
12 1
12 0
5 0
5 1
The status variable in the final dataset would be 50/50 because a value in match (for example 12) has an observation where status = 0 and one where status = 1. The actual dataset I´m working with has over 50k observations so I cannot just search and filter by each number. What I tried is:
numbers <- table(ds$match)
numbers <- as.data.frame(numbers)
numbers <- numbers[numbers$Freq == 2,]
numbers <- numbers$Var1
ds$keep <- ifelse(numbers %in% ds$match, 1, 0)
Here I get the error "replacement has 23005 rows, data has 39021" If I could get around this error I think I could just run
ds <- filter(ds, ds$keep == 1)
to get the dataset that I want. This was my most promising approach. I tried a few other things but it always came done to the fact that the status variable wasn´t 50/50 so I couldn´t manage to exclude all observations without a pair. Does someone have an idea how I could fix my code or is there a solution that would be quicker/more smooth? Thanks for any help in advance!
CodePudding user response:
library(dplyr)
ds %>% group_by(match) %>% filter(n()>1) %>% arrange(match,status)
match status
<dbl> <dbl>
1 5 0
2 5 1
3 12 0
4 12 1
You can also do something like this:
ds <- ds[order(ds$match),]
id = rle(ds$match)
ds[ds$match %in% id$values[id$lengths>1],]
match status
<dbl> <dbl>
1 5 0
2 5 1
3 12 1
4 12 0