I have a list of patients from my insurance company, for some reason sometimes they are given a new policy number, but it should not, it's the oldest insurance number the one that matters, here is an example.
InsuranceNumber <- c("00932", "00932", "00932", "00987", "00987", "00915", "00915", "00923" , "00977")
PatientName <- c("Patient1", "Patient1", "Patient1", "Patient1", "Patient1", "Patient1", "Patient1", "Patient2", "Patient2")
df <- data.frame(InsuranceNumber, PatientName)
df
InsuranceNumber PatientName
00932 Patient1
00932 Patient1
00932 Patient1
00987 Patient1
00987 Patient1
00915 Patient1
00915 Patient1
00923 Patient2
00977 Patient2
I need a code in R that returns a data frame that, only takes the first Insurance number and removes all others, it should be like this
"InsuranceNumber" "PatientName"
00932 Patient1
00932 Patient1
00932 Patient1
00923 Patient2
Thanks in advance
CodePudding user response:
library(dplyr)
df %>%
group_by(PatientName) %>%
filter(InsuranceNumber == first(InsuranceNumber))
# A tibble: 4 x 2
# Groups: PatientName [2]
InsuranceNumber PatientName
<chr> <chr>
1 00932 Patient1
2 00932 Patient1
3 00932 Patient1
4 00923 Patient2
CodePudding user response:
A base R option using ave
-
result <- subset(df, InsuranceNumber == ave(InsuranceNumber, PatientName,
FUN = function(x) x[1]))
result
# InsuranceNumber PatientName
#1 00932 Patient1
#2 00932 Patient1
#3 00932 Patient1
#8 00923 Patient2
CodePudding user response:
We can use subset
from base R
with %in%
(or ==
) ones that are not duplicated
subset(df, InsuranceNumber %in% InsuranceNumber[!duplicated(PatientName)])
InsuranceNumber PatientName
1 00932 Patient1
2 00932 Patient1
3 00932 Patient1
8 00923 Patient2
NOTE: we are not using any group by approach and should be faster