Filtering based on the first insurance ID-CodePudding

I have a list of patients from my insurance company, for some reason sometimes they are given a new policy number, but it should not, it's the oldest insurance number the one that matters, here is an example.

InsuranceNumber <- c("00932", "00932", "00932", "00987", "00987", "00915", "00915", "00923" , "00977")
PatientName <- c("Patient1", "Patient1", "Patient1", "Patient1", "Patient1", "Patient1", "Patient1", "Patient2", "Patient2")

df <- data.frame(InsuranceNumber, PatientName)

df

InsuranceNumber            PatientName
00932                      Patient1 
00932                      Patient1 
00932                      Patient1  
00987                      Patient1
00987                      Patient1
00915                      Patient1
00915                      Patient1
00923                      Patient2
00977                      Patient2

I need a code in R that returns a data frame that, only takes the first Insurance number and removes all others, it should be like this

   "InsuranceNumber"          "PatientName"
    00932                      Patient1 
    00932                      Patient1 
    00932                      Patient1
    00923                      Patient2

Thanks in advance

CodePudding user response：

library(dplyr)

df %>% 
  group_by(PatientName) %>% 
  filter(InsuranceNumber == first(InsuranceNumber))

# A tibble: 4 x 2
# Groups:   PatientName [2]
  InsuranceNumber PatientName
  <chr>           <chr>      
1 00932           Patient1   
2 00932           Patient1   
3 00932           Patient1   
4 00923           Patient2

CodePudding user response：

A base R option using ave -

result <- subset(df, InsuranceNumber == ave(InsuranceNumber, PatientName, 
                      FUN = function(x) x[1]))

result

#   InsuranceNumber PatientName
#1           00932    Patient1
#2           00932    Patient1
#3           00932    Patient1
#8           00923    Patient2

CodePudding user response：

We can use subset from base Rwith %in% (or ==) ones that are not duplicated

subset(df, InsuranceNumber %in% InsuranceNumber[!duplicated(PatientName)])
  InsuranceNumber PatientName
1           00932    Patient1
2           00932    Patient1
3           00932    Patient1
8           00923    Patient2

NOTE: we are not using any group by approach and should be faster