I have the following dataframe:
id children age
1 AB22 2 31
2 AB25 1 27
3 AB25 1 25
4 AB25 3 33
5 AB50 2 40
6 AB35 1 37
7 AB35 1 39
8 AB40 1 41
I have to check if there are duplicate records in the id field. The idea is to assign a new id to the duplicate records (without repeating them), so that their numerical part has to be a value greater than the maximum numerical value of the id field. In the example, the maximum value is 50 (row 5). Therefore, the dataframe could be as follows (or any number greater than the maximum):
id children age
1 AB22 2 31
2 AB25 1 27
3 AB51 1 25
4 AB52 3 33
5 AB50 2 40
6 AB35 1 37
7 AB53 1 39
8 AB40 1 41
How I could achieve it?
CodePudding user response:
Using dplyr
and stringr
:
data %>%
mutate(dupli = cumsum(as.numeric(duplicated(id)))) %>%
mutate(id = ifelse(duplicated(id),
paste("AB",max(as.numeric(str_remove_all(id,"^AB"))) dupli,sep=""),
id)) %>%
select(-dupli)
Output:
id children age
1 AB22 2 31
2 AB25 1 27
3 AB51 1 25
4 AB52 3 33
5 AB50 2 40
6 AB35 1 37
7 AB53 1 39
8 AB40 1 41
Data:
structure(list(id = c("AB22", "AB25", "AB25", "AB25", "AB50",
"AB35", "AB35", "AB40"), children = c(2L, 1L, 1L, 3L, 2L, 1L,
1L, 1L), age = c(31L, 27L, 25L, 33L, 40L, 37L, 39L, 41L)), row.names = c(NA,
-8L), class = "data.frame")
CodePudding user response:
Try
df$id1=gsub("[^a-zA-Z]", "", df$id)
df$id2=as.numeric(gsub("\\D", "", df$id))
tmp=duplicated(df$id)
df[tmp,"id"]=paste0(df$id1[tmp],seq(max(df$id2) 1,max(df$id2) sum(tmp)))
id children age id1 id2
1 AB22 2 31 AB 22
2 AB25 1 27 AB 25
3 AB51 1 25 AB 25
4 AB52 3 33 AB 25
5 AB50 2 40 AB 50
6 AB35 1 37 AB 35
7 AB53 1 39 AB 35
8 AB40 1 41 AB 40