Based on the data and code below how can I replace duplicates by adding a
, b
, c
and so on for each duplicate value except the first one?
Please note that in the actual data provided to me, there are thousands of entries, so there could be any number of duplicates thus, it would be hard for me to manually find each and every duplicate value in the data. So, I don't know if it might be a problem in not identifying the duplicates first before replacing them. Maybe there is a way to identify them first, which as of now I don't know.
Code:
# Sample data
df = structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 35555, 35555, 35555
), year = c(2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022,
2022, 2022)), class = "data.frame", row.names = c(NA, -11L))
# Desired output
df = structure(list(id = c(1, "1a", "1b", "1c", 2, "2a", "2b", "2c", 35555, "35555a", "35555b"
), year = c(2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022,
2022, 2022)), class = "data.frame", row.names = c(NA, -11L))
# Replace/modify duplicates
CodePudding user response:
This will work as long as none of your runs is longer than 27 (blank letters of the alphabet). If you have longer runs, replace letters
with a longer array of your choice.
df$id <- unlist(tapply(df$id, df$id, function(id) paste0(id, c("", letters)[1:length(id)])))
df
id year
1 1 2022
2 1a 2022
3 1b 2022
4 1c 2022
5 2 2022
6 2a 2022
7 2b 2022
8 2c 2022
9 35555 2022
10 35555a 2022
11 35555b 2022
CodePudding user response:
Here's an option:
library(dplyr)
library(data.table)
# Sample data
df = structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 35555, 35555, 35555
), year = c(2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022,
2022, 2022)), class = "data.frame", row.names = c(NA, -11L))
df %>%
group_by(id, year) %>%
mutate(tmp = data.table::rleid(row_number()),
id = paste0(id, letters[tmp])) %>%
select(-tmp)
#> # A tibble: 11 × 2
#> # Groups: id, year [11]
#> id year
#> <chr> <dbl>
#> 1 1a 2022
#> 2 1b 2022
#> 3 1c 2022
#> 4 1d 2022
#> 5 2a 2022
#> 6 2b 2022
#> 7 2c 2022
#> 8 2d 2022
#> 9 35555a 2022
#> 10 35555b 2022
#> 11 35555c 2022
CodePudding user response:
If the suffix doesn't matter, make.unique
does this automatically
library(dplyr)
df %>%
mutate(id = make.unique(as.character(id)))
CodePudding user response:
If the suffix does matter, with paste0
and rowid
:
transform(df,
new_id = paste0(id, c("", letters)[data.table::rowid(id)])
)
# id year new_id
# 1 1 2022 1
# 2 1 2022 1a
# 3 1 2022 1b
# 4 1 2022 1c
# 5 2 2022 2
# 6 2 2022 2a
# 7 2 2022 2b
# 8 2 2022 2c
# 9 35555 2022 35555
# 10 35555 2022 35555a
# 11 35555 2022 35555b
CodePudding user response:
One more:
library(dplyr)
df %>%
add_count(id) %>%
mutate(id = paste0 (id, lag(letters[1:n], default = "")), .keep="unused")
id year
1 1 2022
2 1a 2022
3 1b 2022
4 1c 2022
5 2 2022
6 2a 2022
7 2b 2022
8 2c 2022
9 35555 2022
10 35555a 2022
11 35555b 2022