Home > Software engineering >  Replace/modify duplicates in a dataframe
Replace/modify duplicates in a dataframe

Time:10-14

Based on the data and code below how can I replace duplicates by adding a, b, c and so on for each duplicate value except the first one?

Please note that in the actual data provided to me, there are thousands of entries, so there could be any number of duplicates thus, it would be hard for me to manually find each and every duplicate value in the data. So, I don't know if it might be a problem in not identifying the duplicates first before replacing them. Maybe there is a way to identify them first, which as of now I don't know.

Code:

# Sample data
df = structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 35555, 35555, 35555
), year = c(2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 
2022, 2022)), class = "data.frame", row.names = c(NA, -11L))

# Desired output
df = structure(list(id = c(1, "1a", "1b", "1c", 2, "2a", "2b", "2c", 35555, "35555a", "35555b"
), year = c(2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 
2022, 2022)), class = "data.frame", row.names = c(NA, -11L))

# Replace/modify duplicates

CodePudding user response:

This will work as long as none of your runs is longer than 27 (blank letters of the alphabet). If you have longer runs, replace letters with a longer array of your choice.

df$id <- unlist(tapply(df$id, df$id, function(id) paste0(id, c("", letters)[1:length(id)])))

 df
       id year
1       1 2022
2      1a 2022
3      1b 2022
4      1c 2022
5       2 2022
6      2a 2022
7      2b 2022
8      2c 2022
9   35555 2022
10 35555a 2022
11 35555b 2022

CodePudding user response:

Here's an option:

library(dplyr)
library(data.table)

# Sample data
df = structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 35555, 35555, 35555
), year = c(2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 
            2022, 2022)), class = "data.frame", row.names = c(NA, -11L))

df %>% 
  group_by(id, year) %>% 
  mutate(tmp = data.table::rleid(row_number()),
         id = paste0(id, letters[tmp])) %>% 
  select(-tmp)
  
#> # A tibble: 11 × 2
#> # Groups:   id, year [11]
#>    id      year
#>    <chr>  <dbl>
#>  1 1a      2022
#>  2 1b      2022
#>  3 1c      2022
#>  4 1d      2022
#>  5 2a      2022
#>  6 2b      2022
#>  7 2c      2022
#>  8 2d      2022
#>  9 35555a  2022
#> 10 35555b  2022
#> 11 35555c  2022

CodePudding user response:

If the suffix doesn't matter, make.unique does this automatically

library(dplyr)
df %>% 
   mutate(id = make.unique(as.character(id)))

CodePudding user response:

If the suffix does matter, with paste0 and rowid:

transform(df,
          new_id = paste0(id, c("", letters)[data.table::rowid(id)])
          )

#       id year new_id
# 1      1 2022      1
# 2      1 2022     1a
# 3      1 2022     1b
# 4      1 2022     1c
# 5      2 2022      2
# 6      2 2022     2a
# 7      2 2022     2b
# 8      2 2022     2c
# 9  35555 2022  35555
# 10 35555 2022 35555a
# 11 35555 2022 35555b

CodePudding user response:

One more:

library(dplyr)

df %>% 
  add_count(id) %>% 
  mutate(id = paste0 (id, lag(letters[1:n], default = "")), .keep="unused")
       id year
1       1 2022
2      1a 2022
3      1b 2022
4      1c 2022
5       2 2022
6      2a 2022
7      2b 2022
8      2c 2022
9   35555 2022
10 35555a 2022
11 35555b 2022
  • Related