Home > Enterprise >  Iterate through a column for duplicates, append a number to duplicate column value in R
Iterate through a column for duplicates, append a number to duplicate column value in R

Time:06-21

I have a large data frame with a column of 78,000 names. I want to go through the column, compare the names in one row to another, and if there are duplicates, I want to append a number to all duplicate values, starting with 1,2,3, etc.

df <- data.frame("Promoter"=c("Xkr4_promoter", 
                              "Gm18956_promoter", 
                              "Gm19938_promoter", 
                              "Gm37381_promoter", "Gm37381_promoter", 
                              "Rp1_promoter", "Rp1_promoter", 
                              "Gm6101_promoter", 
                              "Gm37483_promoter",  
                              "Sox17_promoter", "Sox17_promoter", "Sox17_promoter", "Sox17_promoter")) 

Starting data:

Promoter
Xkr4_promoter
Gm18956_promoter
Gm19938_promoter
Gm37381_promoter
Gm37381_promoter
Rp1_promoter
Rp1_promoter
Gm6101_promoter
Gm37483_promoter
Sox17_promoter
Sox17_promoter
Sox17_promoter
Sox17_promoter

Expected output data:

Promoter
Xkr4_promoter
Gm18956_promoter
Gm19938_promoter
Gm37381_promoter_1
Gm37381_promoter_2
Rp1_promoter_1
Rp1_promoter_2
Gm6101_promoter
Gm37483_promoter
Sox17_promoter_1
Sox17_promoter_2
Sox17_promoter_3
Sox17_promoter_4

I have tried multiple variations of for loops, where I tried to compare the values and if they were the same, add to a count and then paste() the count to the value in the column. See below my most recent attempt:

for (i in 1:nrow(df)) {
  count <- 0
  if ((df[i,1] == lag(df[i,1]))) {
  count <- count 1
  df$Promoter <- paste(df[i,1], count, sep=".")
  }
  else {
    count <- 0
  }
}

However I end up with this error that I haven't figured out how to solve: Error in if ((df[i, 1] == lag(df[i, 1]))) { :missing value where TRUE/FALSE needed

I think I might need to use mutate() in dplyr, using lead() or lag() but I haven't quite figured out how to put them into a working order.

Thank you for any help!

CodePudding user response:

R supports vectorised functions, so usually you can get away with not implementing a for-loop.

A possible approach is identifying duplicates with duplicated(), calculating the cumulative sum of this, and then pasting this integer to the original Promoter:

library(dplyr)

df |> 
        mutate(dupl = if_else(duplicated(Promoter), 1, 0)) |> 
        group_by(Promoter) |> 
        mutate(dupl = cumsum(dupl),
               Promoter = paste(Promoter, dupl, sep = "_")) |> 
        select(-dupl)

Output

#> # A tibble: 13 x 1
#> # Groups:   Promoter [13]
#>    Promoter          
#>    <chr>             
#>  1 Xkr4_promoter_0   
#>  2 Gm18956_promoter_0
#>  3 Gm19938_promoter_0
#>  4 Gm37381_promoter_0
#>  5 Gm37381_promoter_1
#>  6 Rp1_promoter_0    
#>  7 Rp1_promoter_1    
#>  8 Gm6101_promoter_0 
#>  9 Gm37483_promoter_0
#> 10 Sox17_promoter_0  
#> 11 Sox17_promoter_1  
#> 12 Sox17_promoter_2  
#> 13 Sox17_promoter_3

Created on 2022-06-21 by the reprex package (v2.0.1)

  • Related