I have a large data frame with a column of 78,000 names. I want to go through the column, compare the names in one row to another, and if there are duplicates, I want to append a number to all duplicate values, starting with 1,2,3, etc.
df <- data.frame("Promoter"=c("Xkr4_promoter",
"Gm18956_promoter",
"Gm19938_promoter",
"Gm37381_promoter", "Gm37381_promoter",
"Rp1_promoter", "Rp1_promoter",
"Gm6101_promoter",
"Gm37483_promoter",
"Sox17_promoter", "Sox17_promoter", "Sox17_promoter", "Sox17_promoter"))
Starting data:
Promoter |
---|
Xkr4_promoter |
Gm18956_promoter |
Gm19938_promoter |
Gm37381_promoter |
Gm37381_promoter |
Rp1_promoter |
Rp1_promoter |
Gm6101_promoter |
Gm37483_promoter |
Sox17_promoter |
Sox17_promoter |
Sox17_promoter |
Sox17_promoter |
Expected output data:
Promoter |
---|
Xkr4_promoter |
Gm18956_promoter |
Gm19938_promoter |
Gm37381_promoter_1 |
Gm37381_promoter_2 |
Rp1_promoter_1 |
Rp1_promoter_2 |
Gm6101_promoter |
Gm37483_promoter |
Sox17_promoter_1 |
Sox17_promoter_2 |
Sox17_promoter_3 |
Sox17_promoter_4 |
I have tried multiple variations of for loops
, where I tried to compare the values and if they were the same, add to a count and then paste()
the count to the value in the column. See below my most recent attempt:
for (i in 1:nrow(df)) {
count <- 0
if ((df[i,1] == lag(df[i,1]))) {
count <- count 1
df$Promoter <- paste(df[i,1], count, sep=".")
}
else {
count <- 0
}
}
However I end up with this error that I haven't figured out how to solve:
Error in if ((df[i, 1] == lag(df[i, 1]))) { :missing value where TRUE/FALSE needed
I think I might need to use mutate()
in dplyr
, using lead()
or lag()
but I haven't quite figured out how to put them into a working order.
Thank you for any help!
CodePudding user response:
R supports vectorised functions, so usually you can get away with not implementing a for-loop.
A possible approach is identifying duplicates with duplicated()
, calculating the cumulative sum of this, and then pasting this integer to the original Promoter
:
library(dplyr)
df |>
mutate(dupl = if_else(duplicated(Promoter), 1, 0)) |>
group_by(Promoter) |>
mutate(dupl = cumsum(dupl),
Promoter = paste(Promoter, dupl, sep = "_")) |>
select(-dupl)
Output
#> # A tibble: 13 x 1
#> # Groups: Promoter [13]
#> Promoter
#> <chr>
#> 1 Xkr4_promoter_0
#> 2 Gm18956_promoter_0
#> 3 Gm19938_promoter_0
#> 4 Gm37381_promoter_0
#> 5 Gm37381_promoter_1
#> 6 Rp1_promoter_0
#> 7 Rp1_promoter_1
#> 8 Gm6101_promoter_0
#> 9 Gm37483_promoter_0
#> 10 Sox17_promoter_0
#> 11 Sox17_promoter_1
#> 12 Sox17_promoter_2
#> 13 Sox17_promoter_3
Created on 2022-06-21 by the reprex package (v2.0.1)