Home > Mobile >  "Marking" Duplicates in R
"Marking" Duplicates in R

Time:09-16

I am working with the R programming language. Suppose I have the following data:

 Data_I_Have <- data.frame(
        
        "Person" = c("John", "John", "John", "Peter", "Peter", "Peter", "Tim", "Kevin", "Adam", "Adam", "Xavier"),
        "Number_of_Kids" = c("4", "1", "1", "5", "2", "3", "7", "0", "3", "3", "5")
     
    )

  Person Number_of_Kids
1    John              4
2    John              1
3    John              1
4   Peter              5
5   Peter              2
6   Peter              3
7     Tim              7
8   Kevin              0
9    Adam              3
10   Adam              3
11 Xavier              5

Is it possible to "mark" each duplicate name so that it looks like the file below (e.g. John_1, John_2, etc.)?

Data_I_Want <- data.frame(
    
    "Person" = c("John_1", "John_2", "John_3", "Peter_1", "Peter_2", "Peter_3", "Tim", "Kevin", "Adam_1", "Adam_2", "Xavier"),
   "Number_of_Kids" = c("4", "1", "1", "5", "2", "3", "7", "0", "3", "3", "5")
 
)

   Person Number_of_Kids
1   John_1              4
2   John_2              1
3   John_3              1
4  Peter_1              5
5  Peter_2              2
6  Peter_3              3
7      Tim              7
8    Kevin              0
9   Adam_1              3
10  Adam_2              3
11  Xavier              5

Using this previous question Add specific characters to duplicated strings, I tried to follow the method used there:

Data_I_Want <-  make.unique(Data_I_Have, sep = '_')

But this gives me the following error:

Error in make.unique(Data_I_Have, sep = "_") : 
  'names' must be a character vector

Can someone please show me how to fix this?

Thanks!

CodePudding user response:

make.unique expects a vector and not a data.frame and the output by default will append 1, 2, 3 with . (as sep only from the duplicate values and not from the start. i.e.

> make.unique(Data_I_Have$Person)
 [1] "John"    "John.1"  "John.2"  "Peter"   "Peter.1" "Peter.2" "Tim"     "Kevin"   "Adam"    "Adam.1"  "Xavier" 

If we want to get the desired output, group by 'Person', and concatenate the row_number() with the group column and then ungroup() it.

library(dplyr)
library(stringr)
Data_I_Have %>%
    group_by(Person) %>% 
    mutate(Person = case_when(n() > 1 ~
        str_c(Person, "_", row_number()), TRUE ~ Person)) %>% 
    ungroup()

-output

# A tibble: 11 x 2
   Person  Number_of_Kids
   <chr>   <chr>         
 1 John_1  4             
 2 John_2  1             
 3 John_3  1             
 4 Peter_1 5             
 5 Peter_2 2             
 6 Peter_3 3             
 7 Tim     7             
 8 Kevin   0             
 9 Adam_1  3             
10 Adam_2  3             
11 Xavier  5     

CodePudding user response:

Here is an alternative way (update, corrected version):

library(dplyr)
Data_I_Have %>% 
    group_by(Person) %>% 
    mutate(id = row_number(),
           n = n(),
           Person = ifelse(n >1, paste(Person, id, sep="_"), Person)) %>% 
    select(-id, -n)
   Person  Number_of_Kids
   <chr>   <chr>         
 1 John_1  4             
 2 John_2  1             
 3 John_3  1             
 4 Peter_1 5             
 5 Peter_2 2             
 6 Peter_3 3             
 7 Tim     7             
 8 Kevin   0             
 9 Adam_1  3             
10 Adam_2  3             
11 Xavier  5    

CodePudding user response:

A tidyverse option using cumsum().

library(dplyr)

Data_I_Have %>%
  group_by(Person) %>%
  mutate(cnt = 1,
         Person = str_c(Person, cumsum(cnt), sep = '_')) %>%
  ungroup() %>%
  select(-cnt)

# # A tibble: 11 x 2
#   Person    Number_of_Kids
#   <chr>     <chr>         
# 1 John_1    4             
# 2 John_2    1             
# 3 John_3    1             
# 4 Peter_1   5             
# 5 Peter_2   2             
# 6 Peter_3   3             
# 7 Tim_1     7             
# 8 Kevin_1   0             
# 9 Adam_1    3             
# 10 Adam_2   3             
# 11 Xavier_1 5  

CodePudding user response:

Using data.table, where .N is a symbol for the number of rows in the group.

Data_I_Have <- data.frame(
  "Person" = c("John", "John", "John", "Peter", "Peter", "Peter", "Tim", "Kevin", "Adam", "Adam", "Xavier"),
  "Number_of_Kids" = c("4", "1", "1", "5", "2", "3", "7", "0", "3", "3", "5")
)

library(data.table)
setDT(Data_I_Have)

Data_I_Have[, Person := if (.N == 1) Person else paste0(Person, "_", seq(.N)),
            by = Person]

Data_I_Have
#>      Person Number_of_Kids
#>  1:  John_1              4
#>  2:  John_2              1
#>  3:  John_3              1
#>  4: Peter_1              5
#>  5: Peter_2              2
#>  6: Peter_3              3
#>  7:     Tim              7
#>  8:   Kevin              0
#>  9:  Adam_1              3
#> 10:  Adam_2              3
#> 11:  Xavier              5

Created on 2021-09-15 by the reprex package (v2.0.1)

  • Related