Home > Blockchain >  Create new column with unique ID to handle duplicates from 2 columns in R
Create new column with unique ID to handle duplicates from 2 columns in R

Time:06-24

I have a dataframe consisting of surveys that are done for specific subjects on specific days. Looking something like this:

subject_id day data
01 1 34
02 1 54
03 3 55
04 4 56

However, sometimes the survey is done two times in a day, resulting in duplicate Subject_id and Day values, but different data (in reality the data spans multiple collumns).

subject_id day data
01 1 34
01 1 58
02 1 54
03 3 55
04 4 56

The dataframe is sorted by time, such that the second row represents the second survey on day 1 for subject 1. What I would like to do is to add a column called name that adds "- 2" to every second duplicate subject_id and day and adds nothing (or "- 1" is also acceptable) for non duplicates. In the end the data should like this.

subject_id day data name
01 1 34 1
01 1 58 1 -2
02 1 54 1
03 3 55 3
04 4 56 4

I have tried using dplyr and janitor to mark the duplicates. However after many hours I still failed to make the new column as described.

CodePudding user response:

library(data.table)
library(tidyverse)

df <- fread("subject_id day data
01  1   34
01  1   58
02  1   54
03  3   55
04  4   56") %>% 
  tibble()

df
#> # A tibble: 5 × 3
#>   subject_id   day  data
#>        <int> <int> <int>
#> 1          1     1    34
#> 2          1     1    58
#> 3          2     1    54
#> 4          3     3    55
#> 5          4     4    56

df %>% 
  group_by(subject_id) %>% 
  mutate(name = ifelse(order(day) == 1, 
                       as.character(order(day)), 
                       paste(day, "-", order(day)))
         )
#> # A tibble: 5 × 4
#> # Groups:   subject_id [4]
#>   subject_id   day  data name 
#>        <int> <int> <int> <chr>
#> 1          1     1    34 1    
#> 2          1     1    58 1 - 2
#> 3          2     1    54 1    
#> 4          3     3    55 1    
#> 5          4     4    56 1

Created on 2022-06-23 by the reprex package (v2.0.1)

CodePudding user response:

Try this

df |> group_by(subject_id , day) |> 
   mutate(name = ifelse( duplicated(day) , paste0(day , "-" , cumsum(duplicated(day)) 1)
 , as.character(day)))

CodePudding user response:

If survey are your data as data.frame

duplicatedresults<-duplicated(surveys$subject_id)
surveys$name[duplicatedresults]<-paste0(surveys$subject_id[duplicatedresults],"-2")

CodePudding user response:

df %>%
  group_by(day) %>%
  mutate(name = janitor::make_clean_names(subject_id, use_make_names = FALSE))


# A tibble: 5 x 4
# Groups:   day [3]
  subject_id   day  data name 
       <int> <int> <int> <chr>
1          1     1    34 1    
2          1     1    58 1_2  
3          2     1    54 2    
4          3     3    55 3    
5          4     4    56 4 
  • Related