How to order and mark duplicated rows at the same time

Time:02-28

I want to create a new variable that marks which rows of my data are duplicates, treating the oldest datapoint as the "original". My dataframe is not ordered by date, but by ID.

ID      Name        Number         Datetime (dd/mm/yyy/hh/MM)
1       ace         114            15.03.2019 15:26
2       bert        197            18.03.2019 07:28
3       vance       245            16.03.2019 14:03
4       chad        116            17.03.2019 02:02
5       chad        116            18.03.2019 18:23
6       ace         114            12.03.2019 23:15

Ordering the dataframe works and marking the duplicated rows also works, but not in combination, so the originals are not always the first occurrence. Even if I order the dataframe before marking the representations, the dataframe seems to be unordered again for the next command, and linking the two commands with %>% does not work.

df %>% arrange(Datetime) 
df$representations <- if_else(duplicated(df$number, .keep_all =TRUE), 1, 0)

df$represntations <- df %>%
   arrange(Datetime) %>%
   if_else(duplicated(df$number, .keep_all =TRUE), 1, 0)

How can I make sure that the original is always the earliest datapoint for each number (like this)?

ID      Name        Number         Datetime (dd/mm/yyy/hh/MM)  representation
1       ace         114            15.03.2019 15:26            1
2       bert        197            18.03.2019 07:28            0
3       vance       245            16.03.2019 14:03            0
4       chad        116            17.03.2019 02:02            0
5       chad        116            18.03.2019 18:23            1
6       ace         114            12.03.2019 23:15            0

CodePudding user response:

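Try the code below:

library(dplyr)

df <- df %>%
   arrange(Datetime) %>%   # oldest rows first
   # duplicated() is FALSE for the first occurrence and TRUE for later ones
   mutate(representations = if_else(duplicated(Number), 1, 0)) %>%
   arrange(ID)             # restore the original ID order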
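
One caveat with sorting on Datetime: if the column is still text in dd.mm.yyyy format, arrange() sorts it alphabetically, which only matches chronological order while all rows share the same month and year. The following is a minimal sketch of the same pipeline with the column parsed first. It assumes the sample data from the question, that Datetime is stored as character, and that dplyr is installed; the helper column `Parsed` and the object `result` are names introduced here for illustration.

```r
library(dplyr)

# Sample data from the question; Datetime is assumed to be stored as text.
df <- data.frame(
  ID = 1:6,
  Name = c("ace", "bert", "vance", "chad", "chad", "ace"),
  Number = c(114, 197, 245, 116, 116, 114),
  Datetime = c("15.03.2019 15:26", "18.03.2019 07:28", "16.03.2019 14:03",
               "17.03.2019 02:02", "18.03.2019 18:23", "12.03.2019 23:15"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Parse the text into a real date-time so that sorting is chronological.
  mutate(Parsed = as.POSIXct(Datetime, format = "%d.%m.%Y %H:%M", tz = "UTC")) %>%
  arrange(Parsed) %>%
  # The oldest row per Number gets 0, every later row gets 1.
  mutate(representations = if_else(duplicated(Number), 1, 0)) %>%
  arrange(ID) %>%
  select(-Parsed)

print(result)
```

Note that the piped result is assigned to a name. A bare `df %>% arrange(Datetime)` only prints the sorted copy and leaves `df` unchanged, which would explain why the ordering appeared to be lost in the question's first attempt.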

CodePudding user response:

library(dplyr)
df %>% 
  arrange(`Datetime(dd/mm/yyy/hh/MM)`) %>% 
  mutate(flag = duplicated(Number)*1) %>% 
  arrange(ID)

Output:

     ID Name  Number `Datetime(dd/mm/yyy/hh/MM)`  flag
1     1 ace      114 15.03.2019                      1
2     2 bert     197 18.03.2019                      0
3     3 vance    245 16.03.2019                      0
4     4 chad     116 17.03.2019                      0
5     5 chad     116 18.03.2019                      1
6     6 ace      114 12.03.2019                      0
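
If sorting the data frame twice is undesirable, the same flag can probably be computed without moving any rows by using order() as an index. This is a sketch in base R rather than the approach used in this response; it assumes the question's sample data with Datetime stored as text, and `when` and `ord` are names introduced here for illustration.

```r
# Sample data from the question (Name column omitted for brevity).
df <- data.frame(
  ID = 1:6,
  Number = c(114, 197, 245, 116, 116, 114),
  Datetime = c("15.03.2019 15:26", "18.03.2019 07:28", "16.03.2019 14:03",
               "17.03.2019 02:02", "18.03.2019 18:23", "12.03.2019 23:15"),
  stringsAsFactors = FALSE
)

# Parse the text so that ordering is chronological, not alphabetical.
when <- as.POSIXct(df$Datetime, format = "%d.%m.%Y %H:%M", tz = "UTC")

# Row positions from oldest to newest.
ord <- order(when)

# Check for duplicates in chronological order, then write each result
# back to the row it came from; the data frame itself is never re-sorted.
df$flag <- 0
df$flag[ord] <- as.numeric(duplicated(df$Number[ord]))

print(df)
```

Because the flags are written back through `ord`, the rows stay in ID order and no final arrange(ID) is needed.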

CodePudding user response:

I ended up using the code below, and the sample I checked was correct, thank you! (My first version used format = "%d.%m.%y", which made as.Date read the year as 2020 instead of 2019, although the order was still correct; the four-digit %Y fixes that.)

# split date and time so that as.Date can be used
emerge$date <- as.Date(sapply(strsplit(as.character(emerge$Falleinzeitdatum.Notfall), " "), "[", 1), format = "%d.%m.%Y")

# arrange as proposed
emerge <- emerge %>% 
  arrange(date) %>%
  mutate(re = if_else(duplicated(Patientennummer), 1, 0))
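
The year shift can be reproduced in isolation: `%y` expects a two-digit year, so it reads only "20" from "2019" and as.Date ignores the trailing characters. Sorting by date alone also leaves several rows from the same day in whatever order they already had, so keeping the time of day may be safer. The snippet below is a small sketch with made-up example strings in the question's format; `wrong`, `right`, `stamps` and `parsed` are names introduced here.

```r
# Two-digit %y reads "20" and ignores the rest; four-digit %Y reads "2019".
wrong <- as.Date("15.03.2019", format = "%d.%m.%y")
right <- as.Date("15.03.2019", format = "%d.%m.%Y")
print(wrong)
print(right)

# Two visits on the same day: a date-only column cannot tell them apart,
# while a full date-time puts the morning visit first.
stamps <- c("18.03.2019 18:23", "18.03.2019 07:28")
parsed <- as.POSIXct(stamps, format = "%d.%m.%Y %H:%M", tz = "UTC")
print(order(parsed))
```

With a parsed date-time column of this kind, arrange() can sort on it directly and the strsplit() step is no longer needed.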