Home > Blockchain >  Merge/combine rows with same ID and Date in R
Merge/combine rows with same ID and Date in R

Time:11-01

I have an excel database like below. The Excel database had option to enter only 3 drug details. Wherever there are more than 3 drugs, it has been entered into another row with PID and Date.

Is there a way I can merge the rows in R so that each patient's records will be in a single row? In the example below, I need to merge Row 1 & 2 and 4 & 6.

Thanks.

Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
1 11A 25/10/2021 RPG 12 NAT 34 QRT 5 45 PMk
2 11A 25/10/2021 BET 10 SET 43 BLT 45
3 12B 20/10/2021 ATY 13 LTP 3 CRT 3 56 GTL
4 13A 22/10/2021 GGS 7 GSF 12 ERE 45 45 RKS
5 13A 26/10/2021 BRT 9 ARR 4 GSF 34 46 GLO
6 13A 22/10/2021 DFS 5
7 14B 04/08/2021 GDS 2 TRE 55 HHS 34 25 MTK

CodePudding user response:

Up front, the two methods below are completely different, not equivalents in "base R vs dplyr". I'm sure either can be translated to the other.

dplyr

The premise here is to first reshape/pivot the data longer so that each Drug/Dose is on its own line, renumber them appropriately, and then bring it back to a wide state.

NOTE: frankly, I usually prefer to deal with data in a long format, so consider keeping it in its state immediately before pivot_wider. This means you'd need to bring Age and Place back into it somehow.

Why? A long format deals very well with many types of aggregation; ggplot2 really really prefers data in the long format; I dislike seeing and having to deal with all of the NA/empty values that will invariably happen with this wide format, since many PIDs don't have (e.g.) Drug6 or later. This seems subjective, but it can really be an objective change/improvement to data-mangling, depending on your workflow.

library(dplyr)
# library(tidyr) # pivot_longer, pivot_wider
dat0 <- select(dat, PID, Date, Age, Place) %>%
  group_by(PID, Date) %>%
  summarize(across(everything(), ~ .[!is.na(.) & nzchar(trimws(.))][1] ))
dat %>%
  select(-Age, -Place) %>%
  tidyr::pivot_longer(
    -c(Row, PID, Date),
    names_to = c(".value", "iter"),
    names_pattern = "^([^0-9] )([123]?)$") %>%
  arrange(Row, iter) %>%
  group_by(PID, Date) %>%
  mutate(iter = row_number()) %>%
  select(-Row) %>%
  tidyr::pivot_wider(
    c("PID", "Date"), names_sep = "",
    names_from = "iter", values_from = c("Drug", "Dose")) %>%
  left_join(dat0, by = c("PID", "Date"))
# # A tibble: 5 x 16
# # Groups:   PID, Date [5]
#   PID   Date       Drug1 Drug2 Drug3 Drug4 Drug5 Drug6 Dose1 Dose2 Dose3 Dose4 Dose5 Dose6   Age Place
#   <chr> <chr>      <chr> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <chr>
# 1 11A   25/10/2021 RPG   NAT   QRT   BET   "SET" "BLT"    12    34     5    10    43    45    45 PMk  
# 2 12B   20/10/2021 ATY   LTP   CRT   <NA>   <NA>  <NA>    13     3     3    NA    NA    NA    56 GTL  
# 3 13A   22/10/2021 GGS   GSF   ERE   DFS   ""    ""        7    12    45     5    NA    NA    45 RKS  
# 4 13A   26/10/2021 BRT   ARR   GSF   <NA>   <NA>  <NA>     9     4    34    NA    NA    NA    46 GLO  
# 5 14B   04/08/2021 GDS   TRE   HHS   <NA>   <NA>  <NA>     2    55    34    NA    NA    NA    25 MTK  

Notes:

  • I broke out dat0 early, since Age and Place don't really fit into the pivot/renumber/pivot mindset.

base R

Here's a base R method that splits (according to your grouping criteria: PID and Date), finds the Drug/Dose columns that need to be renumbered, renames them, and the merges all of the frames back together.

spl <- split(dat, ave(rep(1L, nrow(dat)), dat[,c("PID", "Date")], FUN = seq_along))
spl
# $`1`
#   Row PID       Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 1   1 11A 25/10/2021   RPG    12   NAT    34   QRT     5  45   PMk
# 3   3 12B 20/10/2021   ATY    13   LTP     3   CRT     3  56   GTL
# 4   4 13A 22/10/2021   GGS     7   GSF    12   ERE    45  45   RKS
# 5   5 13A 26/10/2021   BRT     9   ARR     4   GSF    34  46   GLO
# 7   7 14B 04/08/2021   GDS     2   TRE    55   HHS    34  25   MTK
# $`2`
#   Row PID       Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 2   2 11A 25/10/2021   BET    10   SET    43   BLT    45  NA      
# 6   6 13A 22/10/2021   DFS     5          NA          NA  NA      

nms <- lapply(spl, function(x) grep("^(Drug|Dose)", colnames(x), value = TRUE))
nms <- data.frame(i = rep(names(nms), lengths(nms)), oldnm = unlist(nms))
nms$grp <- gsub("[0-9] $", "", nms$oldnm)
nms$newnm <- paste0(nms$grp, ave(nms$grp, nms$grp, FUN = seq_along))
nms <- split(nms, nms$i)

newspl <- Map(function(x, nm) {
  colnames(x)[ match(nm$oldnm, colnames(x)) ] <- nm$newnm
  x
}, spl, nms)
newspl[-1] <- lapply(newspl[-1], function(x) x[, c("PID", "Date", grep("^(Drug|Dose)", colnames(x), value = TRUE)), drop = FALSE ])
newspl
# $`1`
#   Row PID       Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 1   1 11A 25/10/2021   RPG    12   NAT    34   QRT     5  45   PMk
# 3   3 12B 20/10/2021   ATY    13   LTP     3   CRT     3  56   GTL
# 4   4 13A 22/10/2021   GGS     7   GSF    12   ERE    45  45   RKS
# 5   5 13A 26/10/2021   BRT     9   ARR     4   GSF    34  46   GLO
# 7   7 14B 04/08/2021   GDS     2   TRE    55   HHS    34  25   MTK
# $`2`
#   PID       Date Drug4 Dose4 Drug5 Dose5 Drug6 Dose6
# 2 11A 25/10/2021   BET    10   SET    43   BLT    45
# 6 13A 22/10/2021   DFS     5          NA          NA

Reduce(function(a, b) merge(a, b, by = c("PID", "Date"), all = TRUE), newspl)
#   PID       Date Row Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place Drug4 Dose4 Drug5 Dose5 Drug6 Dose6
# 1 11A 25/10/2021   1   RPG    12   NAT    34   QRT     5  45   PMk   BET    10   SET    43   BLT    45
# 2 12B 20/10/2021   3   ATY    13   LTP     3   CRT     3  56   GTL  <NA>    NA  <NA>    NA  <NA>    NA
# 3 13A 22/10/2021   4   GGS     7   GSF    12   ERE    45  45   RKS   DFS     5          NA          NA
# 4 13A 26/10/2021   5   BRT     9   ARR     4   GSF    34  46   GLO  <NA>    NA  <NA>    NA  <NA>    NA
# 5 14B 04/08/2021   7   GDS     2   TRE    55   HHS    34  25   MTK  <NA>    NA  <NA>    NA  <NA>    NA

Notes:

  • The underlying premise of this is that you want to merge the rows onto previous rows. This means (to me) using base::merge or dplyr::full_join; two good links for understanding these concepts, in case you are not aware: dose table(https://i.stack.imgur.com/lc3iN.png)

    Not that elegant as previous ones, but is an idea to see the whole treatment per PID.

  •  Tags:  
  • r
  • Related