How to find duplicate dates within a row in R, and then replace associated values with the mean?-CodePudding

There are some similar questions, however I haven't been able to find the solution for my data:

ID <- c(27,46,72)
Gest1 <- c(27,28,29)
Sys1 <- c(120,123,124)
Dia1 <- c(90,89,92)
Gest2 <- c(29,28,30)
Sys2 <- c(122,130,114)
Dia2 <- c(89,78,80)
Gest3 <- c(32,29,30)
Sys3 <- c(123,122,124)
Dia3 <- c(90,88,89)
Gest4 <- c(33,30,32)
Sys4 <- c(124,123,128)
Dia4 <- c(94,89,80)

df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3,
                   Dia3,Gest4,Sys4,Dia4)

df.1

What I need to do is identify where there are any cases of gestational age duplicates (variables beginning with Gest), and then find the mean of the associated Sys and Dia variables. Once the mean has been calculated, I need to replace the duplicates with just 1 Gest variable, and the mean of the Sys variable and the mean of the Dia variable. Everything after those duplicates should then be moved up the dataframe.

Here is what it should look like:

df.2

My real data has 25 Gest variables with 25 associated Sys variables and 25 association Dia variables.

Sorry if this is confusing! I've tried to write an ok question but it is my first time using stack overflow.

Thank you!!

CodePudding user response：

This is easier to manage in long (and tidy) format.

Using tidyverse, you can use pivot_longer to put into long form. After grouping by ID and Gest you can substitute Sys and Dia values with the mean. If there are more than one Gest for a given ID it will then use the average.

Then, you can keep that row of data with slice. After grouping by ID, you can renumber after combining those with common Gest values.

library(tidyverse)

df.1 %>%
  pivot_longer(cols = -ID, names_to = c(".value", "number"), names_pattern = "(\\w )(\\d )") %>%
  group_by(ID, Gest) %>%
  mutate(across(c(Sys, Dia), mean)) %>%
  slice(1) %>%
  group_by(ID) %>%
  mutate(number = row_number())

Output

      ID number  Gest   Sys   Dia
   <dbl>  <int> <dbl> <dbl> <dbl>
 1    27      1    27  120   90  
 2    27      2    29  122   89  
 3    27      3    32  123   90  
 4    27      4    33  124   94  
 5    46      1    28  126.  83.5
 6    46      2    29  122   88  
 7    46      3    30  123   89  
 8    72      1    29  124   92  
 9    72      2    30  119   84.5
10    72      3    32  128   80

Note - I would keep in long form - but if you wanted wide again, you can add:

pivot_wider(id_cols = ID, names_from = number, values_from = c(Gest, Sys, Dia))

CodePudding user response：

This involved change the structure of the table into the long format, averaging the duplicates and then reformatting back into the desired table:

library(tidyr)
library(dplyr)
df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3, Dia3,Gest4,Sys4,Dia4)

#convert data to long format
longdf <- df.1 %>% pivot_longer(!ID, names_to = c(".value", "time"), names_pattern = "(\\D )(\\d)", values_to="count")

#average duplicate rows
temp<-longdf %>% group_by(ID, Gest) %>% summarize(Sys=mean(Sys), Dia=mean(Dia)) %>%  mutate(time = row_number())

#convert back to wide format
answer<-temp %>% pivot_wider(ID, names_from = time, values_from = c("Gest", "Sys", "Dia"), names_glue = "{.value}{time}")
#resort the columns
answer <-answer[ , names(df.1)]
answer
# A tibble: 3 × 13
# Groups:   ID [3]
       ID Gest1  Sys1  Dia1 Gest2  Sys2  Dia2 Gest3  Sys3  Dia3 Gest4  Sys4  Dia4
     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
   1    27    27  120   90      29   122  89      32   123    90    33   124    94
   2    46    28  126.  83.5    29   122  88      30   123    89    NA    NA    NA
   3    72    29  124   92      30   119  84.5    32   128    80    NA    NA    NA