Home > Net >  Collapsing data per family
Collapsing data per family

Time:03-28

I have this data set, with values for twins within families:

zyg   fid    x_t1    x_t2     y_t1   y_t2
 1 499474     NA     1      1    NA
 1 499474     NA     NA    NA    NA
 1 499474     NA     NA    NA     1
 1 499474     NA     NA    NA    NA
 1 499540     NA     NA     1    NA
 1 499540     NA     NA    NA    NA
 2 499874     NA     NA    NA    NA
 2 499874     NA     NA     1    NA
 2 499874     NA     NA    NA     1
 2 499874     2      NA    NA     1 
  • How do I collapse the families retaining the phenotype information for x and y, when these are present?

The expected for family 499479 is:

zyg   fid    x_t1    x_t2  y_t1   y_t2
 1 499474     NA     1      1     1

and for family 499874, it should be:

 2 499874     2      NA    1     1 

CodePudding user response:

You can use the following code:

library(dplyr)

df %>%
  group_by(fid) %>%
  summarise_all(~first(na.omit(.)))

Output:

# A tibble: 3 × 6
     fid   zyg  x_t1  x_t2  y_t1  y_t2
   <int> <int> <int> <int> <int> <int>
1 499474     1    NA     1     1     1
2 499540     1    NA    NA     1    NA
3 499874     2     2    NA     1     1

Your data:

df<-structure(list(zyg = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), fid = c(499474L, 499474L, 499474L, 499474L, 499540L, 499540L, 
499874L, 499874L, 499874L, 499874L), x_t1 = c(NA, NA, NA, NA, 
NA, NA, NA, NA, NA, 2L), x_t2 = c(1L, NA, NA, NA, NA, NA, NA, 
NA, NA, NA), y_t1 = c(1L, NA, NA, NA, 1L, NA, NA, 1L, NA, NA), 
    y_t2 = c(NA, NA, 1L, NA, NA, NA, NA, NA, 1L, 1L)), class = "data.frame", row.names = c(NA, 
-10L))

CodePudding user response:

If there are only one non-NA element, per group

library(dplyr)
df1 %>%
 group_by(zyg, fid) %>%
  summarise(across(everything(), ~ .x[complete.cases(.x)][1]), .groups = "drop")

-output

# A tibble: 3 × 6
    zyg    fid  x_t1  x_t2  y_t1  y_t2
  <int>  <int> <int> <int> <int> <int>
1     1 499474    NA     1     1     1
2     1 499540    NA    NA     1    NA
3     2 499874     2    NA     1     1

data

df1 <- structure(list(zyg = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), fid = c(499474L, 499474L, 499474L, 499474L, 499540L, 499540L, 
499874L, 499874L, 499874L, 499874L), x_t1 = c(NA, NA, NA, NA, 
NA, NA, NA, NA, NA, 2L), x_t2 = c(1L, NA, NA, NA, NA, NA, NA, 
NA, NA, NA), y_t1 = c(1L, NA, NA, NA, 1L, NA, NA, 1L, NA, NA), 
    y_t2 = c(NA, NA, 1L, NA, NA, NA, NA, NA, 1L, 1L)), 
class = "data.frame", row.names = c(NA, 
-10L))

CodePudding user response:

Another possible solution:

library(dplyr)

df %>% 
  group_by(zyg, fid) %>% 
  summarise(across(everything(), ~ if (all(is.na(.x))) {NA} else
           {max(.x, na.rm = T)}), .groups = "drop")

#> # A tibble: 3 × 6
#>     zyg    fid  x_t1  x_t2  y_t1  y_t2
#>   <int>  <int> <int> <int> <int> <int>
#> 1     1 499474    NA     1     1     1
#> 2     1 499540    NA    NA     1    NA
#> 3     2 499874     2    NA     1     1

CodePudding user response:

very similiar to the other answers but wanted to give my own solution too.

df %>%
  group_by(zyg,fid) %>% 
  summarise(across(everything(),~sum(.,na.rm=TRUE))
            )

CodePudding user response:

You want to do something what coalesce does by rows for columns:

Here is how you could do it:

libarary(dplyr)

coalesce_by_column <- function(df) {
  return(dplyr::coalesce(!!! as.list(df)))
}

df %>%
  group_by(fid) %>%
  summarise(across(everything(), coalesce_by_column))
     fid   zyg  x_t1  x_t2  y_t1  y_t2
   <int> <int> <int> <int> <int> <int>
1 499474     1    NA     1     1     1
2 499540     1    NA    NA     1    NA
3 499874     2     2    NA     1     1
  • Related