Reshaping dataframe in R for twin data-CodePudding

I have a data frame like the one shown below. Each pair of ID numbers (e.g. 2891, 2892) corresponds to pair of twins.

    ID zyg.x CDsum
1 2891     2     0            
2 2892     2     5            
3 4000     1     0           
4 4001     1     0            
5 4006     2     0
6 4007     2     3

I would like to reshape this data frame to make it look like this... Note that the zyg.x (zygosity) value is the same for each twin in the pair.

           Twin Pair     zyg   CDsumTwin1   CDsumTwin2
1   pair1(2891,2892)       2            0            5
2   pair2(4000,4001)       1            0            0
3   pair3(4006,4007)       2            0            3

Any help would be much appreciated.

CodePudding user response：

Data:

df <- read.table(text = "    ID zyg.x CDsum
1 2891     2     0            
2 2892     2     5            
3 4000     1     0           
4 4001     1     0            
5 4006     2     0
6 4007     2     3")

Arrange by ID and create a variable "twin" to distinguish the two twins in each pair

df<- df %>%
  arrange(ID) %>%
  mutate(twin = rep(c(1, 2),length.out = n()))

df
    ID zyg.x CDsum twin
1 2891     2     0    1
2 2892     2     5    2
3 4000     1     0    1
4 4001     1     0    2
5 4006     2     0    1
6 4007     2     3    2

Split df into two dataframes - for twin1 and twin2

twin1 <- df %>%
  filter(twin == 1) %>%
  select(-twin) %>%
  rename(CDsumTwin1 = CDsum, 
         ID1 = ID)

twin1
   ID1 zyg.x CDsumTwin1
1 2891     2          0
3 4000     1          0
5 4006     2          0

twin2 <- df %>%
  filter(twin == 2) %>%
  select(-twin) %>%
  rename(CDsumTwin2 = CDsum,
         ID2 = ID)

twin2
   ID2 zyg.x CDsumTwin2
2 2892     2          5
4 4001     1          0
6 4007     2          3

cbind, combine and rearrange columns:

twin1 %>% cbind(twin2 %>% select(-zyg.x)) %>%
  mutate(`Twin Pair` = paste0("pair (", ID1, ", ", ID2, ")")) %>%
  select(`Twin Pair`, zyg.x, CDsumTwin1, CDsumTwin2)
    
          Twin Pair zyg.x CDsumTwin1 CDsumTwin2
1 pair (2891, 2892)     2          0          5
3 pair (4000, 4001)     1          0          0
5 pair (4006, 4007)     2          0          3

CodePudding user response：

Here is how we could achieve this with dplyr only:

library(dplyr)
df %>% 
  mutate(rn = ceiling(row_number()/2)) %>% 
  group_by(rn) %>% 
  mutate(Twin_Pair = paste0(ID, collapse = ","),
         Twin_Pair = paste0("pair",rn, "(",Twin_Pair, ")")) %>% 
  mutate(CDsumTwin1 = first(CDsum),
         CDsumTwin2 = last(CDsum), .keep="unused") %>%  
  slice(2) %>% 
  ungroup() %>% 
  select(Twin_Pair, zyg=zyg.x, CDsumTwin1, CDsumTwin2)

output:

  Twin_Pair          zyg CDsumTwin1 CDsumTwin2
  <chr>            <dbl>      <dbl>      <dbl>
1 pair1(2891,2892)     2          0          5
2 pair2(4000,4001)     1          0          0
3 pair3(4006,4007)     2          0          3