Divide rows by more than one columns containing commas-CodePudding

I am trying to separate some rows of a dataframe where commas exist.

This is an example fragment:

name  ref   alt    exon   impact score
ID1    A    C,T    1,2     l,h     5
ID2    T    A       6      h       8
ID3    T    G,C,A  2,5,7   l,l,h   10,2,3
ID4    C    G       3      m       2

How could I perform the following transformation?

name   ref   alt   exon   impact  score
ID1    A     C      1       l      5
ID1    A     T      2       h      5
ID2    T     A      6       h      8
ID3    T     G      2       l      10
ID3    T     C      5       l      2
ID3    T     A      7       h      3
ID4    C     G      3       m      2

I was trying something as:

df %>% 
    mutate(alt=strsplit(alt, ",")) %>% 
    unnest(alt)

However I do not wish to divide the rows just based on one column, but to take into account all the columns where there are more than one value (and to repeat the values where no commas exist).

How can I achieve that?

CodePudding user response：

The function you are looking for is separate_rows from the package tidyr.

library(tidyr)

separate_rows(df, everything(), sep = ",")

# A tibble: 7 × 6
  name  ref   alt   exon  impact score
  <chr> <chr> <chr> <chr> <chr>  <chr>
1 ID1   A     C     1     l      5    
2 ID1   A     T     2     h      5    
3 ID2   T     A     6     h      8    
4 ID3   T     G     2     l      10   
5 ID3   T     C     5     l      2    
6 ID3   T     A     7     h      3    
7 ID4   C     G     3     m      2

Data

df <- structure(list(name = c("ID1", "ID2", "ID3", "ID4"), ref = c("A", 
"T", "T", "C"), alt = c("C,T", "A", "G,C,A", "G"), exon = c("1,2", 
"6", "2,5,7", "3"), impact = c("l,h", "h", "l,l,h", "m"), score = c("5", 
"8", "10,2,3", "2")), class = "data.frame", row.names = c(NA, 
-4L))

CodePudding user response：

You can do this with splitstackshape and zoo:

library(splitstackshape)
library(zoo)

df <- cSplit(df, 1:ncol(df), "long", sep = ",", type.convert = F)
na.locf(df[rowSums(is.na(df)) != ncol(df),])

Output

   name ref alt exon impact score
1:  ID1   A   C    1      l     5
2:  ID1   A   T    2      h     5
3:  ID2   T   A    6      h     8
4:  ID3   T   G    2      l    10
5:  ID3   T   C    5      l     2
6:  ID3   T   A    7      h     3
7:  ID4   C   G    3      m     2

Data

df <- structure(list(name = c("ID1", "ID2", "ID3", "ID4"), ref = c("A", 
"T", "T", "C"), alt = c("C,T", "A", "G,C,A", "G"), exon = c("1,2", 
"6", "2,5,7", "3"), impact = c("l,h", "h", "l,l,h", "m"), score = c("5", 
"8", "10,2,3", "2")), class = "data.frame", row.names = c(NA, 
-4L))