I am trying to separate some rows of a dataframe where commas exist.
This is an example fragment:
name ref alt exon impact score
ID1 A C,T 1,2 l,h 5
ID2 T A 6 h 8
ID3 T G,C,A 2,5,7 l,l,h 10,2,3
ID4 C G 3 m 2
How could I perform the following transformation?
name ref alt exon impact score
ID1 A C 1 l 5
ID1 A T 2 h 5
ID2 T A 6 h 8
ID3 T G 2 l 10
ID3 T C 5 l 2
ID3 T A 7 h 3
ID4 C G 3 m 2
I was trying something as:
df %>%
mutate(alt=strsplit(alt, ",")) %>%
unnest(alt)
However I do not wish to divide the rows just based on one column, but to take into account all the columns where there are more than one value (and to repeat the values where no commas exist).
How can I achieve that?
CodePudding user response:
The function you are looking for is separate_rows
from the package tidyr
.
library(tidyr)
separate_rows(df, everything(), sep = ",")
# A tibble: 7 × 6
name ref alt exon impact score
<chr> <chr> <chr> <chr> <chr> <chr>
1 ID1 A C 1 l 5
2 ID1 A T 2 h 5
3 ID2 T A 6 h 8
4 ID3 T G 2 l 10
5 ID3 T C 5 l 2
6 ID3 T A 7 h 3
7 ID4 C G 3 m 2
Data
df <- structure(list(name = c("ID1", "ID2", "ID3", "ID4"), ref = c("A",
"T", "T", "C"), alt = c("C,T", "A", "G,C,A", "G"), exon = c("1,2",
"6", "2,5,7", "3"), impact = c("l,h", "h", "l,l,h", "m"), score = c("5",
"8", "10,2,3", "2")), class = "data.frame", row.names = c(NA,
-4L))
CodePudding user response:
You can do this with splitstackshape
and zoo
:
library(splitstackshape)
library(zoo)
df <- cSplit(df, 1:ncol(df), "long", sep = ",", type.convert = F)
na.locf(df[rowSums(is.na(df)) != ncol(df),])
Output
name ref alt exon impact score
1: ID1 A C 1 l 5
2: ID1 A T 2 h 5
3: ID2 T A 6 h 8
4: ID3 T G 2 l 10
5: ID3 T C 5 l 2
6: ID3 T A 7 h 3
7: ID4 C G 3 m 2
Data
df <- structure(list(name = c("ID1", "ID2", "ID3", "ID4"), ref = c("A",
"T", "T", "C"), alt = c("C,T", "A", "G,C,A", "G"), exon = c("1,2",
"6", "2,5,7", "3"), impact = c("l,h", "h", "l,l,h", "m"), score = c("5",
"8", "10,2,3", "2")), class = "data.frame", row.names = c(NA,
-4L))