I have a column with specific gene locations in my data frame. It either has the single location (in case of a SNP) or it has a range for the locations (in case of Deletion, Duplication).
I want to split the Gene_Locations column into two columns: Start_Position, Stop Position. Where it has a range, this is easy enough to do. But where there is only one value, I want it to retain the original value in the Gene_Location column, in both the Start_position and Stop_position columns. The second bit has not been easy to do as it it keeps coming up with NA, or a blank whitespace. I'd be grateful for help with this.
This is my expected column:
GeneLocation | Start_Position | Stop_Position |
---|---|---|
123456 | 123456 | 123456 |
123456 - 123457 | 123456 | 123457 |
122345 - 122346 | 122345 | 122346 |
134567 | 134567 | 134567 |
123456 | 123456 | 123456 |
This is my actual column:
GeneLocation | Start_Position | Stop_Position |
---|---|---|
123456 | 123456 | NA |
123456 - 123457 | 123456 | 123457 |
122345 - 122346 | 122345 | 122346 |
134567 | 134567 | NA |
123456 | 123456 | NA |
dat2 <- data %>% separate(GeneLocation, c('Start_Position', 'Stop_Position'))
CodePudding user response:
coalesce
is perfect for this use case!
Here is solution with an ifelse
statement.
library(dplyr)
library(tidyr)
df %>%
separate(GeneLocation, into = c("Start_Position", "Stop_Position"), sep = " - ", remove = FALSE) %>%
mutate(Stop_Position = ifelse(is.na(Stop_Position), Start_Position, Stop_Position))
GeneLocation Start_Position Stop_Position
1 123456 123456 123456
2 123456 - 123457 123456 123457
3 122345 - 122346 122345 122346
4 134567 134567 134567
5 123456 123456 123456
CodePudding user response:
This is a good use case for dplyr::coalesce
, which (akin to the SQL function it's named for) returns the first non-NA element from a set of vectors.
library(dplyr)
library(tidyr)
data %>%
separate(GeneLocation, c('Start_Position', 'Stop_Position')) %>%
mutate(Stop_Position = coalesce(Stop_Position, Start_Position))