Home > OS >  Split one column into two, retain original value if there aren't two values
Split one column into two, retain original value if there aren't two values

Time:03-18

I have a column with specific gene locations in my data frame. It either has the single location (in case of a SNP) or it has a range for the locations (in case of Deletion, Duplication).

I want to split the Gene_Locations column into two columns: Start_Position, Stop Position. Where it has a range, this is easy enough to do. But where there is only one value, I want it to retain the original value in the Gene_Location column, in both the Start_position and Stop_position columns. The second bit has not been easy to do as it it keeps coming up with NA, or a blank whitespace. I'd be grateful for help with this.

This is my expected column:

GeneLocation Start_Position Stop_Position
123456 123456 123456
123456 - 123457 123456 123457
122345 - 122346 122345 122346
134567 134567 134567
123456 123456 123456

This is my actual column:

GeneLocation Start_Position Stop_Position
123456 123456 NA
123456 - 123457 123456 123457
122345 - 122346 122345 122346
134567 134567 NA
123456 123456 NA
dat2 <- data %>% separate(GeneLocation, c('Start_Position', 'Stop_Position'))

CodePudding user response:

coalesce is perfect for this use case! Here is solution with an ifelse statement.

library(dplyr)
library(tidyr)

df %>% 
  separate(GeneLocation, into = c("Start_Position", "Stop_Position"), sep = " - ", remove = FALSE) %>% 
  mutate(Stop_Position = ifelse(is.na(Stop_Position), Start_Position, Stop_Position))
     GeneLocation Start_Position Stop_Position
1          123456         123456        123456
2 123456 - 123457         123456        123457
3 122345 - 122346         122345        122346
4          134567         134567        134567
5          123456         123456        123456

CodePudding user response:

This is a good use case for dplyr::coalesce, which (akin to the SQL function it's named for) returns the first non-NA element from a set of vectors.

library(dplyr)
library(tidyr)
data %>% 
  separate(GeneLocation, c('Start_Position', 'Stop_Position')) %>%
  mutate(Stop_Position = coalesce(Stop_Position, Start_Position))
  • Related