Separating into separate columns based on 2 delimiters-CodePudding

I'm an R beginner and have a basic question I can't seem to figure out. I have row values that need to be separated into different columns, but there are more than one delimiter I am trying to use. The Expression_level column contains an ensembl gene ID with its corresponding value, in the form ensembl:exp value, but there are sometimes 2 ensembl IDs in the same row separated by ;. I want to have a column for ensembl and for gene expression value, but not sure how to separate while keeping them mapped to the correct ID/expression value. This is the type of data I am working with: rna_seq and this is what I am trying to get out: org_rna. TYIA

rna_seq= cbind("Final_gene" = c("KLHL15", "CPXCR1", "MAP7D3", "WDR78"), "Expression_level" = c("1.62760683812965:ENSG00000174010", "-9.96578428466209:ENSG00000147183",
                                                                                           "-4.32192809488736:ENSG00000129680", "-1.39592867633114:ENSG00000152763;-9.96578428466209:ENSG00000231080"))

org_rna = cbind("Final_gene" = c("KLHL15", "CPXCR1", "MAP7D3", "WDR78", "WDR78"), "Ensembl" = c("ENSG00000174010", "ENSG00000147183", "ENSG00000129680", "ENSG00000152763", "ENSG00000231080")
            , "Expression" = c("1.62760683812965", "-9.96578428466209",  "-4.32192809488736", "-1.39592867633114", "-9.96578428466209"))

CodePudding user response：

library(tidyr)
library(dplyr)
rna_seq %>% 
    as.data.frame() %>% 
    
    # separate cells containing multiple values into
    # multiple rows
    separate_rows(Expression_level, sep = ";") %>% 
    
    # extract pairs
    extract(col = Expression_level, 
            into = c("Expression", "Ensembl"), 
            regex = "(.*):(.*)")

# A tibble: 5 x 3
# Final_gene Expression        Ensembl        
# <chr>      <chr>             <chr>          
# KLHL15     1.62760683812965  ENSG00000174010
# CPXCR1     -9.96578428466209 ENSG00000147183
# MAP7D3     -4.32192809488736 ENSG00000129680
# WDR78      -1.39592867633114 ENSG00000152763
# WDR78      -9.96578428466209 ENSG00000231080

CodePudding user response：

Another (less elegant) solution using separate():

library(tidyr)
library(dplyr)

rna_seq |> as.data.frame() |>
  # Separate any second IDs
  separate(Expression_level, sep = ";", into = c("ID1", "ID2")) |>
  # Reshape to longer (columns to rows)
  pivot_longer(cols = starts_with("ID")) |> 
  # Separate Expression from Ensembl
  separate(value, sep = ":", into = c("Expression", "Ensembl")) |>
  filter(!is.na(Expression)) |> 
  select(Final_gene, Ensembl, Expression)