I'm an R beginner and have a basic question I can't seem to figure out. I have row values that need to be separated into different columns, but there are more than one delimiter I am trying to use. The Expression_level column contains an ensembl gene ID with its corresponding value, in the form ensembl:exp value, but there are sometimes 2 ensembl IDs in the same row separated by ;. I want to have a column for ensembl and for gene expression value, but not sure how to separate while keeping them mapped to the correct ID/expression value. This is the type of data I am working with: rna_seq and this is what I am trying to get out: org_rna. TYIA
rna_seq= cbind("Final_gene" = c("KLHL15", "CPXCR1", "MAP7D3", "WDR78"), "Expression_level" = c("1.62760683812965:ENSG00000174010", "-9.96578428466209:ENSG00000147183",
"-4.32192809488736:ENSG00000129680", "-1.39592867633114:ENSG00000152763;-9.96578428466209:ENSG00000231080"))
org_rna = cbind("Final_gene" = c("KLHL15", "CPXCR1", "MAP7D3", "WDR78", "WDR78"), "Ensembl" = c("ENSG00000174010", "ENSG00000147183", "ENSG00000129680", "ENSG00000152763", "ENSG00000231080")
, "Expression" = c("1.62760683812965", "-9.96578428466209", "-4.32192809488736", "-1.39592867633114", "-9.96578428466209"))
CodePudding user response:
library(tidyr)
library(dplyr)
rna_seq %>%
as.data.frame() %>%
# separate cells containing multiple values into
# multiple rows
separate_rows(Expression_level, sep = ";") %>%
# extract pairs
extract(col = Expression_level,
into = c("Expression", "Ensembl"),
regex = "(.*):(.*)")
# A tibble: 5 x 3
# Final_gene Expression Ensembl
# <chr> <chr> <chr>
# KLHL15 1.62760683812965 ENSG00000174010
# CPXCR1 -9.96578428466209 ENSG00000147183
# MAP7D3 -4.32192809488736 ENSG00000129680
# WDR78 -1.39592867633114 ENSG00000152763
# WDR78 -9.96578428466209 ENSG00000231080
CodePudding user response:
Another (less elegant) solution using separate()
:
library(tidyr)
library(dplyr)
rna_seq |> as.data.frame() |>
# Separate any second IDs
separate(Expression_level, sep = ";", into = c("ID1", "ID2")) |>
# Reshape to longer (columns to rows)
pivot_longer(cols = starts_with("ID")) |>
# Separate Expression from Ensembl
separate(value, sep = ":", into = c("Expression", "Ensembl")) |>
filter(!is.na(Expression)) |>
select(Final_gene, Ensembl, Expression)