How to rearrange string data in R?

I have a column of descriptive data that I want to split into new columns. Each entry lists genes grouped by confidence level, from low to high. I want one column per confidence level, with each gene placed in the column that matches the level it is assigned in my original data.

For example my data looks like this:

Gene Prioritization

Medium High:CCNL2 C1orf170 PLEKHN1 RP11-54O7.17 HES4 | Low:AL645608.7 AL390719.1 WASH7P CPTP

Medium High:CCNL2 ATAD3A C1orf222 CALML6 TMEM52 | Medium Low:GNB1  RER1 NADK | Low:AL109917.1

Medium High:PRDM16 | High: ACE

I am looking to convert this into something like:

Low              Medium Low         Medium High    High
AL645608.7       GNB1               CCNL2          ACE
AL390719.1       RER1               C1orf170
...

Every gene should fall into the column for the confidence level it is assigned (a gene with multiple confidence levels can appear in multiple columns).

I am not sure where to start. I tried setting up the four confidence levels as groups to use with group_by(), but I'm not sure which functions would collect the genes into the correct columns.

Example input data:

structure(list(`Gene Prioritization` = c("Medium High:CCNL2 C1orf170 PLEKHN1 RP11-54O7.17 HES4 | Low:AL645608.7 AL390719.1 WASH7P CPTP", 
"Medium High:CCNL2 ATAD3A C1orf222 CALML6 TMEM52 | Medium Low:GNB1  RER1 NADK | Low:AL109917.1", 
"Medium High:SKI PEX10 C1orf86 AL590822.1 RP11-181G12.4 RP11-181G12.5 | Medium Low:CALML6 TMEM52 CFAP74", 
"Medium High:TNFRSF14 PEX10 | Medium Low:RER1 | Low:AL391244.1", 
"Medium High:PRDM16 | High: ACE")), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"))

Answer:

Here is one option using the tidyverse (plus rowid() from data.table):

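library(dplyr)
library(tidyr)
library(data.table)  # provides rowid()

# df1 is the example input from the question
df1 %>% 
  # one row per "Level:genes" chunk
  separate_rows(`Gene Prioritization`, sep = "\\s*\\|\\s*") %>% 
  # split the confidence level from its gene list
  separate(`Gene Prioritization`, into = c("Categ", "Genes"),
           sep = ":\\s*") %>%
  # one row per gene; \\s+ avoids empty entries from the double space in "GNB1  RER1"
  separate_rows(Genes, sep = "\\s+") %>% 
  # running index within each level, so pivot_wider() has a unique key per row
  mutate(rn = rowid(Categ)) %>%
  pivot_wider(names_from = Categ, values_from = Genes) %>%
  select(-rn)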
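
If you would rather avoid the extra packages, the same reshaping can be sketched in base R. This is only a rough alternative, not the approach above: the names x, chunks, level, genes, level_order, by_level and wide are made up for illustration, the input is three of the rows from the question, and it assumes every chunk has the form "Level:genes". Unlike the pipeline above, it drops repeated genes within a level with unique() and fixes the column order by hand.

```r
# Three rows copied from the question's example input.
x <- c(
  "Medium High:CCNL2 C1orf170 PLEKHN1 RP11-54O7.17 HES4 | Low:AL645608.7 AL390719.1 WASH7P CPTP",
  "Medium High:CCNL2 ATAD3A C1orf222 CALML6 TMEM52 | Medium Low:GNB1  RER1 NADK | Low:AL109917.1",
  "Medium High:PRDM16 | High: ACE"
)

# Split every row into "Level:genes" chunks at the pipe.
chunks <- trimws(unlist(strsplit(x, "|", fixed = TRUE)))

# Text before the first colon is the level; text after it is the gene list.
level <- trimws(sub(":.*$", "", chunks))
genes <- strsplit(trimws(sub("^[^:]*:", "", chunks)), "\\s+")

# Pool the genes per level (assumed column order), dropping repeats.
level_order <- c("Low", "Medium Low", "Medium High", "High")
by_level <- lapply(level_order, function(lv) {
  unique(as.character(unlist(genes[level == lv])))
})
names(by_level) <- level_order

# Pad the shorter columns with NA so they can sit in one data frame.
n <- max(lengths(by_level))
padded <- lapply(by_level, `length<-`, n)
wide <- do.call(
  data.frame,
  c(padded, list(check.names = FALSE, stringsAsFactors = FALSE))
)

print(wide)
```

check.names = FALSE keeps the column names with spaces ("Medium Low", "Medium High") instead of converting them to Medium.Low and Medium.High. Remove the unique() call if repeated genes should be kept.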