How can I convert a 2 column (X, Y), pipe-separated table into an X by Y data frame OR into long format?

I spent the last hour trying to reformat a 2-column format into something more usable.

I have the following input (a 2 column data frame / tibble):

Input

TGGGAAGGTTATGTGC-1  CMO305|CMO306|CMO312    3698|3806|12182
TGTTCTACATGACAGG-1  CMO305|CMO306|CMO312    3027|1449|4184
ACTGATGCAGAGTGAC-1  CMO305|CMO307   6802|4715
ATCGTCCGTTACCCAA-1  CMO305|CMO307   5599|7019
ATGCATGTCATGACAC-1  CMO305|CMO307   10872|16729
GTGAGTTAGTCCGCCA-1  CMO305|CMO307   10096|3434

Desired output (A - wide)

	CMO305	CMO306	CMO307	CMO312
TGGGAAGGTTATGTGC-1	3698	3806	0	12182
TGTTCTACATGACAGG-1	3027	1449	0	4184
ACTGATGCAGAGTGAC-1	6802	0	4715	0
ATCGTCCGTTACCCAA-1	5599	0	7019	0
ATGCATGTCATGACAC-1	10872	0	16729	0
GTGAGTTAGTCCGCCA-1	10096	0	3434	0

Desired output (B - long format)

> CMO.umis.long
   feature_call num_umis
   <chr>           <dbl>
 1 CMO304           2168
 2 CMO304          14210
 3 CMO304           7009
 4 CMO304           5931
 5 CMO304           7147
 6 CMO304           1683

I am pretty sure this has been answered already, but I can't seem to find the right search terms.

separate_rows() may be the way, but I cannot get it to split correctly...

Thank you, I appreciate your help!

CodePudding user response：

Assuming the columns are named 'col1', 'col2' and 'col3', use separate_rows() on col2 and col3 with the pipe as separator. sep is interpreted as a regular expression by default, so the metacharacter | has to be escaped (\\|) to be read literally. This gives the long format. For the wide format, reshape the result with pivot_wider() from tidyr.

library(dplyr)
library(tidyr)
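long_df <- df1 %>%
   # row id keeps rows distinct even if a value in col1 repeats
   mutate(rn = row_number()) %>%
   # split col2 and col3 in parallel on a literal |; convert = TRUE makes col3 integer
   separate_rows(c(col2, col3), sep = "\\|", convert = TRUE)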

-output

long_df %>%
   select(col2, col3)
# A tibble: 14 × 2
   col2    col3
   <chr>  <int>
 1 CMO305  3698
 2 CMO306  3806
 3 CMO312 12182
 4 CMO305  3027
 5 CMO306  1449
 6 CMO312  4184
 7 CMO305  6802
 8 CMO307  4715
 9 CMO305  5599
10 CMO307  7019
11 CMO305 10872
12 CMO307 16729
13 CMO305 10096
14 CMO307  3434
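To see why the pipe has to be escaped, here is a small base R sketch. It uses strsplit() rather than separate_rows(), on the assumption that both treat the separator as a regular expression by default; the variable names are only illustrative.

```r
x <- "CMO305|CMO306|CMO312"

# Unescaped: | is the regex alternation operator, so the pattern matches
# the empty string and the text is not split on the pipe character
unescaped <- strsplit(x, "|")[[1]]

# Escaped: \\| matches a literal pipe
escaped <- strsplit(x, "\\|")[[1]]

# fixed = TRUE switches regex interpretation off, so no escape is needed
fixed <- strsplit(x, "|", fixed = TRUE)[[1]]

print(escaped)
print(identical(escaped, fixed))
```

Both the escaped and the fixed variant return the three CMO labels, while the unescaped call does not.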

Or if we need the wide format

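wide_df <- long_df %>%
   # one column per CMO label; combinations that do not occur are filled with 0
   pivot_wider(names_from = col2, values_from = col3, values_fill = 0) %>%
   # drop the helper row id
   select(-rn)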

-output

wide_df
# A tibble: 6 × 5
  col1               CMO305 CMO306 CMO312 CMO307
  <chr>               <int>  <int>  <int>  <int>
1 TGGGAAGGTTATGTGC-1   3698   3806  12182      0
2 TGTTCTACATGACAGG-1   3027   1449   4184      0
3 ACTGATGCAGAGTGAC-1   6802      0      0   4715
4 ATCGTCCGTTACCCAA-1   5599      0      0   7019
5 ATGCATGTCATGACAC-1  10872      0      0  16729
6 GTGAGTTAGTCCGCCA-1  10096      0      0   3434
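In the wide output above the CMO columns appear in order of first occurrence (CMO312 before CMO307), whereas the desired output in the question has them sorted. A possible tweak, assuming a reasonably recent tidyr in which pivot_wider() has the names_sort argument, is sketched below on a two-row copy of the data; wide_sorted is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Two rows in the same shape as df1 from the data section
df1 <- data.frame(
  col1 = c("TGGGAAGGTTATGTGC-1", "ACTGATGCAGAGTGAC-1"),
  col2 = c("CMO305|CMO306|CMO312", "CMO305|CMO307"),
  col3 = c("3698|3806|12182", "6802|4715")
)

wide_sorted <- df1 %>%
  separate_rows(c(col2, col3), sep = "\\|", convert = TRUE) %>%
  # names_sort = TRUE orders the new columns by name instead of first appearance
  pivot_wider(names_from = col2, values_from = col3,
              values_fill = 0, names_sort = TRUE)

names(wide_sorted)
```

If the barcodes should end up as row names, as in the question's sketch of output A, tibble::column_to_rownames("col1") could be appended to the pipeline.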

data

df1 <- structure(list(col1 = c("TGGGAAGGTTATGTGC-1", "TGTTCTACATGACAGG-1", 
"ACTGATGCAGAGTGAC-1", "ATCGTCCGTTACCCAA-1", "ATGCATGTCATGACAC-1", 
"GTGAGTTAGTCCGCCA-1"), col2 = c("CMO305|CMO306|CMO312", "CMO305|CMO306|CMO312", 
"CMO305|CMO307", "CMO305|CMO307", "CMO305|CMO307", "CMO305|CMO307"
), col3 = c("3698|3806|12182", "3027|1449|4184", "6802|4715", 
"5599|7019", "10872|16729", "10096|3434")), 
class = "data.frame", row.names = c(NA, 
-6L))

CodePudding user response：

We could use the cSplit function from the splitstackshape package to separate the rows, and then use pivot_wider as akrun did in his answer:

library(splitstackshape)
library(dplyr)
library(tidyr)

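# split col2 and col3 on | and stack the pieces into rows
df <- cSplit(df1, c("col2", "col3"), "|", direction = "long")

# output 1: long format
df %>% 
  as_tibble() %>% 
  select(col2, col3)

# output 2: wide format, missing combinations filled with 0
df %>% 
  pivot_wider(
    names_from = col2,
    values_from = col3,
    values_fill = 0
  )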

output1:

   col2    col3
   <chr>  <int>
 1 CMO305  3698
 2 CMO306  3806
 3 CMO312 12182
 4 CMO305  3027
 5 CMO306  1449
 6 CMO312  4184
 7 CMO305  6802
 8 CMO307  4715
 9 CMO305  5599
10 CMO307  7019
11 CMO305 10872
12 CMO307 16729
13 CMO305 10096
14 CMO307  3434

output2:

# A tibble: 6 × 5
  col1               CMO305 CMO306 CMO312 CMO307
  <chr>               <int>  <int>  <int>  <int>
1 TGGGAAGGTTATGTGC-1   3698   3806  12182      0
2 TGTTCTACATGACAGG-1   3027   1449   4184      0
3 ACTGATGCAGAGTGAC-1   6802      0      0   4715
4 ATCGTCCGTTACCCAA-1   5599      0      0   7019
5 ATGCATGTCATGACAC-1  10872      0      0  16729
6 GTGAGTTAGTCCGCCA-1  10096      0      0   3434
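If adding packages is not an option, the same two steps can also be sketched in base R. This is only an illustration, not part of either answer, and it assumes that col2 and col3 always contain the same number of pipe-separated entries per row; keys, vals, long and wide are made-up names.

```r
# Two rows in the same shape as df1 from the data section
df1 <- data.frame(
  col1 = c("TGGGAAGGTTATGTGC-1", "ACTGATGCAGAGTGAC-1"),
  col2 = c("CMO305|CMO306|CMO312", "CMO305|CMO307"),
  col3 = c("3698|3806|12182", "6802|4715")
)

# fixed = TRUE treats | as a literal character, so no escaping is needed
keys <- strsplit(df1$col2, "|", fixed = TRUE)
vals <- strsplit(df1$col3, "|", fixed = TRUE)

# Long format: repeat each barcode once per label it carries
long <- data.frame(
  col1 = rep(df1$col1, lengths(keys)),
  col2 = unlist(keys),
  col3 = as.integer(unlist(vals))
)

# Wide format: xtabs() sums col3 per barcode/label pair and puts 0 where a pair is absent
wide <- as.data.frame.matrix(xtabs(col3 ~ col1 + col2, data = long))

print(long)
print(wide)
```

Note that xtabs() orders rows and columns alphabetically rather than by input order, and that it would add up the counts if a barcode/label pair occurred more than once.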