How to reshape flat data from a SOAP result with R or Bash?

I use SOAP to extract data from the BRENDA enzyme database. After extraction, I get flat data like the following:

ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6

I want to reshape it into the following table:

ecNumber	piValue
3.2.1.23	6.9
3.2.1.23	7.1
4.4.1.14	6

Can I do this with awk, some other Bash command, or R?

CodePudding user response:

In the future, please post your attempted solution first. A question that shows how you tried to solve the problem is better than one that just asks 'how do I do this?'

That being said, this is pretty easy to do in R.

library(tidyverse)

# full string
main <- "ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6"
# split the string by delimiters
split_vec <- str_split(main, pattern = "#|!")

# arrange into tibble
df <- tibble(split_vec) %>%
  unnest(c(split_vec)) %>%
  mutate(col_name = str_extract(string = split_vec, pattern = "ecNumber|piValue"),
         split_vec = gsub(x = split_vec, pattern = "ecNumber|piValue", "")) %>%
  # pair consecutive rows (1-2, 3-4, ...) under one id; this is the grouping variable we need
  mutate(rn = ceiling(row_number() / 2))
df
#> # A tibble: 6 × 3
#>   split_vec col_name    rn
#>   <chr>     <chr>    <dbl>
#> 1 3.2.1.23  ecNumber     1
#> 2 6.9       piValue      1
#> 3 3.2.1.23  ecNumber     2
#> 4 7.1       piValue      2
#> 5 4.4.1.14  ecNumber     3
#> 6 6         piValue      3
  
# final answer
df2 <- df %>% 
  # spread the columns wider to get the dataframe into your specifications
  pivot_wider(id_cols = rn, 
              names_from = col_name, 
              values_from = split_vec) %>%
  dplyr::select(-rn)
df2
#> # A tibble: 3 × 2
#>   ecNumber piValue
#>   <chr>    <chr>  
#> 1 3.2.1.23 6.9    
#> 2 3.2.1.23 7.1    
#> 3 4.4.1.14 6

Created on 2022-04-15 by the reprex package (v2.0.1)

CodePudding user response:

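In base R, we can use read.dcf after rewriting the string into "name: value" lines, with one field per line (\n) and a blank line (\n\n) between records: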

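# innermost gsub: insert ": " between each field name and its value
# middle gsub: "!" (record separator) becomes a blank line
# outer gsub: "#" (field separator) becomes a newline
str2 <- gsub("#", "\n", gsub("!", "\n\n", gsub("([a-z])([0-9])", "\\1: \\2", str1)))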
read.dcf(textConnection(str2), all = TRUE)
  ecNumber piValue
1 3.2.1.23     6.9
2 3.2.1.23     7.1
3 4.4.1.14       6

Data (define str1 before running the code above):

str1 <- "ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6"
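
Since the question also asks about awk and Bash, here is a minimal sketch of that route. It assumes every record holds exactly two fields in the order ecNumber, then piValue, with "!" between records and "#" between fields, as in the sample string. The function name reshape is made up for illustration.

```shell
#!/usr/bin/env bash
# Sample string copied from the question.
input='ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6'

# reshape: hypothetical helper name, not part of any tool.
# Assumes each record is "ecNumber<value>#piValue<value>", in that order.
reshape() {
  printf '%s\n' "$1" |
    tr '!' '\n' |                      # one record per line
    awk -F'#' -v OFS='\t' '
      BEGIN { print "ecNumber", "piValue" }   # header row
      NF == 2 {                               # skip empty or malformed lines
        sub(/^ecNumber/, "", $1)              # strip the field names,
        sub(/^piValue/, "", $2)               # keeping only the values
        print $1, $2
      }'
}

reshape "$input"
```

The output is tab-separated, so it can be redirected to a .tsv file or piped into other tools. Records with missing or reordered fields would need extra handling, for example matching on the field name instead of the position.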