I use SOAP to extract data from the BRENDA enzyme database. After extraction I get the following flat string:
ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6
I want to reshape the data into the following table:
ecNumber | piValue |
---|---|
3.2.1.23 | 6.9 |
3.2.1.23 | 7.1 |
4.4.1.14 | 6 |
Can I do that with awk or some other bash command? Or in R?
CodePudding user response:
In the future, please post your attempted solutions first. A question that shows how you've tried to solve the problem is better than one that just asks 'how do I do this?'
That being said, this is pretty easy to do in R.
library(tidyverse)
# full string
main = "ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6"
# split the string by delimiters
split_vec <- str_split(main, pattern = "#|!")
# arrange into tibble
df <- tibble(split_vec) %>%
  unnest(c(split_vec)) %>%
  mutate(col_name = str_extract(string = split_vec, pattern = "ecNumber|piValue"),
         split_vec = gsub(x = split_vec, pattern = "ecNumber|piValue", "")) %>%
  # trick to make sure that rows 1,2 and 3,4 etc. get labeled together -> this is our needed 'grouper' variable
  mutate(rn = ceiling(row_number() / 2))
df
#> # A tibble: 6 × 3
#>   split_vec col_name    rn
#>   <chr>     <chr>    <dbl>
#> 1 3.2.1.23  ecNumber     1
#> 2 6.9       piValue      1
#> 3 3.2.1.23  ecNumber     2
#> 4 7.1       piValue      2
#> 5 4.4.1.14  ecNumber     3
#> 6 6         piValue      3
# final answer
df2 <- df %>%
  # spread the columns wider to get the dataframe into your specifications
  pivot_wider(id_cols = rn,
              names_from = col_name,
              values_from = split_vec) %>%
  dplyr::select(-rn)
df2
#> # A tibble: 3 × 2
#>   ecNumber piValue
#>   <chr>    <chr>
#> 1 3.2.1.23 6.9
#> 2 3.2.1.23 7.1
#> 3 4.4.1.14 6
Created on 2022-04-15 by the reprex package (v2.0.1)
CodePudding user response:
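In base R, we may use read.dcf after inserting ": " between each key and its value and replacing the delimiters with newlines ("#" becomes \n, "!" becomes \n\n), so that each record is a block of key: value lines separated by a blank line: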
str2 <- gsub("#", "\n", gsub("!", "\n\n", gsub("([a-z])([0-9])", "\\1: \\2", str1)))
read.dcf(textConnection(str2), all = TRUE)
  ecNumber piValue
1 3.2.1.23     6.9
2 3.2.1.23     7.1
3 4.4.1.14       6
data
str1 <- "ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6"
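The question also asks about awk. The following is only a minimal sketch, assuming every record holds exactly the two fields ecNumber and piValue in that order, and that "!" and "#" never occur inside a value. The variable brenda and the function reshape_brenda are hypothetical names used for illustration. It treats "!" as the record separator and "#" as the field separator, then strips the key names from each field.

```shell
#!/usr/bin/env bash
# Sample string from the question (variable name is hypothetical).
brenda='ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6'

# Reads the flat string on stdin and writes a tab-separated table on stdout.
reshape_brenda() {
  awk 'BEGIN { RS = "!"; FS = "#"; OFS = "\t"; print "ecNumber", "piValue" }
       NF < 2 { next }               # skip empty or incomplete records
       {
         sub(/^ecNumber/, "", $1)    # drop the key name, keep the value
         sub(/^piValue/, "", $2)
         sub(/\n$/, "", $2)          # drop a trailing newline on the last record
         print $1, $2
       }'
}

printf '%s' "$brenda" | reshape_brenda
```

The output is a header line followed by one tab-separated row per record. Records with additional fields would need extra columns handled in the awk body.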