R: Creating two variables from a single string (Protein names)

I have some protein data in the format of variable / value. The 'value' is self-explanatory. The 'variable' is a string in the form 'PRTN_ASSAYCODE' where 'PRTN' is a particular protein, and 'ASSAYCODE' is a separate string for the sequence used to detect the protein. For a given protein, there are one, two or three different sequences.

What I'm trying to do is split the string into two separate variables and use them for a facet_grid in ggplot (proteins shown vertically, and the different assays for each protein shown horizontally). To do this, I need to create a new variable that numbers the assays within each protein (1, 2 or 3, or some other factor).

For example:

input            output
ALBU_AAFZXAA --> ALBU, 1
ALBU_AAFZXAA --> ALBU, 1
ALBU_ABGHHSA --> ALBU, 2
FIBR_HFGIAAO --> FIBR, 1
FIBR_YOUSAAA --> FIBR, 2
FIBR_ERAATTA --> FIBR, 3

I can use strsplit to split the string, i.e. I have the protein code, but not the assay code in a usable form.

My best guess so far is to use a for loop to run down the dataframe, looking for changes in the first part of the string, then annotating any change in the second part of the string. But it's really cumbersome and error-prone.

Any helpful ideas? My dataframe has ~3000 rows so annotating manually is not an option.

CodePudding user response：

Use the data.table functions tstrsplit() and rleid(): the former splits the string, the latter creates a sequential id that increases each time the value changes. The by makes rleid() restart for each protein.

library(data.table)
data <- data.table(
  protein = c("ABC_DFG", "ABC_DFG", "ABC_HIJ", "XYZ_TUV")
)
# Solution:
data[, `:=`("ID1" = tstrsplit(protein, "_")[[1]], 
            "ID2" = rleid(tstrsplit(protein, "_")[[2]])),
     by=tstrsplit(protein, "_")[[1]]]

Results in

> data
   protein ID1 ID2
1: ABC_DFG ABC   1
2: ABC_DFG ABC   1
3: ABC_HIJ ABC   2
4: XYZ_TUV XYZ   1

A tidier bit of code, using data.table chaining (DT[][]):

data[, ID1 := tstrsplit(protein, "_")[[1]]][, 
       ID2 := rleid(tstrsplit(protein, "_")[[2]]), by=ID1]
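
One caveat worth sketching: rleid() numbers consecutive runs, so it only gives one id per assay if the rows for each assay sit next to each other within a protein. The example below uses made-up rows (not from the question) where an assay code reappears later, and compares rleid() with match() against unique(), which numbers assays by first appearance instead. The column names assay, id_rle and id_match are illustrative assumptions.

```r
library(data.table)

# Hypothetical data: the DFG assay appears again after HIJ for protein ABC
dt <- data.table(protein = c("ABC_DFG", "ABC_HIJ", "ABC_DFG", "XYZ_TUV"))

# Split once into two new columns
dt[, c("ID1", "assay") := tstrsplit(protein, "_", fixed = TRUE)]

# rleid() counts runs, so the second DFG run gets a new number
dt[, id_rle := rleid(assay), by = ID1]

# match() against unique() numbers assays by order of first appearance
dt[, id_match := match(assay, unique(assay)), by = ID1]

print(dt)
```

If the data is already sorted by the full string, both columns agree and either approach should work.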

CodePudding user response：

Use tidyr::separate. You can then use v1 and v2 as unique identifiers for your facet_grid.

library(tidyr)
library(dplyr)  # provides %>%, group_by() and mutate()

data %>% separate(protein, c("v1","v2"))
    v1      v2
1 ALBU AAFZXAA
2 ALBU AAFZXAA
3 ALBU ABGHHSA
4 FIBR HFGIAAO
5 FIBR YOUSAAA
6 FIBR ERAATTA

To get a numeric id, add data.table::rleid.

data %>% separate(protein, c("v1","v2")) %>% 
  group_by(v1) %>% 
  mutate(group = data.table::rleid(v2))
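
To connect the numeric id back to the facet_grid goal in the question, here is a minimal sketch. It assumes the data has a variable column holding the combined string and a value column; the values below are placeholders invented for illustration, and the names plot_data and assay_id are hypothetical. It uses match() against unique() rather than rleid() to keep the example free of data.table, which is a substitution and not what the answer above does.

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Names follow the question's example; `value` is made-up placeholder data
data <- data.frame(
  variable = c("ALBU_AAFZXAA", "ALBU_AAFZXAA", "ALBU_ABGHHSA",
               "FIBR_HFGIAAO", "FIBR_YOUSAAA", "FIBR_ERAATTA"),
  value = c(1.2, 1.4, 0.9, 2.1, 2.3, 1.8)
)

plot_data <- data %>%
  separate(variable, c("protein", "assay"), sep = "_") %>%
  group_by(protein) %>%
  # Number the assays within each protein by order of first appearance
  mutate(assay_id = factor(match(assay, unique(assay)))) %>%
  ungroup()

# Proteins as facet rows (vertical), assay number as facet columns (horizontal)
p <- ggplot(plot_data, aes(x = assay, y = value)) +
  geom_point() +
  facet_grid(protein ~ assay_id, scales = "free_x")
```

Printing p draws the grid; geom_point() is only a stand-in for whatever geometry suits the real values.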

Data

data <- data.frame(protein = c("ALBU_AAFZXAA", "ALBU_AAFZXAA", "ALBU_ABGHHSA", 
                              "FIBR_HFGIAAO","FIBR_YOUSAAA","FIBR_ERAATTA"))