UPDATED QUESTION
I have this character vector
str_ <- "H3K9me0S10ph1K14ac1me0"
I would like to break it into pieces such that I get an output like:
"H3K9: me0 | S10: ph1 | K14: ac1,me0"
Preferably this is done in a manner that utilizes {dplyr}, such that I can perform this operation on a tibble and get a new column with the desired character string output. Any ideas?
As the below section suggests, I'm struggling with getting a table that denotes which modifications are paired with what, e.g. that the me0 goes with H3K9 and BOTH the ac1,me0 go with K14
Any assistance would be so helpful!
Pieces of attempts
Using a slightly different example,
str_ <- "H3K9ac1K14ac1K18ac1me0"
So I've tried breaking the character vector into pieces by extracting all "me[0-9]*" or "ac[0-9]*" etc, then giving them an id which corresponds to their index in the character vector.
# A tibble: 4 x 2
      i m
  <int> <chr>
1    12 ac1
2    17 ac1
3    23 ac1
4    26 me0
I need a way to create a column, together, that tells whether two modifications belong to the same protein, i.e. in this example K18 has ac1 and me0, so their 'together' values should be 'TRUE'. I've tried using the distance between their indices as a surrogate for togetherness, but I don't think this is the best way to do it:
# A tibble: 4 x 4
      i m     unit_diff together
  <int> <chr>     <int> <lgl>
1    12 ac1           0 FALSE
2    17 ac1           5 FALSE
3    23 ac1           6 TRUE
4    26 me0           3 TRUE
Any ideas? I've tried using modulo 3, but this doesn't seem to generalize. Is this even the correct way to be doing this? I'm open to suggestions
CodePudding user response:
Use diff to create the 'unit_diff' column and then use %% to flag the differences that are non-zero multiples of 3
library(dplyr)
df1 %>%
  mutate(unit_diff = c(0, diff(i)),
         together = unit_diff %% 3 == 0 & unit_diff != 0)
-output
# A tibble: 4 × 4
      i m     unit_diff together
  <dbl> <chr>     <dbl> <lgl>
1    12 ac1           0 FALSE
2    17 ac1           5 FALSE
3    23 ac1           6 TRUE
4    26 me0           3 TRUE
If we want to check that the TRUE values form a run of exactly n adjacent values, use rleid from data.table (or rle from base R)
library(data.table)
n <- 2
df1 %>%
  mutate(unit_diff = c(0, diff(i)),
         together = unit_diff %% 3 == 0 & unit_diff != 0) %>%
  group_by(grp = rleid(together)) %>%
  mutate(together = all(together) & n() == n) %>%
  ungroup %>%
  select(-grp)
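The base R rle route mentioned above could look roughly like the sketch below. It is only an illustration on the same i values as df1; together_n is a made-up name, and the modulo-3 rule is carried over unchanged from the code above.

```r
# Same positions as the i column of df1
i <- c(12, 17, 23, 26)
n <- 2

unit_diff <- c(0, diff(i))
together  <- unit_diff %% 3 == 0 & unit_diff != 0

# rle() returns the length and the value of each run of identical values
runs <- rle(together)

# Keep a run only if it is a run of TRUE with exactly n elements,
# then expand back to one value per row
together_n <- rep(runs$values & runs$lengths == n, times = runs$lengths)
together_n
```

This avoids the data.table dependency, at the cost of working on plain vectors rather than inside the pipe.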
For the updated question, we can use regex to insert some delimiters. First, we capture one or more characters that are not lowercase letters (([^a-z]+)) and replace them with the backreference of the captured group followed by a colon and a space (\\1: ). Then we insert the | between a lowercase letter followed by a digit and an uppercase letter, remove the trailing : at the end with trimws, and finally replace the : with , after each run of one or more lowercase letters followed by one or more digits
gsub("([a-z]+\\d+):", "\\1,",
  trimws(gsub("(?<=[a-z][0-9])(?=[A-Z])", " | ",
    gsub("([^a-z]+)", "\\1: ", str_), perl = TRUE), whitespace = ":\\s+"))
[1] "H3K9: me0 | S10: ph1 | K14: ac1, me0"
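Since the question asks for something usable on a tibble, here is a minimal sketch that wraps the same three substitutions in a function and calls it from mutate. The names format_mods, tbl and pretty are made up for illustration; the regex steps are the ones described above, with the quantifiers written out.

```r
library(dplyr)

# format_mods() is a hypothetical helper name, not part of any package
format_mods <- function(x) {
  # 1. add ": " after every run of non-lowercase characters
  x <- gsub("([^a-z]+)", "\\1: ", x)
  # 2. put " | " between a modification (letter + digit) and the next site
  x <- gsub("(?<=[a-z][0-9])(?=[A-Z])", " | ", x, perl = TRUE)
  # 3. drop the trailing ": " left over from step 1
  x <- trimws(x, whitespace = ":\\s+")
  # 4. a ":" directly after a modification becomes ","
  gsub("([a-z]+\\d+):", "\\1,", x)
}

# Example tibble built from the two strings in the question
tbl <- tibble(str_ = c("H3K9me0S10ph1K14ac1me0", "H3K9ac1K14ac1K18ac1me0"))
tbl %>%
  mutate(pretty = format_mods(str_))
```

All of the calls are vectorised, so no rowwise step should be needed. Note that the lookbehind in step 2 only looks at a single digit after the lowercase letter, so this sketch assumes modification counts like me0 or ac1 rather than me10.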
data
df1 <- structure(list(i = c(12, 17, 23, 26), m = c("ac1", "ac1", "ac1",
"me0")), class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA,
-4L))
CodePudding user response:
I wrote a small function for your output: Note I am not very experienced in writing functions!
str_ <- "H3K9me0S10ph1K14ac1me0"
library(stringr)
library(knitr)
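clean_func <- function(str_) {
  # modifications: lowercase letters followed by digits, e.g. "me0"
  x <- str_extract_all(str_, '([a-z]+[0-9]+)')[[1]]
  # sites: what is left after blanking out the modifications, e.g. "H3K9"
  y <- strsplit(str_replace_all(str_, paste(x, collapse = '|'), ' '), ' ')[[1]]
  # hard-coded for this input: the 3rd and 4th modifications share the 3rd site
  x[3] <- knitr::combine_words(x[3:4], and = ",")
  x1 <- x[1:3]
  y1 <- y[1:3]
  result <- paste(paste(y1, x1, sep = ": "), collapse = " | ")
  return(result)
}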
clean_func(str_)
[1] "H3K9: me0 | S10: ph1 | K14: ac1,me0"
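The function above is tied to this particular string through the fixed indices 1:3 and 3:4. A possible way to generalise the same extract-and-paste idea is sketched below. clean_general is a made-up name, and the pattern assumes every site starts with an uppercase letter and every modification is lowercase letters followed by digits.

```r
library(stringr)

# clean_general() is a hypothetical name; it handles one string at a time
clean_general <- function(s) {
  # Each match is a site (e.g. "H3K9") followed by all of its modifications
  units <- str_match_all(s, "([A-Z][A-Z0-9]*)((?:[a-z]+[0-9]+)+)")[[1]]
  sites <- units[, 2]
  # Split each run of modifications ("ac1me0") and rejoin it with commas
  mods <- vapply(str_extract_all(units[, 3], "[a-z]+[0-9]+"),
                 paste, character(1), collapse = ",")
  paste(paste(sites, mods, sep = ": "), collapse = " | ")
}

clean_general("H3K9me0S10ph1K14ac1me0")
```

Because it takes a single string, it would need something like vapply(str_, clean_general, character(1), USE.NAMES = FALSE) inside mutate to fill a tibble column.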