I am working with genomic data and want to arrange the rows so that a specific gene (say b) sits in the same column in every row. I am new to R programming and could not find a way to arrange the data like this.
An example of my data is:
c1 c2 c3 c4
1 x b
2 y d a b
3 x a b e
Now, I want all the b's in a single column, with the rest of each row shifted accordingly. I want to organize it into something like this:
c1 c2 c3 c4 c5
1 NA NA x b NA
2 y d a b NA
3 NA x a b e
Can anyone help me out?
CodePudding user response:
Maybe this works:
library(dplyr)
library(tidyr)
library(stringr)
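df1 %>%
  # paste each row into one string, dropping NAs (relies on single-character values)
  unite(new, everything(), sep="", na.rm =TRUE) %>%
  # split around "b": the prefix, the "b" itself, and the suffix
  extract(new, into = str_c('c', 3:5), "(.*)(b)(.*)") %>%
  # left-pad the prefix so that every row has the same width
  mutate(c3 = str_pad(c3, width = max(nchar(c3)))) %>%
  # spread the padded prefix over one column per character
  separate(c3, into = str_c('c', 1:3), "(?<=.)(?<=.)") %>%
  # trim the padding and turn empty strings into NA
  mutate(across(everything(), ~ na_if(trimws(.), "")))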
-output
c1 c2 c3 c4 c5
1 <NA> <NA> x b <NA>
2 y d a b <NA>
3 <NA> x a b e
data
df1 <- structure(list(c1 = c("x", "y", "x"), c2 = c("b", "d", "a"),
c3 = c(NA, "a", "b"), c4 = c(NA, "b", "e")),
class = "data.frame", row.names = c("1",
"2", "3"))
CodePudding user response:
It's messy, but it might work for other data as long as each row contains
exactly one b.
# dat is assumed to be the input data frame (e.g. df1 from the answer above)
# First pass: for each row, record where "b" sits and how many values follow it
loc <- c()
additional <- c()
for (i in 1:nrow(dat)){
  vec <- dat[i,]
  x <- which(vec == "b")
  y <- sum(!is.na(vec)) - which(vec[!is.na(vec)] == "b")
  loc <- c(loc, x)
  additional <- c(additional, max(y,0))
}
loc.max <- max(loc)
add.max <- max(additional)
n <- loc.max + add.max  # total number of columns in the result
# Second pass: pad each row with NAs on both sides so that "b" lands in column loc.max
res <- c()
for (i in 1:nrow(dat)){
  vec <- unname(unlist(dat[i,]))
  x <- which(vec == "b")
  y <- sum(!is.na(vec)) - which(vec[!is.na(vec)] == "b")
  add <- max(y,0)
  rowvec <- c(rep(NA, loc.max - x), vec[!is.na(vec)], rep(NA, add.max - add))
  res <- rbind(res, rowvec)
}
coln <- paste0("c", seq_len(n))
res <- as.data.frame(res, row.names = FALSE)
colnames(res) <- coln
res
c1 c2 c3 c4 c5
1 <NA> <NA> x b <NA>
2 y d a b <NA>
3 <NA> x a b e
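The same padding idea can probably be written more compactly in base R without the two explicit loops. The following is only a sketch: `align_on` is a hypothetical helper name (not something from either answer), and it assumes character columns with exactly one occurrence of the target value per row. Unlike the `unite`/`separate` approach, it does not depend on the values being single characters, which may matter for real gene names.

```r
# Hypothetical helper: pad every row with NAs so that `target`
# ends up in the same column. Assumes one `target` per row.
align_on <- function(dat, target = "b") {
  # Drop NAs from each row, keeping the remaining values in order
  rows <- lapply(seq_len(nrow(dat)), function(i) {
    v <- unlist(dat[i, ], use.names = FALSE)
    v[!is.na(v)]
  })
  # Position of the target within each compacted row
  pos <- vapply(rows, function(v) which(v == target)[1], integer(1))
  stopifnot(!anyNA(pos))  # fail early if a row has no target
  # Number of values that follow the target in each row
  after <- lengths(rows) - pos
  left  <- max(pos) - pos      # NAs to prepend
  right <- max(after) - after  # NAs to append
  padded <- Map(function(v, l, r) c(rep(NA, l), v, rep(NA, r)),
                rows, left, right)
  res <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(res) <- paste0("c", seq_len(ncol(res)))
  res
}

# Example data from the question
df1 <- data.frame(c1 = c("x", "y", "x"), c2 = c("b", "d", "a"),
                  c3 = c(NA, "a", "b"), c4 = c(NA, "b", "e"),
                  stringsAsFactors = FALSE)
aligned <- align_on(df1, "b")
aligned
```

On the example data this should give the same five-column layout as above, with every b in c4. If a row can contain more than one b, only the first one is used for alignment, so that case would need a different rule.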