Organizing rows of data in R

I am working with genomic data and want to arrange the rows so that a specific gene (say b) ends up in the same column in every row. I am new to R programming and could not find a way to arrange the data this way.

An example of my data is:

  c1 c2 c3 c4
1 x  b
2 y  d  a  b  
3 x  a  b  e

Now, I want all the b's in a single column, with the remaining values shifted accordingly. The result should look something like this:

  c1 c2 c3 c4 c5
1 NA NA x  b  NA
2 y  d  a  b  NA
3 NA x  a  b  e

Can anyone help me out?

CodePudding user response：

Maybe this works:

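library(dplyr)
library(tidyr)
library(stringr)
# Note: this approach assumes every value is a single character
df1 %>%
    # paste each row into one string, dropping NAs: "xb", "ydab", "xabe"
    unite(new, everything(), sep="", na.rm =TRUE) %>% 
    # split into the text before "b" (c3), "b" itself (c4), and the text after it (c5)
    extract(new, into = str_c('c', 3:5), "(.*)(b)(.*)") %>%
    # left-pad the prefix with spaces so every row has the same width
    mutate(c3 = str_pad(c3, width = max(nchar(c3)))) %>%
    # split the padded prefix into one character per column
    separate(c3, into = str_c('c', 1:3), "(?<=.)(?<=.)") %>% 
    # turn the padding and empty strings into NA
    mutate(across(everything(), ~ na_if(trimws(.), "")))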

Output:

    c1   c2 c3 c4   c5
1 <NA> <NA>  x  b <NA>
2    y    d  a  b <NA>
3 <NA>    x  a  b    e

Data:

df1 <- structure(list(c1 = c("x", "y", "x"), c2 = c("b", "d", "a"), 
    c3 = c(NA, "a", "b"), c4 = c(NA, "b", "e")), 
class = "data.frame", row.names = c("1", 
"2", "3"))

CodePudding user response：

It's messy, but it should work for other data as long as each row contains exactly one b.

dat <- df1         # the question's data, defined as df1 in the previous answer
loc <- c()         # column position of "b" in each row
additional <- c()  # number of non-NA values after "b" in each row
for (i in 1:nrow(dat)){
  vec <- unname(unlist(dat[i,]))
  x <- which(vec == "b")
  y <- sum(!is.na(vec)) - which(vec[!is.na(vec)] == "b")
  
  loc <- c(loc, x)
  additional <- c(additional, max(y,0))
}

loc.max <- max(loc)
add.max <- max(additional)
n <- loc.max + add.max  # total number of columns in the result

res <- c()
for (i in 1:nrow(dat)){
  vec <- unname(unlist(dat[i,]))
  x <- which(vec == "b")
  y <- sum(!is.na(vec)) - which(vec[!is.na(vec)] == "b")
  add <- max(y,0)
  # pad with NA on both sides so that "b" lands in column loc.max
  rowvec <- c(rep(NA, loc.max - x), vec[!is.na(vec)], rep(NA, add.max - add))
  res <- rbind(res, rowvec)
}
rownames(res) <- NULL
res <- as.data.frame(res)
colnames(res) <- paste0("c", seq_len(n))

    c1   c2 c3 c4   c5
1 <NA> <NA>  x  b <NA>
2    y    d  a  b <NA>
3 <NA>    x  a  b    e
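
The padding idea used in the loops above could also be wrapped in a small base R function. The following is only a rough sketch, not a tested general solution: the name `align_on` and its `target` argument are made up for illustration, and it assumes character columns, NAs only at the end of a row, and that every row contains the target value (only the first occurrence is used for alignment).

```r
# Hypothetical helper: align every row on the first occurrence of `target`,
# padding with NA on the left and right as needed.
align_on <- function(df, target = "b") {
  # collect the non-NA values of each row
  rows <- lapply(seq_len(nrow(df)), function(i) {
    v <- as.character(unlist(df[i, ], use.names = FALSE))
    v[!is.na(v)]
  })
  # position of the target within each compacted row
  pos <- vapply(rows, function(v) match(target, v), integer(1))
  if (anyNA(pos)) stop("every row must contain the target value")
  before <- pos - 1L             # values to the left of the target
  after  <- lengths(rows) - pos  # values to the right of the target
  # pad each row so that the target lands in the same column
  padded <- Map(function(v, b, a) {
    c(rep(NA_character_, max(before) - b), v, rep(NA_character_, max(after) - a))
  }, rows, before, after)
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- paste0("c", seq_len(ncol(out)))
  out
}

# Same example data as in the first answer
df1 <- data.frame(c1 = c("x", "y", "x"), c2 = c("b", "d", "a"),
                  c3 = c(NA, "a", "b"), c4 = c(NA, "b", "e"),
                  stringsAsFactors = FALSE)
res <- align_on(df1, "b")
print(res)
```

Unlike the string-based pipeline, this sketch does not require the values to be single characters, since it never pastes the row into one string.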