R data arrangement for metagenomic data-CodePudding

I am attaching an example with my attempts because I have not been able to arrange my data with R code. I have a data frame whose first column is the taxonomic lineage of microorganisms. Each of the remaining columns is a DNA sequence recoded as ASV1, ASV2, and so on.

In each column, only some of the values are equal to 1; the rest are 0.

The code below makes the example reproducible. The RData file containing the data frame is freely available at: https://www.jottacloud.com/s/191545e30dc99e14823959fadba6d189be5


library(readxl)  # provides read_xlsx()

data <- read_xlsx("combined_allranks_mpa.xlsx")

datastackoverchange <- as.data.frame(data)

# Rename the 3811 sequence columns to ASV_1 ... ASV_3811
names(datastackoverchange)[2:3812] <- sprintf("ASV_%d", 1:3811)

save.image("stackoverflow_data.RData")

# I take a subset of the first two columns

data1 <- datastackoverchange[, c(1, 2)]

# Each column is mostly zeros, except in the rows of the lineage it corresponds to.
# I remove the zeros, which are not of interest, with:
data1[data1 == 0] <- NA
data1 <- data1[complete.cases(data1), ]

# And I obtain the following table (see the linked image)

[![The ASV1 column has 4 rows with value "1" because each "1" corresponds to a specific lineage rank][1]][1]

# In the first example (subset c(1,2)), the most complete (longest) lineage for ASV1 is
# k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales. Usually,
# the longest ASV lineage appears in the last position in the data frame.

# From this step, I would like to build, perhaps starting from
# an empty data frame or list, a table that stores, for example:


| Column A | Column B |
| -------- | -------- |
| ASV1   | k__Bacteria\|p__Firmicutes\|c__Clostridia\|o__Clostridiales  |
| ASV2   | and so on   |

# and so on for each column (ASV2, ASV3, ...), using a loop to iterate over them,
# so that I can use the data (I have 3811 different ASVs) for further analysis.

Thanks in advance for any hints and help on how I can solve this.

Many thanks

Regards


  [1]: https://i.stack.imgur.com/OZi9W.jpg

CodePudding user response：

Try this:

library(dplyr)  # provides %>% and last()

# For each ASV column, take the classification of the last row whose value is 1
values <- apply(datastackoverchange[, 2:ncol(datastackoverchange)], 2, FUN = function(x) datastackoverchange$Classification[which(x == 1) %>% dplyr::last()])

id <- colnames(datastackoverchange[, 2:ncol(datastackoverchange)])

df <- data.frame(id, values)
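
As a quick check of the idea above, here is a minimal sketch on a toy data frame. The object `toy` and its three lineages are made up for illustration (the real table has 3811 ASV columns), and base R's `tail()` is used in place of `dplyr::last()`. It assumes the rows are ordered from shortest to longest lineage, as described in the question.

```r
# Toy stand-in for datastackoverchange (hypothetical values, for illustration only)
toy <- data.frame(
  Classification = c("k__Bacteria",
                     "k__Bacteria|p__Firmicutes",
                     "k__Bacteria|p__Firmicutes|c__Clostridia"),
  ASV_1 = c(1, 1, 1),
  ASV_2 = c(1, 1, 0),
  stringsAsFactors = FALSE
)

asv_cols <- toy[, -1, drop = FALSE]

# For each ASV column, keep the classification of the last row whose value is 1.
# If a column has no 1 at all, the result for that column is NA.
values <- vapply(asv_cols, function(x) {
  hits <- which(x == 1)
  if (length(hits) == 0) NA_character_ else toy$Classification[tail(hits, 1)]
}, character(1))

df_toy <- data.frame(id = names(values), values = unname(values),
                     stringsAsFactors = FALSE)
df_toy
```

If the rows are not guaranteed to be ordered by lineage depth, picking the last row is not enough and the lineage length has to be compared explicitly.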

CodePudding user response：

Following up on my question, and based on the Stack Overflow question "Extract Row and Column Name if the value for the cell in the data frame is greater than 0 and save value and row and column name to empty data frame", I have made some progress:

Here's the code, starting from the RData file submitted in my previous comment on this page:

load("stackoverflow_data.RData")
datastackoverchange <- as.data.frame(datastackoverchange)

library(tidyverse)

dat_clean_def <- datastackoverchange %>% remove_rownames %>% column_to_rownames(var = "Classification")

idx <- which(dat_clean_def == "1", arr.ind = TRUE)
results <- data.frame(Row = rownames(dat_clean_def)[idx[, 1]], Col = colnames(dat_clean_def)[idx[, 2]], Val = dat_clean_def[idx])
results

However, I need to keep only the longest lineage for each column. For example, given:

Row	Column	Value
k__Bacteria	ASV_1	1
k__Bacteria;p__Firmicutes	ASV_1	1
k__Bacteria;p__Firmicutes;c__Clostridia	ASV_1	1
k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales	ASV_1	1
k__Bacteria	ASV_2	1

So I am looking for a function that, for each distinct Column value, picks the longest Row entry (the one with the most "_").

Perhaps using stringr?
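
One possible sketch of that last step, assuming a data frame shaped like `results` above (columns `Row`, `Col`, `Val`): count the rank prefixes in each lineage and keep, for every ASV, the row with the most ranks. The toy table `results_toy` is made up for illustration. The sketch uses base R only; `stringr::str_count(Row, "__")` should give the same count.

```r
# Toy version of the `results` table (hypothetical values, for illustration only)
results_toy <- data.frame(
  Row = c("k__Bacteria",
          "k__Bacteria;p__Firmicutes",
          "k__Bacteria;p__Firmicutes;c__Clostridia",
          "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales",
          "k__Bacteria"),
  Col = c("ASV_1", "ASV_1", "ASV_1", "ASV_1", "ASV_2"),
  Val = 1,
  stringsAsFactors = FALSE
)

# Number of ranks per lineage = number of "__" prefixes (k__, p__, c__, ...).
# Assumption: "__" only appears in rank prefixes, never inside a taxon name.
results_toy$n_ranks <- lengths(regmatches(
  results_toy$Row,
  gregexpr("__", results_toy$Row, fixed = TRUE)
))

# For each ASV, keep the row with the most ranks (the first one wins on ties)
by_asv <- split(results_toy, results_toy$Col)
longest <- do.call(rbind, lapply(by_asv, function(d) d[which.max(d$n_ranks), ]))
rownames(longest) <- NULL
longest[, c("Col", "Row")]
```

Note that `split()` orders the groups alphabetically by column name, so with many ASVs the output order will differ from the original column order (for example, ASV_10 comes before ASV_2).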

Thanks again