I am attaching an example with my attempts because I cannot manage to arrange this data with R code. I have a data frame whose first column is the taxonomic lineage of microorganisms. Each remaining column is a DNA sequence, recoded as ASV1, ASV2, and so on.
In each column, only some of the values are 1; the rest are 0.
The code below is meant to be reproducible. The RData file to load the data frame is freely available at: https://www.jottacloud.com/s/191545e30dc99e14823959fadba6d189be5
library(readxl)  # provides read_xlsx()
data <- read_xlsx("combined_allranks_mpa.xlsx")
datastackoverchange <- as.data.frame(data)
# Rename the 3811 sequence columns to ASV_1 ... ASV_3811
names(datastackoverchange)[2:3812] <- sprintf("ASV_%d", seq_len(3811))
save.image("stackoverflow_data.RData")
# Subset the first two columns
data1 <- datastackoverchange[, c(1, 2)]
# Each column is mostly zeros, except in the rows of the lineage it belongs to.
# I remove the zeros, which are not of interest, with:
data1[data1 == 0] <- NA
data1 <- data1[complete.cases(data1), ]
# This gives the following table (see the linked image)
[![The ASV1 column has 4 rows with value "1" because each "1" corresponds to a specific lineage rank][1]][1]
In this first example (the `c(1, 2)` subset), the most complete (longest) lineage for ASV1 is
k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales. Usually,
the longest lineage for an ASV appears in the last position in the data frame.
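
Since the longest lineage is only *usually* in the last position, it may be safer to pick it by counting ranks rather than by position. A minimal sketch on made-up data (the `toy` data frame and its values are hypothetical; only the `Classification` column name and the `|` separator are taken from the data described above):

```r
# Hypothetical toy data mimicking the real layout; rows deliberately out of order
toy <- data.frame(
  Classification = c(
    "k__Bacteria|p__Firmicutes|c__Clostridia",
    "k__Bacteria",
    "k__Bacteria|p__Firmicutes"
  ),
  ASV_1 = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Number of ranks in each lineage (pieces obtained by splitting on "|")
n_ranks <- lengths(strsplit(toy$Classification, "|", fixed = TRUE))

# Among the rows flagged with 1, keep the lineage with the most ranks
hits <- which(toy$ASV_1 == 1)
deepest <- toy$Classification[hits[which.max(n_ranks[hits])]]
print(deepest)
```

Here the deepest lineage is found even though it sits in the first row, not the last.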
From this step, I would like to build a table (perhaps starting from an empty data frame or list) with one row per ASV, for example:
| Column A | Column B |
| -------- | -------- |
| ASV1 | k__Bacteria\|p__Firmicutes\|c__Clostridia\|o__Clostridiales |
| ASV2 | and so on |
and so on for each column (ASV2, ASV3, ...), using a loop to iterate over them,
so that I can use the data (I have 3811 different ASVs) for further analysis.
Thanks in advance for any hints or help on how I can solve this.
Many thanks.
Regards
[1]: https://i.stack.imgur.com/OZi9W.jpg
CodePudding user response:
Try this:
library(dplyr)  # provides %>% and last()
asv_cols <- datastackoverchange[, 2:ncol(datastackoverchange)]
# For each ASV column, take the lineage in the last row whose value is 1
values <- apply(asv_cols, 2, FUN = function(x) datastackoverchange$Classification[which(x == 1) %>% dplyr::last()])
id <- colnames(asv_cols)
df <- data.frame(id, values)
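
The approach above relies on the longest lineage being the last matching row in each column. A variant that does not depend on row order could look like the following base R sketch. The toy data and the helper name `deepest_lineage` are hypothetical, and columns with no 1 at all are assumed to map to `NA`:

```r
# Hypothetical toy data mimicking the real layout
toy <- data.frame(
  Classification = c(
    "k__Bacteria",
    "k__Bacteria|p__Firmicutes",
    "k__Bacteria|p__Firmicutes|c__Clostridia",
    "k__Bacteria|p__Proteobacteria"
  ),
  ASV_1 = c(1, 1, 1, 0),
  ASV_2 = c(1, 0, 0, 1),
  ASV_3 = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Number of ranks in each lineage
n_ranks <- lengths(strsplit(toy$Classification, "|", fixed = TRUE))

# Hypothetical helper: lineage with the most ranks among the rows equal to 1
deepest_lineage <- function(x) {
  hits <- which(x == 1)
  if (length(hits) == 0) return(NA_character_)
  toy$Classification[hits[which.max(n_ranks[hits])]]
}

asv_cols <- toy[, -1, drop = FALSE]
df <- data.frame(
  id = colnames(asv_cols),
  values = unname(vapply(asv_cols, deepest_lineage, character(1))),
  stringsAsFactors = FALSE
)
print(df)
```

`vapply()` is used instead of `apply()` so that the result is always a character vector with one entry per column, even when a column has no match.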
CodePudding user response:
Following up on my question, and with the help of another Stack Overflow question ("Extract Row and Column Name if the value for the cell in the data frame is greater than 0 and save value and row and column name to empty data frame"), I have made some progress:
Here is the code, using the RData file I shared earlier on this page:

```r
load("stackoverflow_data.RData")
datastackoverchange <- as.data.frame(datastackoverchange)
library(tidyverse)
# Move the lineage column into the row names
dat_clean_def <- datastackoverchange %>% remove_rownames() %>% column_to_rownames(var = "Classification")
# Locate every cell equal to 1 and record its row name, column name, and value
idx <- which(dat_clean_def == "1", arr.ind = TRUE)
results <- data.frame(Row = rownames(dat_clean_def)[idx[, 1]], Col = colnames(dat_clean_def)[idx[, 2]], Val = dat_clean_def[idx])
results
```
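
To see what `which(..., arr.ind = TRUE)` returns in the code above, here is a sketch on a small made-up matrix (the name `m` and its values are hypothetical). Each row of `idx` is the (row, column) position of one matching cell, which is what allows the row and column names to be looked up:

```r
# Hypothetical 3 x 2 matrix of 0/1 flags (filled column by column)
m <- matrix(
  c(1, 1, 0,
    1, 0, 0),
  nrow = 3,
  dimnames = list(
    c("k__Bacteria", "k__Bacteria;p__Firmicutes", "k__Bacteria;p__Proteobacteria"),
    c("ASV_1", "ASV_2")
  )
)

# Positions of every cell equal to 1
idx <- which(m == 1, arr.ind = TRUE)

# Translate positions into row names, column names, and cell values
results <- data.frame(
  Row = rownames(m)[idx[, 1]],
  Col = colnames(m)[idx[, 2]],
  Val = m[idx],
  stringsAsFactors = FALSE
)
print(results)
```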
However, I need to keep only the longest lineage for each ASV. The current output looks like this, e.g.:
| Row | Column | Value |
|---|---|---|
| k__Bacteria | ASV_1 | 1 |
| k__Bacteria;p__Firmicutes | ASV_1 | 1 |
| k__Bacteria;p__Firmicutes;c__Clostridia | ASV_1 | 1 |
| k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales | ASV_1 | 1 |
| k__Bacteria | ASV_2 | 1 |
So I am looking for a function that, for each distinct Column value, keeps the Row entry with the longest lineage (the one with the most "_").
Perhaps using stringr?
Thanks again.
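
One possible way to do this, sketched in base R on a data frame shaped like `results` (the rows are copied from the table above; the names `n_us`, `keep`, and `longest` are hypothetical). It counts underscores as suggested in the question; `stringr::str_count(results$Row, "_")` should give the same counts if stringr is preferred:

```r
# Long-format table with the same columns as `results`
results <- data.frame(
  Row = c(
    "k__Bacteria",
    "k__Bacteria;p__Firmicutes",
    "k__Bacteria;p__Firmicutes;c__Clostridia",
    "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales",
    "k__Bacteria"
  ),
  Col = c("ASV_1", "ASV_1", "ASV_1", "ASV_1", "ASV_2"),
  Val = 1,
  stringsAsFactors = FALSE
)

# Count the "_" characters in each lineage
n_us <- lengths(regmatches(results$Row, gregexpr("_", results$Row, fixed = TRUE)))

# For each column, keep the index of the row with the most underscores
keep <- unlist(lapply(
  split(seq_len(nrow(results)), results$Col),
  function(i) i[which.max(n_us[i])]
))

longest <- results[keep, ]
rownames(longest) <- NULL
print(longest)
```

If taxon names can themselves contain underscores, counting the `;` separators instead would likely be more robust.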