I have input data which looks like this:
DF()
symbol | sample1 | sample2 | sample3 |
---|---|---|---|
Cohort | 0 | 1 | 0 |
gene1 | 2334 | 99467 | 3782 |
gene2 | 3889 | 4893 | 22891 |
I want to pull the "Cohort" row out, together with the column names, into a separate data frame. Something like this:
symbol | Cohort |
---|---|
sample1 | 0 |
sample2 | 1 |
I tried this:
DF <- data %>% filter(row_number() == 1)
data1 <- t(DF)
but got this:
V1 | |
---|---|
symbol | Cohort |
sample1 | 0 |
sample2 | 1 |
Can somebody help me out?
CodePudding user response:
Your question is a bit tough to parse, but I think the code below should help.
I've used the tidyverse, a widely used collection of packages for reshaping and tidying data (https://r4ds.had.co.nz/ is a good resource for this and more).
There are alternatives such as reshape() in base R, though I'm not very familiar with that.
What I'm doing is:
- Loading tidyverse with library()
- Creating your sample dataframe using tribble()
On this df:
- pivot_longer() stacks the columns selected by the cols argument into name/value pairs
- starts_with() selects every column whose name begins with "sample"
- filter() keeps only the rows where symbol equals "Cohort"
- select() chooses specific columns and renames them in a single step
library(tidyverse)
df <- tribble(
  ~"symbol", ~"sample1", ~"sample2", ~"sample3",
  "Cohort", 0, 1, 0,
  "gene1", 2334, 99467, 3782,
  "gene2", 3889, 4893, 22891
)
df %>%
  pivot_longer(
    cols = starts_with("sample")
  ) %>%
  filter(symbol == "Cohort") %>%
  select(symbol = name,
         cohort = value)
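For comparison, here is a minimal base R sketch that should give the same shape without loading any packages. It does not use reshape(); it just subsets the Cohort row directly. The names cohort_row and cohort_df are made up for this illustration.

```r
# Rebuild the example data in base R (same values as the tribble above)
df <- data.frame(
  symbol  = c("Cohort", "gene1", "gene2"),
  sample1 = c(0, 2334, 3889),
  sample2 = c(1, 99467, 4893),
  sample3 = c(0, 3782, 22891)
)

# Keep only the Cohort row and drop the symbol column
cohort_row <- df[df$symbol == "Cohort", -1]

# One row per sample: column names become the symbol column,
# the Cohort values become a numeric column
cohort_df <- data.frame(
  symbol = names(cohort_row),
  Cohort = unlist(cohort_row, use.names = FALSE)
)
cohort_df
```

Because the values never pass through a character matrix, Cohort stays numeric here.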
CodePudding user response:
The data set is quite messy, but you can do:
library(dplyr)
dat %>%
  filter(symbol == "Cohort") %>%
  t() %>% as.data.frame() %>%
  tibble::rownames_to_column() %>%
  janitor::row_to_names(1)
symbol Cohort
2 sample1 0
3 sample2 1
4 sample3 0
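One thing to watch with the transpose approach: t() on a data frame that mixes text and numbers returns a character matrix, so the Cohort values come back as strings. Below is a minimal base R sketch of the same transpose idea that converts them back to numbers. Here dat is rebuilt from the question's example, and tr and out are names made up for the illustration.

```r
# Example data from the question
dat <- data.frame(
  symbol  = c("Cohort", "gene1", "gene2"),
  sample1 = c(0, 2334, 3889),
  sample2 = c(1, 99467, 4893),
  sample3 = c(0, 3782, 22891)
)

# Transposing the Cohort row gives a 4 x 1 character matrix
tr <- t(dat[dat$symbol == "Cohort", ])
is.character(tr)

# Drop the first entry (the "symbol"/"Cohort" pair) and
# convert the remaining values back to numeric
out <- data.frame(
  symbol = rownames(tr)[-1],
  Cohort = as.numeric(tr[-1, 1])
)
out
```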
CodePudding user response:
Read the file twice to get the gene rows and the cohort row separately, for example:
myFile = "symbol sample1 sample2 sample3
Cohort 0 1 0
gene1 2334 99467 3782
gene2 3889 4893 22891
"
# get column names from the header line
myCols <- names(read.table(text = myFile, nrows = 1, header = TRUE))
# read the gene rows, skipping the header and the Cohort row
d1 <- read.table(text = myFile, skip = 2, col.names = myCols)
# get the Cohort row as a numeric vector (dropping the "Cohort" label)
cohort <- read.table(text = myFile, skip = 1, nrows = 1)
cohort <- as.numeric(cohort[, -1])
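If it helps to keep track of which value belongs to which sample, the cohort vector could also carry the sample IDs as names. A small sketch under that assumption (cohort_row and cohort_named are hypothetical names, not part of the code above):

```r
myFile <- "symbol sample1 sample2 sample3
Cohort 0 1 0
gene1 2334 99467 3782
gene2 3889 4893 22891
"

# Column names come from the header line
myCols <- names(read.table(text = myFile, nrows = 1, header = TRUE))

# Read only the Cohort row (skip the header, take one line)
cohort_row <- read.table(text = myFile, skip = 1, nrows = 1)

# Drop the "Cohort" label and attach the sample IDs as names
cohort_named <- setNames(as.numeric(unlist(cohort_row[1, -1])), myCols[-1])
cohort_named

# Sample IDs that belong to cohort 1
names(cohort_named)[cohort_named == 1]
```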
Now convert the gene data into a matrix, with gene names as row names and sample IDs as column names. This makes later calculations and subsetting simpler and usually faster.
m <- as.matrix(d1[, -1])
dimnames(m) <- list(d1[, 1], colnames(d1)[ -1 ])
m
# sample1 sample2 sample3
# gene1 2334 99467 3782
# gene2 3889 4893 22891
For example, to get gene1 for samples from cohort 1:
m[ "gene1", cohort == 1]
# [1] 99467
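The same logical indexing extends to whole groups of samples. As a rough sketch, assuming the matrix and cohort vector look like the ones above (rebuilt here by hand so the snippet runs on its own, with m0 as a made-up name), this selects every gene for the cohort 0 samples and averages them:

```r
# Same gene matrix as above, typed in directly (filled column by column)
m <- matrix(
  c(2334, 3889, 99467, 4893, 3782, 22891),
  nrow = 2,
  dimnames = list(c("gene1", "gene2"),
                  c("sample1", "sample2", "sample3"))
)
cohort <- c(0, 1, 0)

# All genes for samples in cohort 0; drop = FALSE keeps the result
# a matrix even if only one sample matches
m0 <- m[, cohort == 0, drop = FALSE]
m0

# Mean per gene across the cohort 0 samples
rowMeans(m0)
# gene1 = 3058, gene2 = 13390
```

Using drop = FALSE matters mostly when a cohort has a single sample, since the default would collapse the result to a plain vector and lose the column name.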