How to create a separate dataframe?

I have input data which looks like this:

DF()

symbol	sample1	sample2	sample3
Cohort	0	1	0
gene1	2334	99467	3782
gene2	3889	4893	22891

and I want to pull out the "Cohort" row together with the column names and make a separate data frame from them. Something like this:

symbol	Cohort
sample1	0
sample2	1

I tried this:

DF<- data %>% filter(row_number() == 1) 
data1<-t(DF)

but got this:

	V1
symbol	Cohort
sample1	0
sample2	1

Can somebody help me out?

CodePudding user response：

Your question is a bit tough to parse, but I think the code below should help.

I've used the tidyverse, a widely used collection of packages for reshaping and tidying data (https://r4ds.had.co.nz/ is a good resource for this and more).

There are alternatives such as reshape() in base R, though I'm not very familiar with that.

What I'm doing is:

Loading tidyverse with library()
Creating your sample dataframe using tribble()

On this df:

pivot_longer() stacks the columns selected by the cols argument into name/value pairs
- starts_with() selects every column whose name begins with "sample"
filter() keeps only the rows where symbol equals "Cohort"
select() chooses specific columns and renames them in a single step

library(tidyverse)

df <- tribble(
  ~"symbol", ~"sample1", ~"sample2", ~"sample3",
  "Cohort", 0, 1, 0,
  "gene1", 2334, 99467, 3782,
  "gene2", 3889, 4893, 22891
)

df %>% 
  pivot_longer(
    cols = starts_with("sample")
  ) %>% 
  filter(symbol == "Cohort") %>% 
  select(symbol = name,
         cohort = value)

CodePudding user response：

The data set is quite messy, but you can do:

library(dplyr)
dat %>% 
  filter(symbol == "Cohort") %>% 
  t() %>% as.data.frame() %>% 
  tibble::rownames_to_column() %>% 
  janitor::row_to_names(1)

   symbol Cohort
2 sample1      0
3 sample2      1
4 sample3      0
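
One thing to watch with the t() route: transposing a data frame that still contains the character column symbol gives a character matrix, so the Cohort values come back as text rather than numbers. Below is a minimal base R sketch that keeps them numeric, assuming the input looks like the sample in the question. The names dat, cohort_row and cohort_df are just ones I picked for illustration.

```r
# Sample input, rebuilt from the question (assumed layout)
dat <- data.frame(
  symbol  = c("Cohort", "gene1", "gene2"),
  sample1 = c(0, 2334, 3889),
  sample2 = c(1, 99467, 4893),
  sample3 = c(0, 3782, 22891)
)

# Keep only the Cohort row and drop the symbol column,
# so everything that is left is numeric
cohort_row <- dat[dat$symbol == "Cohort", -1]

# Column names become the first column, the row's values the second
cohort_df <- data.frame(
  symbol = names(cohort_row),
  Cohort = unlist(cohort_row, use.names = FALSE)
)

cohort_df
#    symbol Cohort
# 1 sample1      0
# 2 sample2      1
# 3 sample3      0
```

Because the symbol column is dropped before anything is reshaped, Cohort stays a numeric column and can be used directly for filtering or grouping.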

CodePudding user response：

Read the file twice to get the gene rows and the cohort row separately, for example:

myFile = "symbol    sample1 sample2 sample3
Cohort  0   1   0
gene1   2334    99467   3782
gene2   3889    4893    22891
"

#get column names from the header row, then read the gene rows (skipping header and cohort rows)
myCols <- names(read.table(text = myFile, nrows = 1, header = TRUE))
d1 <- read.table(text = myFile, skip = 2, col.names = myCols)

#get cohort row as vector
cohort <- read.table(text = myFile, skip = 1, nrows = 1)
cohort <- as.numeric(cohort[, -1])

Now convert the gene dataset into a matrix, with gene names as row names and sample IDs as column names. This makes later calculations and subsetting more efficient.

m <- as.matrix(d1[, -1])
dimnames(m) <- list(d1[, 1], colnames(d1)[ -1 ])
m
#       sample1 sample2 sample3
# gene1    2334   99467    3782
# gene2    3889    4893   22891

For example, to get gene1 for samples from cohort 1:

m[ "gene1", cohort == 1]
# [1] 99467
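
The same logical index works for whole-matrix summaries. As a rough sketch (not part of the approach above, just an illustration), here is how per-cohort gene means could be computed. The matrix and cohort vector are rebuilt by hand so the snippet runs on its own, and cohort0_means / cohort1_means are hypothetical names.

```r
# Rebuild the matrix and cohort vector from the example data
m <- matrix(
  c(2334, 99467, 3782,
    3889, 4893, 22891),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("gene1", "gene2"),
                  c("sample1", "sample2", "sample3"))
)
cohort <- c(0, 1, 0)

# drop = FALSE keeps the subset a matrix even when only one sample matches,
# which rowMeans() needs
cohort0_means <- rowMeans(m[, cohort == 0, drop = FALSE])
cohort1_means <- rowMeans(m[, cohort == 1, drop = FALSE])

cohort0_means
# gene1 gene2
#  3058 13390
```

Without drop = FALSE, the cohort 1 subset would collapse to a plain vector (only sample2 matches) and rowMeans() would fail.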