How to create a separate dataframe?

Time:11-29

I have input data which looks like this:

DF()

**symbol sample1 sample2 sample3**
Cohort 0 1 0
gene1 2334 99467 3782
gene2 3889 4893 22891

and I want to separate "Cohort" and the column names and make a separate data frame. Something like this:

symbol Cohort
sample1 0
sample2 1

I tried this:

DF<- data %>% filter(row_number() == 1) 
data1<-t(DF)

but got this:

V1
symbol Cohort
sample1 0
sample2 1

Can somebody help me out?

CodePudding user response:

Your question is a bit tough to parse, but I think the code below should help.

I've used the tidyverse, a widely used collection of packages for reshaping and tidying data (https://r4ds.had.co.nz/ is a good resource for this and more).

There are alternatives such as reshape() in base R, though I'm not very familiar with that.

What I'm doing is:

  • Loading tidyverse with library()
  • Creating your sample dataframe using tribble()

On this df:

  • pivot_longer() stacks the columns selected by its cols argument into name/value pairs
    • starts_with() selects every column whose name begins with "sample"
  • filter() keeps only the rows where symbol equals "Cohort"
  • select() chooses specific columns and renames them in a single step

library(tidyverse)

df <- tribble(
  ~"symbol", ~"sample1", ~"sample2", ~"sample3",
  "Cohort", 0, 1, 0,
  "gene1", 2334, 99467, 3782,
  "gene2", 3889, 4893, 22891
)

df %>% 
  pivot_longer(
    cols = starts_with("sample")
  ) %>% 
  filter(symbol == "Cohort") %>% 
  select(symbol = name,
         cohort = value)
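
For comparison, the same reshaping can be done without the tidyverse. The following is only a minimal base R sketch, assuming the data is held in a plain data.frame with the column names from the question; the names cohort_row and cohort_df are made up for illustration.

```r
# Same sample data as in the question, built as a plain data.frame
df <- data.frame(
  symbol  = c("Cohort", "gene1", "gene2"),
  sample1 = c(0, 2334, 3889),
  sample2 = c(1, 99467, 4893),
  sample3 = c(0, 3782, 22891)
)

# Keep only the Cohort row and drop the "symbol" column
cohort_row <- df[df$symbol == "Cohort", -1]

# Sample names go into one column, the cohort values into the other
cohort_df <- data.frame(
  symbol = names(cohort_row),
  Cohort = unlist(cohort_row, use.names = FALSE)
)
cohort_df
```

Because the values are never passed through t(), the Cohort column stays numeric here.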

CodePudding user response:

The data set is quite messy, but you can do:

library(dplyr)
dat %>% 
  filter(symbol == "Cohort") %>% 
  t() %>% as.data.frame() %>% 
  tibble::rownames_to_column() %>% 
  janitor::row_to_names(1)

   symbol Cohort
2 sample1      0
3 sample2      1
4 sample3      0
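
One thing to watch: t() on a data frame that mixes text and numbers returns a character matrix, so the Cohort column in that result is likely to be text rather than numeric. A small sketch of converting it back, using a hand-built stand-in for the result (the name res is hypothetical):

```r
# Hypothetical stand-in for the transposed result: both columns are
# character, since t() coerces a mixed data frame to a character matrix
res <- data.frame(
  symbol = c("sample1", "sample2", "sample3"),
  Cohort = c("0", "1", "0")
)

# Convert the cohort labels back to numbers
res$Cohort <- as.numeric(res$Cohort)
str(res)
```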

CodePudding user response:

Read the file twice to get genes and cohort rows separately, for example:

myFile = "symbol    sample1 sample2 sample3
Cohort  0   1   0
gene1   2334    99467   3782
gene2   3889    4893    22891
"

# get the column names from the header row, then read the gene rows (skipping the header and cohort rows)
myCols <- names(read.table(text = myFile, nrows = 1, header = TRUE))
d1 <- read.table(text = myFile, skip = 2, col.names = myCols)

#get cohort row as vector
cohort <- read.table(text = myFile, skip = 1, nrows = 1)
cohort <- as.numeric(cohort[, -1])

Now convert the gene dataset into a matrix, with gene names as row names and sample IDs as column names. This makes later calculations and subsetting more efficient.

m <- as.matrix(d1[, -1])
dimnames(m) <- list(d1[, 1], colnames(d1)[ -1 ])
m
#       sample1 sample2 sample3
# gene1    2334   99467    3782
# gene2    3889    4893   22891

For example, to get gene1 for samples from cohort 1:

m[ "gene1", cohort == 1]
# [1] 99467
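
The same logical indexing extends to whole cohorts. As a rough sketch, assuming the matrix and cohort vector from the question's sample data (rebuilt here so the snippet runs on its own; m0 and cohort0_means are illustrative names):

```r
# Rebuild the matrix and cohort vector from the sample data
m <- matrix(
  c(2334, 99467, 3782,
    3889, 4893, 22891),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("gene1", "gene2"),
                  c("sample1", "sample2", "sample3"))
)
cohort <- c(0, 1, 0)

# All genes for the cohort 0 samples; drop = FALSE keeps the result a matrix
m0 <- m[, cohort == 0, drop = FALSE]

# Mean value per gene within cohort 0
cohort0_means <- rowMeans(m0)
cohort0_means
```

Using drop = FALSE matters when a cohort has only one sample, as cohort 1 does here: without it the subset collapses to a plain vector and rowMeans() fails.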