I have two dataframes: one with gene names and their counts, and a second with the gene names and their ontological names. I want to update the gene names from df1 with the names they are associated with in df2.
Sample data:
df1 <- data.frame(ID=c("gene1","gene2","gene3"), sample1=c(1,0,50), sample2=c(0,0,0), sample3=c(45,56,11))
rownames(df1) <- df1$ID
df1$ID <- NULL
> df1
      sample1 sample2 sample3
gene1       1       0      45
gene2       0       0      56
gene4      50       0      11
df2 <- data.frame(ID=c("gene1","gene2","gene3", "gene4"), name=c("hr1","gene2","exoc like exoc1 in drosophila", "ftp"), desc=c("protein","unknown","fake immunity known for fighting viruses", "like ftp1"))
rownames(df2) <- df2$ID
df2$ID <- NULL
> df2
                               name                                     desc
gene1                           hr1                                  protein
gene2                         gene2                                  unknown
gene3 exoc like exoc1 in drosophila fake immunity known for fighting viruses
gene4                           ftp                                like ftp1
What I want is for the df1 row names to be updated using the names in the "name" column of df2. df2 contains all the gene names and their ontological names in the first column; some of those genes are missing in df1.
Expected output:
> df1.new
      sample1 sample2 sample3
hr1         1       0      45
gene2       0       0      56
ftp        50       0      11
I'm not familiar enough with tidyverse to update the names, and the problem is that, because of the way my dataframes are loaded, what I am trying to update are index (row) names. I tried adapting the only similar question I could find (R - replace specific values in df with values from other df by matching row names), but I am trying to update the row names rather than values.
I've tried variations of:
df1 <- df1[na.omit(match(rownames(df1), df2$name)),] # throws an error
library(dplyr)
library(tibble)
rownames_to_column(df1) %>% rows_update(df2 %>% rownames_to_column(df1), by ="rowname") %>% column_to_rownames(df1) # Error, Names repair functions must return a character vector
I'm having trouble because what I want to match and update is an index (the row names), using a column in a second data frame.
CodePudding user response:
Another one (btw, your code does not match the dataframes):
> map = df2$name
> names(map) = rownames(df2)
> df1.new = df1
> rownames(df1.new) = map[rownames(df1)]
> df1.new
      sample1 sample2 sample3
hr1         1       0      45
gene2       0       0      56
exoc       50       0      11
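One caveat with the lookup above: if df1 contains an ID that has no entry in df2, the lookup gives NA, and base R does not allow NA row names. Below is a minimal sketch of one way around that, assuming you want to keep the original ID as a fallback. The data follow the code in the question (third row is gene3), and the gene9 row is hypothetical, added only to exercise the missing case.

```r
# Sample data as created by the code in the question
df1 <- data.frame(sample1 = c(1, 0, 50), sample2 = c(0, 0, 0),
                  sample3 = c(45, 56, 11),
                  row.names = c("gene1", "gene2", "gene3"))
df2 <- data.frame(name = c("hr1", "gene2", "exoc like exoc1 in drosophila", "ftp"),
                  row.names = c("gene1", "gene2", "gene3", "gene4"),
                  stringsAsFactors = FALSE)

# Hypothetical extra row whose ID is not present in df2
df1["gene9", ] <- c(7, 7, 7)

# match() gives the position of each df1 row name in df2, or NA if absent
idx <- match(rownames(df1), rownames(df2))
new_names <- df2$name[idx]

# Fall back to the original ID where no name was found
new_names <- ifelse(is.na(new_names), rownames(df1), new_names)

# Row names must be unique; make.unique() adds a suffix to any duplicates
df1.new <- df1
rownames(df1.new) <- make.unique(new_names)
```

Whether falling back to the ID is right depends on your data; dropping the unmatched rows with `df1[!is.na(idx), ]` before renaming is the other obvious choice.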
CodePudding user response:
The code you have to create df1 and df2 does not match the df1 and df2 that you show, but here is a way to get the result column I think you want. You can then remove any columns you don't want.
library(dplyr)
library(tibble)
library(tidyr)
df1 %>%
  rownames_to_column(var = "gene") %>%
  left_join(
    df2 %>% rownames_to_column(var = "gene"),
    by = "gene"
  ) %>%
  mutate(result = ifelse(desc == "unknown", gene, desc))
# gene sample1 sample2 sample3 name desc
# 1 gene1 1 0 45 hr1 protein
# 2 gene2 0 0 56 Unknown origin unknown
# 3 gene3 50 0 11 exoc like exoc1 in drosophila fake immunity known for fighting viruses
# result
# 1 protein
# 2 gene2
# 3 fake immunity known for fighting viruses
CodePudding user response:
Here is a slightly modified version of @Gregor Thomas's answer:
library(tibble)
library(dplyr)
left_join(df1 %>%
            rownames_to_column("gene"),
          df2 %>%
            rownames_to_column("gene"),
          by = "gene") %>%
  column_to_rownames("name") %>%
  select(starts_with("sample"))
      sample1 sample2 sample3
hr1         1       0      45
gene2       0       0      56
ftp        50       0      11
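If df1 can contain IDs with no entry in df2, the joined name column will hold NA for those rows, and NA cannot be used as a row name. A hedged sketch of the same pipeline with a fallback to the gene ID via dplyr's coalesce() follows. The data here follow the printed df1 in the question (third row is gene4), and the gene9 row is hypothetical, added only to show the fallback.

```r
library(dplyr)
library(tibble)

# df1 as printed in the question, plus a hypothetical gene9 row absent from df2
df1 <- data.frame(sample1 = c(1, 0, 50, 7), sample2 = c(0, 0, 0, 7),
                  sample3 = c(45, 56, 11, 7),
                  row.names = c("gene1", "gene2", "gene4", "gene9"))
df2 <- data.frame(name = c("hr1", "gene2", "exoc like exoc1 in drosophila", "ftp"),
                  row.names = c("gene1", "gene2", "gene3", "gene4"),
                  stringsAsFactors = FALSE)

df1.new <- df1 %>%
  rownames_to_column("gene") %>%
  left_join(df2 %>% rownames_to_column("gene") %>% select(gene, name),
            by = "gene") %>%
  # Use the gene ID wherever the join found no name
  mutate(name = coalesce(name, gene)) %>%
  column_to_rownames("name") %>%
  select(starts_with("sample"))
```

This still assumes the resulting names are unique, since row names cannot be duplicated.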