I have two big dataframes. I will use dummy data to explain what I need.
Say I have this gene expression data, lets call it df1
:
structure(list(Sample1 = c(102, 34, 77, 13, 10), Sample2 = c(1000,
23, 199, 0, 1), Sample3 = c(293, 234, 891, 230, 2002), Sample4 = c(672,
756, 431, 293, 0), Sample5 = c(28, 2910, 0, 120, 129)), class = "data.frame", row.names = c("GeneA",
"GeneB", "GeneC", "GeneD", "GeneE"))
And I have the metadata, which is additional information about the samples in df1
. Lets call it df2
:
structure(list(Age = c(54, 70, 49, 23, 90), Patient = c(10, 10,
42, 108, 20)), class = "data.frame", row.names = c("Sample1",
"Sample2", "Sample3", "Sample4", "Sample5"))
The colnames in df1
are the same as the rownames as df2
.
If you look at df2
, you'll see the column Patient
. I want for the samples that have the same value in the Patient column, to calculate their mean and make them one united sample in df1
.
So here, both Sample1
and Sample2
have the value 10 in the Patient column (in df2
). Therefore, those two columns in df1
, I want to make them one column, which is their average. Please note that This is just a dummy, so there might be more than only two samples that share the same value in this column. Sample5 or Sample6 and Sample7 might also share the same value.
What do I mean their average? That is to calculate the sum of each row individually and devide by two.
So here, the first row (for GeneA) for this new united sample would be: (102 1000)/2 = 551
The next is (34 23)/2 = 28.5
CodePudding user response:
We may split
the row names of 'df2' by the 'Patient' column into a list
. Then, if the length
of the list elements are greater than 1, get the average by looping over the list and applying rowMeans
and then create new columns in 'df1'
lst1 <- split(row.names(df2), df2$Patient)
names(lst1) <- paste0(sapply(lst1, paste, collapse = "_"), "_average")
i1 <- lengths(lst1) > 1
df1[names(lst1)[i1]] <-lapply(lst1[i1],
function(nm) rowMeans(df1[nm], na.rm = TRUE))
-output
> df1
Sample1 Sample2 Sample3 Sample4 Sample5 Sample1_Sample2_average
GeneA 102 1000 293 672 28 551.0
GeneB 34 23 234 756 2910 28.5
GeneC 77 199 891 431 0 138.0
GeneD 13 0 230 293 120 6.5
GeneE 10 1 2002 0 129 5.5