I have a dataset similar to the following:
gene_name | gene_length | value1 | value2 | value3 |
---|---|---|---|---|
NameA | 1070 | 100 | 300 | 600 |
NameB | 110 | 200 | 600 | 1200 |
My goal is to create new columns that hold each of the columns value1, value2, value3, ..., value-n divided by the gene_length column.
Something like this:
gene_name | gene_length | value1 | value2 | value3 | value1_result | value2_result | value3_result |
---|---|---|---|---|---|---|---|
NameA | 1070 | 100 | 300 | 600 | 0.0934 | 0.2803 | 0.5607 |
NameB | 110 | 200 | 600 | 1200 | 1.8181 | 5.4545 | 10.9090 |
With only a few columns and rows I could write several mutate calls by hand, but my dataset has more than 50,000 rows and 21 columns.
How could this task be accomplished more efficiently using the tidyverse?
I have read that mutate can be combined with across, but I have not been able to get them to work together.
desired_df <- df %>%
mutate(across(.cols = 3:21, # 21 because of the 21 columns i have in my dataframe
# I need to specify a function to perform the division in the columns i want
# but i dont know how
.names = '{col}_value')) # names of new columns
Answer:
Loop across the 'value' columns, create a lambda function (~) to divide the column (.x) by 'gene_length', and modify .names to create new columns with _result as suffix.
library(dplyr)
desired_df <- df %>%
mutate(across(.cols = starts_with("value"), ~ .x/gene_length,
.names = '{.col}_result'))
Output:
> desired_df
gene_name gene_length value1 value2 value3 value1_result value2_result value3_result
1 NameA 1070 100 300 600 0.09345794 0.2803738 0.5607477
2 NameB 110 200 600 1200 1.81818182 5.4545455 10.9090909
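If the columns to divide do not share a common prefix, one option is to select them by exclusion instead. The following is a minimal sketch, assuming that every column other than gene_name and gene_length should be divided; the toy data frame simply mirrors the example in the question. Note that the glue placeholder in .names is {.col}, not {col} as in the attempt in the question.

```r
library(dplyr)

# Toy data mirroring the example in the question
df <- data.frame(
  gene_name   = c("NameA", "NameB"),
  gene_length = c(1070L, 110L),
  value1      = c(100L, 200L),
  value2      = c(300L, 600L),
  value3      = c(600L, 1200L)
)

# Assumption: every column except the two identifier columns gets divided
desired_df <- df %>%
  mutate(across(.cols  = !c(gene_name, gene_length),
                .fns   = ~ .x / gene_length,
                .names = "{.col}_result"))

desired_df
```

Selecting by exclusion avoids hard-coding positions such as 3:21, so the call keeps working if value columns are added or reordered.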
Or, using data.table:
library(data.table)
nm1 <- grep("value", names(df), value = TRUE)
setDT(df)[, paste0(nm1, "_result") := lapply(.SD, \(x)
x/gene_length), .SDcols = nm1]
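Two details of the data.table version may matter in practice: setDT converts df by reference, and grep("value", ...) also matches the new *_result columns if the code is run a second time. The sketch below is one possible variant, assuming the columns are named value followed by digits; it works on a copy (dt is a name chosen here for illustration) and anchors the pattern.

```r
library(data.table)

# Toy data mirroring the example in the question
df <- data.frame(
  gene_name   = c("NameA", "NameB"),
  gene_length = c(1070L, 110L),
  value1      = c(100L, 200L),
  value2      = c(300L, 600L),
  value3      = c(600L, 1200L)
)

# as.data.table() makes a copy, so df itself stays an unmodified data.frame
dt <- as.data.table(df)

# Anchored pattern: matches value1, value2, ... but not value1_result
nm1 <- grep("^value[0-9]+$", names(dt), value = TRUE)

# Add the *_result columns by reference on the copy
dt[, paste0(nm1, "_result") := lapply(.SD, function(x) x / gene_length),
   .SDcols = nm1]

dt
```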
Data:
df <- structure(list(gene_name = c("NameA", "NameB"), gene_length = c(1070L,
110L), value1 = c(100L, 200L), value2 = c(300L, 600L), value3 = c(600L,
1200L)), class = "data.frame", row.names = c(NA, -2L))