I am a junior bioinformatician / data scientist. I am trying to improve my programming skills in R, especially loops.
I want to write a script that calculates RPM (reads per million), a normalized value for read counts in metagenomics. It is calculated as follows: number of reads * 10^6 / total number of reads in the sample (i.e. the sum of that sample's column).
To make this clearer, here is R code that calculates it manually (without a loop):
`dat <- data.frame(x = c(50, 35, 78, 66), y = c(45, 34, 10, 20)) # random data frame
dat # view the data frame
# x y
# 1 50 45
# 2 35 34
# 3 78 10
# 4 66 20
sum_of_the_column_x <- sum(dat$x) # calculate the sum of column x
sum_of_the_column_y <- sum(dat$y) # calculate the sum of column y
dat$rpm_x <- (dat$x * 10^6) / sum_of_the_column_x # calculate the rpm of x, add to the data frame
dat$rpm_y <- (dat$y * 10^6) / sum_of_the_column_y # calculate the rpm of y, add to the data frame
dat # view it again
# x y rpm_x rpm_y
# 1 50 45 218340.6 412844.04
# 2 35 34 152838.4 311926.61
# 3 78 10 340611.4 91743.12
# 4 66 20 288209.6 183486.24
# when you sum up the rpm_x or rpm_y column, it should give 10^6.`
I would like to perform this calculation on many columns at once (I sometimes have more than 100 samples, and thus columns, in a data frame). With that many columns, I wanted to write a for loop. I tried a few ways, but none of them worked (see below).
I feel like I am very close to the answer, but I have run out of ideas. I would be grateful for hints on how to solve this. Also, please let me know if something is unclear in my question. Thank you so much!
`### WHAT I'VE TRIED
dat <- data.frame(x = c(50, 35, 78, 66), y = c(45, 34, 10, 20)) # random data frame
dat # view the data frame
# x y
# 1 50 45
# 2 35 34
# 3 78 10
# 4 66 20
### TRY 1
newdata <- data.frame(x = NA, y = NA)
for (i in dat[,1:2]) {
newdata <- as.data.frame(cbind((i * 1e6 / sum(i))))
}
newdata
# V1
# 1 412844.04
# 2 311926.61
# 3 91743.12
# 4 183486.24
### for some reason only the second column got calculated...
### TRY 2
newdat <- NA
for (i in dat[,1:2]) {
for (j in dat[,1:2]) {
newdat[i] <- i * 10^6
}
}
### now it made a huge table with a lot of NAs.
### TRY 3
# now I had to make a "newdata" dataframe with four rows of NAs, otherwise it doesn't work
newdata <- data.frame(x = rep(NA, 4), y = rep(NA, 4))
for(i in dat[,1:2]) { # for every number i in column
for(j in colnames(newdata[,1:2])) { # for every column of newdata
for(k in colnames(dat[,1:2])) { # for every column of dat
newdata[j] <- i * 10^6 / sum(dat[[k]])
}
}
}
newdata
# x y
# 1 412844.04 412844.04
# 2 311926.61 311926.61
# 3 91743.12 91743.12
# 4 183486.24 183486.24
### Again wrong... Now it is the same column two times.
### TRY 4
newdata <- data.frame(x = rep(NA, 4), y = rep(NA, 4))
for(i in dat[,1:2]) { # for every number i in columns
for(j in colnames(dat[,1:2])) { # for columns of dat
newdata[j] <- i * 10^6 / sum(dat[[j]])
}
}
newdata
# x y
# 1 196506.55 412844.04
# 2 148471.62 311926.61
# 3 43668.12 91743.12
# 4 87336.24 183486.24
### Again wrong - the results of the x are incorrect`
CodePudding user response:
Unnecessary for loops slow things down. There are many other ways of doing this. One is:
newdat <- dat * 10^6 / rep(colSums(dat), each=nrow(dat))
which gives what I think you want
> newdat
x y
1 218340.6 412844.04
2 152838.4 311926.61
3 340611.4 91743.12
4 288209.6 183486.24
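If you still want a loop for practice, here is a minimal sketch of one that should give the same result. The key point is that `for (i in dat[,1:2])` hands you one whole column per iteration, not one number, and that assigning to `newdata` without an index overwrites the previous iteration (which is why TRY 1 only kept the second column). Looping over column names and writing into the matching column avoids both problems. The names `rpm_loop`, `vectorized` and `col` are just illustrative choices, not anything from your code.

```r
# Same toy data as in the question
dat <- data.frame(x = c(50, 35, 78, 66), y = c(45, 34, 10, 20))

# Vectorized form from above, kept only for comparison
vectorized <- dat * 10^6 / rep(colSums(dat), each = nrow(dat))

# Loop form: start from a copy so the result has the same shape and column names
rpm_loop <- dat
for (col in names(dat)) {
  # one iteration handles one whole column; dat[[col]] is a numeric vector
  rpm_loop[[col]] <- dat[[col]] * 10^6 / sum(dat[[col]])
}

# Both approaches should agree, and every RPM column should sum to 10^6
stopifnot(isTRUE(all.equal(as.matrix(rpm_loop), as.matrix(vectorized))))
stopifnot(isTRUE(all.equal(unname(colSums(rpm_loop)), c(1e6, 1e6))))
```

A single loop is enough here because the column name `col` is used both to read from `dat` and to write into `rpm_loop`; the nested loops in TRY 3 and TRY 4 pair every column with every other column, so the last pairing wins.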