R: Using loop to calculate reads per million (number multiplied by 10^6 divided by the sum of the nu

Time:11-11

I am a junior bioinformatician / data scientist. I am trying to improve my programming skills in R, especially loops.

I wanted to write a script in which I would calculate RPM (reads per million) - a normalized value for read counts in metagenomics. This value is calculated as follows: number of reads * 10^6 / total number of reads in a sample (thus in a column).

To make it clearer, here is R code that calculates it manually (that is, without a loop):

`dat <- data.frame(x = c(50, 35, 78, 66), y = c(45, 34, 10, 20)) # random data frame

dat # view the data frame
#    x  y
# 1 50 45
# 2 35 34
# 3 78 10
# 4 66 20

sum_of_the_column_x <- sum(dat$x) # calculate the sum of column x
sum_of_the_column_y <- sum(dat$y) # calculate the sum of column y
dat$rpm_x <- (dat$x * 10^6) / sum_of_the_column_x # calculate the rpm of x, add to the data frame
dat$rpm_y <- (dat$y * 10^6) / sum_of_the_column_y # calculate the rpm of y, add to the data frame

dat # view it again

#    x  y    rpm_x     rpm_y
# 1 50 45 218340.6 412844.04
# 2 35 34 152838.4 311926.61
# 3 78 10 340611.4  91743.12
# 4 66 20 288209.6 183486.24


# when you sum up the rpm_x or rpm_y column, it should give 10^6.`

I would like to perform this calculation on many columns at the same time (I sometimes have more than 100 samples, and thus columns, in a data frame). Because there are so many columns, I wanted to write a for loop. I tried a few ways, but none of them really worked (see below).

I feel like I am very close to the answer, but I have run out of ideas. I would be grateful for hints on how to solve this issue. Also, please let me know if something is unclear in my question. Thank you so much!

`### WHAT I'VE TRIED

dat <- data.frame(x = c(50, 35, 78, 66), y = c(45, 34, 10, 20)) # random data frame

dat # view the data frame
#    x  y
# 1 50 45
# 2 35 34
# 3 78 10
# 4 66 20

### TRY 1

newdata <- data.frame(x = NA, y = NA)

for (i in dat[,1:2]) {
    newdata <- as.data.frame(cbind((i * 1e6 / sum(i))))
  }

newdata

#           V1
# 1 412844.04
# 2 311926.61
# 3  91743.12
# 4 183486.24

### for some reason only the second column got calculated...

### TRY 2

newdat <- NA

for (i in dat[,1:2]) {
  for (j in dat[,1:2]) {
  newdat[i] <- i * 10^6
  }
}
  
### now it made a huge table with a lot of NAs.

### TRY 3 
# now I had to make a "newdata" dataframe with four rows of NAs, otherwise it doesn't work

newdata <- data.frame(x = rep(NA, 4), y = rep(NA, 4)) 


for(i in dat[,1:2]) { # for every number i in column
  for(j in colnames(newdata[,1:2])) { # for every column of newdata
    for(k in colnames(dat[,1:2])) { # for every column of dat
  newdata[j] <- i * 10^6  / sum(dat[[k]])
    }
  }  
}

newdata

#            x         y
# 1 412844.04 412844.04
# 2 311926.61 311926.61
# 3  91743.12  91743.12
# 4 183486.24 183486.24

### Again wrong... Now it is the same column two times.


### TRY 4

newdata <- data.frame(x = rep(NA, 4), y = rep(NA, 4))

for(i in dat[,1:2]) { # for every number i in columns
  for(j in colnames(dat[,1:2])) { # for columns of dat
      newdata[j] <- i * 10^6  / sum(dat[[j]])
    }
  }  

newdata

#            x         y
# 1 196506.55 412844.04
# 2 148471.62 311926.61
# 3  43668.12  91743.12
# 4  87336.24 183486.24


### Again wrong - the results of the x are incorrect`

CodePudding user response:

Unnecessary for loops slow things down. There are many other ways of doing this. One is:

newdat <- dat * 10^6 / rep(colSums(dat), each=nrow(dat))

which gives what I think you want

> newdat
         x         y
1 218340.6 412844.04
2 152838.4 311926.61
3 340611.4  91743.12
4 288209.6 183486.24
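
If you still want a loop for practice, here is a minimal sketch (it is not part of the one-liner above). It assumes every column of `dat` is numeric; `newdat_loop` and `newdat_lapply` are names made up for this illustration. The attempts in the question iterate over `dat[, 1:2]`, which yields one whole column per iteration, so `i` is a vector rather than a single number, and assigning to `newdata` inside the loop overwrites the previous result. Looping over column names avoids both problems:

```r
dat <- data.frame(x = c(50, 35, 78, 66), y = c(45, 34, 10, 20))

# Copy dat so the result has the same shape and column names.
newdat_loop <- dat

# Loop over column *names*: each iteration reads one column of dat
# and writes the matching column of the result.
for (col in colnames(dat)) {
  newdat_loop[[col]] <- dat[[col]] * 1e6 / sum(dat[[col]])
}

# The same idea without an explicit loop: lapply() visits each column once.
newdat_lapply <- as.data.frame(lapply(dat, function(v) v * 1e6 / sum(v)))

newdat_loop
colSums(newdat_loop)  # each column should sum to 1e6, up to floating-point rounding
```

Both versions should agree with `newdat` from the one-liner above; `all.equal(newdat_loop, newdat_lapply)` is a quick way to check.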