R: reset values in data frame based on a value range

Time:11-19

I have a data frame like so:

# generate data frame
# build the columns directly so that Pos is numeric from the start
# (cbind() would first coerce everything to a character matrix)
df = data.frame(Chr = c('Chr1', 'Chr1', 'Chr1', 'Chr2', 'Chr2', 'Chr2', 'Chr3', 'Chr3', 'Chr4', 'Chr4', 'Chr5'),
                Pos = c(121, 1567, 2489, 23, 565, 1789, 551, 1987, 25, 2356, 1111))
df
    Chr  Pos
1  Chr1  121
2  Chr1 1567
3  Chr1 2489
4  Chr2   23
5  Chr2  565
6  Chr2 1789
7  Chr3  551
8  Chr3 1987
9  Chr4   25
10 Chr4 2356
11 Chr5 1111

Now I want to change the values in the column Pos depending on their current value. For example, if the value in Pos is <= 1000, it should be set to 500; if it is > 1000 but <= 2000, it should be set to 1500; and so on.

The straightforward approach would look like this:

# alter dataframe
df$Pos = ifelse(df$Pos <= 1000, 500, df$Pos)
df$Pos = ifelse(df$Pos <= 2000 & df$Pos > 1000, 1500, df$Pos)
df$Pos = ifelse(df$Pos <= 3000 & df$Pos > 2000, 2500, df$Pos)
df
    Chr  Pos
1  Chr1  500
2  Chr1 1500
3  Chr1 2500
4  Chr2  500
5  Chr2  500
6  Chr2 1500
7  Chr3  500
8  Chr3 1500
9  Chr4  500
10 Chr4 2500
11 Chr5 1500

This gives the desired output. However, my real dataset is much larger, and I cannot add a separate condition for each range of values that I want to reset. I am therefore looking for a more general solution. This is my attempt:

# generate reference vectors
bin = seq(from = 1000, by = 1000, length.out = 3)
pos = seq(from = 500, by = 1000, length.out = 3)

# reset values
df$Pos = ifelse(df$Pos <= bin & df$Pos > bin-1000, pos, df$Pos)
df

However, this throws a warning message:

Warning messages:
1: In df$Pos <= bin :
  longer object length is not a multiple of shorter object length
2: In df$Pos > bin - 1000 :
  longer object length is not a multiple of shorter object length

And the output looks wrong (some values have been reset, others have not been):

> df
    Chr  Pos
1  Chr1  500
2  Chr1 1500
3  Chr1 2500
4  Chr2  500
5  Chr2  565
6  Chr2 1789
7  Chr3  500
8  Chr3 1500
9  Chr4   25
10 Chr4 2356
11 Chr5 1500

I also tried solving the problem with Map, but that did not work either. Here is my attempt:

df2 = Map(function(bin, bin2, pos) {
  df2 = ifelse(df$Pos <= 1000 & df$Pos > bin2, pos, df$Pos)
}, bin, bin-1000, pos)
df2
[[1]]
 [1]  500 1567 2489  500  500 1789  500 1987  500 2356 1111

[[2]]
 [1]  121 1567 2489   23  565 1789  551 1987   25 2356 1111

[[3]]
 [1]  121 1567 2489   23  565 1789  551 1987   25 2356 1111

I feel like I am approaching this problem from a totally wrong angle. Does anyone have an idea how to solve this?

CodePudding user response:
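
The vectorized ifelse attempt fails because of how R recycles vectors: bin has 3 elements and df$Pos has 11, so each row is compared against only one bin, not against all of them. The warning appears because 11 is not a multiple of 3. A small sketch to make the pairing visible (pos_values and recycled_bin are illustrative names, not from the question):

```r
pos_values <- c(121, 1567, 2489, 23, 565, 1789, 551, 1987, 25, 2356, 1111)
bin <- seq(from = 1000, by = 1000, length.out = 3)

# R recycles `bin` to the length of the longer vector, so row i is
# compared only against bin[(i - 1) %% 3 + 1], not against every bin.
recycled_bin <- rep(bin, length.out = length(pos_values))

# Row 5 (565) is paired with bin 2000, so the test `565 > 2000 - 1000`
# fails and the value is left unchanged.
data.frame(Pos = pos_values, compared_with = recycled_bin)
```

A row is reset only when its value happens to fall into the one bin it is paired with, which is why some rows change and others do not.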

You can use cut or findInterval.

bin = c(0, seq(from = 1000, by = 1000, length.out = 3))
pos = seq(from = 500, by = 1000, length.out = 3)

df$new_value <- cut(df$Pos, breaks = bin, labels = pos)
# cut returns a factor; convert it to numeric via character
df$new_value <- as.numeric(as.character(df$new_value))
df

#    Chr  Pos new_value
#1  Chr1  121       500
#2  Chr1 1567      1500
#3  Chr1 2489      2500
#4  Chr2   23       500
#5  Chr2  565       500
#6  Chr2 1789      1500
#7  Chr3  551       500
#8  Chr3 1987      1500
#9  Chr4   25       500
#10 Chr4 2356      2500
#11 Chr5 1111      1500

For clarity, I have created a new column, new_value. You can overwrite the original column Pos instead if that is what you want.
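
findInterval works along the same lines and returns numeric indices directly, so no factor conversion is needed. The following is a minimal sketch, assuming all bins have the same width and all positions are positive; width, n_bins, breaks, mids and idx are illustrative names. The number of bins is derived from the data, so it should scale to a larger dataset without listing each range by hand.

```r
# Example data, as in the question
df <- data.frame(
  Chr = c("Chr1", "Chr1", "Chr1", "Chr2", "Chr2", "Chr2",
          "Chr3", "Chr3", "Chr4", "Chr4", "Chr5"),
  Pos = c(121, 1567, 2489, 23, 565, 1789, 551, 1987, 25, 2356, 1111)
)

width <- 1000  # bin width (assumption: all bins have equal width)

# Build enough breaks to cover the largest position: 0, 1000, 2000, 3000
n_bins <- ceiling(max(df$Pos) / width)
breaks <- seq(from = 0, by = width, length.out = n_bins + 1)
mids   <- breaks[-1] - width / 2   # 500, 1500, 2500

# left.open = TRUE gives intervals (0, 1000], (1000, 2000], ...
# matching the `> lower & <= upper` conditions in the question
idx <- findInterval(df$Pos, breaks, left.open = TRUE)
df$new_value <- mids[idx]
df
```

Values of 0 or below would get index 0 and break the assignment, so they would need separate handling if they can occur in the real data.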
