Classify column values into new column with specific output

Time:12-29

I want to classify the rows of a data frame based on a threshold applied to a numeric reference column. If the reference value is at or below the threshold, the result is 0, which I want to store in a new column. If the reference value is above the threshold, the new column gets the value 1 for every consecutive row above the threshold, until a 0 result comes up again. The next run of values above the threshold gets 2, and so on.

With a threshold of 2 (that is, values greater than 2 count as over the threshold), an example of what I would like to obtain is:

row  reference  result
  1          2       0
  2          1       0
  3          4       1
  4          3       1
  5          1       0
  6          6       2
  7          8       2
  8          4       2
  9          1       0
 10          3       3
 11          6       3

row <- c(1:11)
reference <- c(2,1,4,3,1,6,8,4,1,3,6)
result <- c(0,0,1,1,0,2,2,2,0,3,3)
table <- cbind(row, reference, result)

Thank you!

CodePudding user response:

We can use run-length encoding (rle) for this.

The code below assumes a data.frame named quux (see Data at the end):

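# Runs of consecutive rows at or below the threshold (TRUE) or above it (FALSE)
r <- rle(quux$reference <= 2)
# Number the above-threshold runs; counting !r$values also works if the data starts above the threshold
r$values <- ifelse(r$values, 0, cumsum(!r$values))
# Expand the per-run labels back to one value per row
quux$result2 <- inverse.rle(r)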
quux
#    row reference result result2
# 1    1         2      0       0
# 2    2         1      0       0
# 3    3         4      1       1
# 4    4         3      1       1
# 5    5         1      0       0
# 6    6         6      2       2
# 7    7         8      2       2
# 8    8         4      2       2
# 9    9         1      0       0
# 10  10         3      3       3
# 11  11         6      3       3
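
To make the run-length idea easier to follow, here is a minimal sketch that prints the intermediate runs and wraps the logic in a small function. The name label_runs and its arguments are my own, not part of the answer above, and it tests x > threshold so that the TRUE runs are the ones being numbered.

```r
# Hypothetical helper (an assumption for illustration): label each run of
# values above `threshold` with 1, 2, 3, ... and every other row with 0.
label_runs <- function(x, threshold) {
  r <- rle(x > threshold)                            # TRUE runs are above the threshold
  r$values <- ifelse(r$values, cumsum(r$values), 0)  # number only the TRUE runs
  inverse.rle(r)                                     # one label per original row
}

reference <- c(2, 1, 4, 3, 1, 6, 8, 4, 1, 3, 6)

# Intermediate view of the runs
r <- rle(reference > 2)
r$lengths  # 2 2 1 3 1 2
r$values   # FALSE TRUE FALSE TRUE FALSE TRUE

label_runs(reference, 2)   # 0 0 1 1 0 2 2 2 0 3 3
label_runs(c(4, 1, 5), 2)  # 1 0 2 (data that starts above the threshold)
```

Usage on the data frame would then look like quux$result2 <- label_runs(quux$reference, 2). Missing values in the reference column are not handled in this sketch.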

Data

quux <- structure(list(row = 1:11, reference = c(2, 1, 4, 3, 1, 6, 8, 4, 1, 3, 6), result = c(0, 0, 1, 1, 0, 2, 2, 2, 0, 3, 3)), row.names = c(NA, -11L), class = "data.frame")

CodePudding user response:

As noted in the comments by @Sotos, I would consider an alternative name for your object, since table is also the name of a base R function.

Since it wasn't clear whether you have a data.frame or a matrix, I'll assume a data.frame df built from your data:

df <- as.data.frame(table)

And have a threshold of 2:

threshold = 2

You can adapt this solution by @flodel:

x <- df$reference > threshold   # TRUE where reference exceeds the threshold
df$new_result <- ifelse(
  x,
  cumsum(c(x[1], diff(x) == 1)),
  0)
df

Here, diff(x) returns a vector in which a value of 1 marks each transition from FALSE to TRUE (0 to 1), that is, each row where reference goes from at or below the threshold to above it (rows 3, 6, and 10 in the sample data). cumsum then increments the counter at each of those rows. x[1] is prepended because the diff result is one element shorter than x; it also starts the count at 1 when the very first row is already above the threshold.

The ifelse then keeps these run numbers only for rows where reference exceeds the threshold; all other rows are set to 0.
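
If it helps, the same computation can be unrolled step by step on the sample data. This is only a sketch for inspection; the intermediate names d, starts and run_id are assumptions of mine and do not appear in the solution above.

```r
reference <- c(2, 1, 4, 3, 1, 6, 8, 4, 1, 3, 6)
threshold <- 2

# TRUE where the value is above the threshold
x <- reference > threshold
# FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE

# Row-to-row change: 1 = crossing upward, -1 = crossing downward, 0 = no change
d <- diff(x)
# 0 1 0 -1 1 0 0 -1 1 0   (one element shorter than x)

# Mark the first row of every above-threshold run; x[1] restores the length
starts <- c(x[1], d == 1)
which(starts)  # 3 6 10

# Running count of runs started so far
run_id <- cumsum(starts)
# 0 0 1 1 1 2 2 2 2 3 3

# Keep the count only on rows above the threshold
new_result <- ifelse(x, run_id, 0)
new_result  # 0 0 1 1 0 2 2 2 0 3 3
```

Note that run_id stays at the last run number on rows that dropped back below the threshold (rows 5 and 9), which is why the final ifelse is needed.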

Output

   row reference result new_result
1    1         2      0          0
2    2         1      0          0
3    3         4      1          1
4    4         3      1          1
5    5         1      0          0
6    6         6      2          2
7    7         8      2          2
8    8         4      2          2
9    9         1      0          0
10  10         3      3          3
11  11         6      3          3