Binarize data across multiple columns

Hi, I have a data frame in which the values in 13 columns are coded from 1 to 5. I want to replace 1 and 2 with 0, replace 3 and 4 with 1, and drop 5. How can I change my current data without mutating each column separately? There are 13 columns that need to be changed.

CodePudding user response：

You can try the code below (data borrowed from @benson23, thanks!)

> df[] <- (df >= 3) * NA^(df == 5)

> df
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
1  1 NA NA  0  0  1  0 NA NA   0   0   0  NA
2  1  1  1  0  0 NA  0 NA NA   0   0   0  NA
3  0  0  1 NA  1  1  0  1  1   1   1   0   1
4  0  0  0  1  1  0  1  0 NA   0   1   0   0
5  1  1  1  0  0 NA  1  0  0  NA   1   1   1

df >= 3 yields a logical matrix of TRUE and FALSE values.
NA^(df == 5) yields a matrix of NA and 1 values, since NA^0 = 1 and NA^1 = NA; this matrix acts as a mask.
The element-wise product of the two matrices keeps the entries where the mask is 1, turns the masked entries into NA, and converts the logical values to numeric.
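
The masking trick can be checked on a single vector. This is only a small sketch; the vector x and the names mask and result are made up for illustration.

```r
# NA raised to the power 0 is 1, while NA raised to the power 1 stays NA.
NA^0
NA^1

# Toy vector covering every code from 1 to 5 (illustrative data only).
x <- c(1L, 2L, 3L, 4L, 5L)

# Mask: NA where x is 5, 1 everywhere else.
mask <- NA^(x == 5)

# Multiplying coerces TRUE/FALSE to 1/0 and propagates the NA from the mask.
result <- (x >= 3) * mask
result
```

The same expression works on a whole data frame because the comparison operators return a matrix with one entry per cell.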

CodePudding user response：

I would either perform the conversion in two steps (since there are two rules), or write a function that encapsulates your rules and apply it. I’ll use mutate from ‘dplyr’ below, since that seems to be what you’re using.

Here’s the two-step process:

library(dplyr)

df |>
    mutate(across(everything(), ~ replace(.x, .x == 5L, NA))) |>  # rule 1: 5 becomes NA
    mutate(across(everything(), ~ .x >= 3L))                      # rule 2: TRUE for 3 and 4, FALSE for 1 and 2

And here it is using a function:

myrule = function (x) {
    if_else(x == 5L, NA, x >= 3L)
}

df |> mutate(across(everything(), myrule))

Here it is crucial that you give the function myrule a descriptive name that fits your problem domain.
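
Note that both variants above return logical columns (TRUE, FALSE, NA). If you need the 0/1 coding from the question, one option is to convert inside the function. The following is only a sketch: the name binarize_score and the toy data are hypothetical, so pick a name that matches what your 1 to 5 scale actually measures.

```r
library(dplyr)

# Hypothetical helper: 1-2 become 0L, 3-4 become 1L, 5 becomes NA.
binarize_score <- function(x) {
  if_else(x == 5L, NA_integer_, as.integer(x >= 3L))
}

# Toy data for illustration only.
toy <- tibble(V1 = c(1L, 3L, 5L), V2 = c(4L, 2L, 5L))

# The result is a new tibble; `toy` itself is left unchanged.
recoded <- toy |> mutate(across(everything(), binarize_score))
recoded
```

To restrict the recoding to your 13 columns, replace everything() with a selection such as a character vector of column names wrapped in all_of().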

CodePudding user response：

Let's say we have this simulated data frame:

set.seed(123)

df <- as.data.frame(matrix(sample(1:5, 5 * 13, replace = TRUE), ncol = 13))

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
1  2  1  5  5  5  4  5  3  1   1   1   2   4
2  4  3  3  2  4  3  5  4  3   2   3   1   1
3  3  5  4  1  4  3  4  2  3   3   4   4   2
4  5  3  3  2  5  2  4  2  2   2   1   5   2
5  5  3  1  5  4  1  1  2  1   5   3   2   5

Base R

We can first set the entries where df == 5 to NA, then use a logical expression to test whether the values are greater than or equal to 3 (proposed by @danlooo in the comments).

The unary plus in +(df >= 3) converts the logical output of df >= 3 to integer.

df[df == 5] <- NA
df <- as.data.frame(+(df >= 3))
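
If only some of the columns need recoding and the original data frame should stay intact, the same two steps could be applied to a copy. This is a minimal sketch; the toy data and the names cols and recoded are assumptions for illustration.

```r
# Toy data: an id column that must not be touched, plus two coded columns.
df <- data.frame(id = c("a", "b", "c"),
                 V1 = c(1L, 3L, 5L),
                 V2 = c(4L, 2L, 5L))

# Names of the columns to recode (in the question there would be 13 of them).
cols <- c("V1", "V2")

# Work on a copy so that df keeps the original codes.
recoded <- df
recoded[cols] <- lapply(df[cols], function(x) {
  x[x == 5L] <- NA   # drop the 5s
  +(x >= 3L)         # unary plus turns TRUE/FALSE into 1/0
})
recoded
```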

dplyr

Alternatively, we can combine mutate with across from the dplyr package.

library(dplyr)

df <- df %>% mutate(across(everything(), ~case_when(.x %in% 1:2 ~ 0, 
                                                    .x %in% 3:4 ~ 1, 
                                                    TRUE ~ NA_real_)))

Output

df
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
1  0  0 NA NA NA  1 NA  1  0   0   0   0   1
2  1  1  1  0  1  1 NA  1  1   0   1   0   0
3  1 NA  1  0  1  1  1  0  1   1   1   1   0
4 NA  1  1  0 NA  0  1  0  0   0   0  NA   0
5 NA  1  0 NA  1  0  0  0  0  NA   1   0  NA

Data

Here's the dput(df) for easier data loading.

structure(list(V1 = c(2L, 4L, 3L, 5L, 5L), V2 = c(1L, 3L, 5L, 
3L, 3L), V3 = c(5L, 3L, 4L, 3L, 1L), V4 = c(5L, 2L, 1L, 2L, 5L
), V5 = c(5L, 4L, 4L, 5L, 4L), V6 = c(4L, 3L, 3L, 2L, 1L), V7 = c(5L, 
5L, 4L, 4L, 1L), V8 = c(3L, 4L, 2L, 2L, 2L), V9 = c(1L, 3L, 3L, 
2L, 1L), V10 = c(1L, 2L, 3L, 2L, 5L), V11 = c(1L, 3L, 4L, 1L, 
3L), V12 = c(2L, 1L, 4L, 5L, 2L), V13 = c(4L, 1L, 2L, 2L, 5L)), class = "data.frame", row.names = c(NA, 
-5L))