How to recode multiple columns in R efficiently?-CodePudding

I need to recode some data. Firstly, iImagine that the the original data looks something like this

A data.frame: 6 × 5
 col1    col2    col3    col4    col5
<chr>   <chr>   <chr>   <chr>   <chr>
   s1  414234  244575  539645  436236
   s2      NA  512342  644252  835325
   s3      NA      NA  816747  475295
   s4      NA      NA      NA  125429
   s5      NA      NA      NA      NA
   s6  617465  844526      NA  194262

which, secondly, is transformed into

A data.frame: 6 × 5
 col1    col2    col3    col4    col5
<chr>   <int>   <int>   <int>   <int>
   s1       4       2       5       4
   s2      NA       5       6       8
   s3      NA       NA      8       4
   s4      NA       NA     NA       1
   s5      NA       NA     NA      NA
   s6       6        8     NA       1

because I am going to recode everything according to the first digit. When, thirdly, recoded (see recoding pattern in MWE below) it should look like this

A data.frame: 6 × 5
 col1    col2    col3    col4    col5
<chr>   <int>   <int>   <int>   <int>
   s1       3       1       3       3
   s2      NA       3       4       5
   s3      NA       NA      5       3
   s4      NA       NA     NA       1
   s5      NA       NA     NA      NA
   s6       4        5     NA       1

and, fourthly, entire rows should be removed if all columns except the first one is empty, that is

A data.frame: 6 × 5
 col1    col2    col3    col4    col5
<chr>   <int>   <int>   <int>   <int>
   s1       3       1       3       3
   s2      NA       3       4       5
   s3      NA       NA      5       3
   s4      NA       NA     NA       1
   s6       4        5     NA       1

which is the ultimate data.

The first and second step were easily implemented but I struggle with the third and fourth step since I am new to R (see MWE below). For the third step, I tried to use mutate over multiple columns but Error in UseMethod("mutate"): no applicable method for 'mutate' applied to an object of class "c('integer', 'numeric')" appeared. The fourth step is easily implemented in Python with thresh but I am not sure if there is an equivalent in R.

How is this possible? Also, I work with huge data, so time-efficient solutions would also be highly appreciated.

library(dplyr)

df <- data.frame(
    col1 = c("s1", "s2", "s3", "s4", "s5", "s6"),
    col2 = c("414234", NA, NA, NA, NA, "617465"),
    col3 = c("244575", "512342", NA, NA, NA, "844526"),
    col4 = c("539645", "644252", "816747", NA, NA, NA),
    col5 = c("436236", "835325", "475295", "125429", NA, "194262")
    )

n = ncol(df)

for (i in colnames(df[2:n])) {
    df[, i] = strtoi(substr(df[, i], 1, 1))
}

for (i in colnames(df[2:n])) {
    df[, i] %>% mutate(i=recode(i, "0": 1, "1": 1, "2": 1, "3": 2, "4": 3, "5": 3, "6": 4, "7": 5, "8": 5))
}

CodePudding user response：

Base R way:

# cut out just the numeric columns
df2 <- as.matrix(df[, -1])
# first digits
df2[] <- substr(df2, 1, 1)
mode(df2) <- 'numeric'
# recode
df2[] <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)[df2 1]
# write back into the original data frame
df[, -1] <- df2
# remove rows with NAs only
df <- df[apply(df[, -1], 1, \(x) !all(is.na(x))), ]

df
#   V1 V2 V3 V4 V5
# 1 s1  3  1  3  3
# 2 s2 NA  3  4  5
# 3 s3 NA NA  5  3
# 4 s4 NA NA NA  1
# 6 s6  4  5 NA  1

As you can see, it is not necessary to do the operations column-wise as they can be performed en bloc, which will be more efficient.

CodePudding user response：

You can do this with a combination of tidyverse packages. We generally avoid for loops in R, unless we really need them. It's almost always preferable to vetorise.

library(dplyr)
library(stringr) # for str_sub
library(purrr)   # for negate

mat = matrix(c(     "s1",     "s2",     "s3",     "s4",     "s5",     "s6",
                    "414234",       NA,       NA,       NA,       NA, "617465", 
                    "244575", "512342",       NA,       NA,       NA, "844526",
                    "539645", "644252", "816747",       NA,       NA,       NA,
                    "436236", "835325", "475295", "125429",       NA, "194262"),
             nrow=6,
             ncol=5
)

df <- as.data.frame(mat)

## Step 1: Extract first character of each element
df <- mutate(df, across(V2:V5, str_sub, 1, 1))
head(df)
#>   V1   V2   V3   V4   V5
#> 1 s1    4    2    5    4
#> 2 s2 <NA>    5    6    8
#> 3 s3 <NA> <NA>    8    4
#> 4 s4 <NA> <NA> <NA>    1
#> 5 s5 <NA> <NA> <NA> <NA>
#> 6 s6    6    8 <NA>    1

## Step 3: Recode
df <- mutate(df, 
             across(V2:V5, 
             recode,
             `0` = "1", `1` = "1", `2` = "1", `3` = "2", 
             `4` = "3", `5` = "3", `6` = "4", `7` = "5", `8` = "5"
             ))

## Step 2: convert all columns to numeric
df <- mutate(df, across(V2:V5, as.numeric))

head(df)
#>   V1 V2 V3 V4 V5
#> 1 s1  3  1  3  3
#> 2 s2 NA  3  4  5
#> 3 s3 NA NA  5  3
#> 4 s4 NA NA NA  1
#> 5 s5 NA NA NA NA
#> 6 s6  4  5 NA  1

## Step 4: filter all rows where every value is numeric
## By purrr::negate()-ing is.na, we can select rows only rows where 
## at least one value is not missing

df <- filter(df, if_any(V2:V5, negate(is.na)))
df
#>   V1 V2 V3 V4 V5
#> 1 s1  3  1  3  3
#> 2 s2 NA  3  4  5
#> 3 s3 NA NA  5  3
#> 4 s4 NA NA NA  1
#> 5 s6  4  5 NA  1

^{Created on 2022-12-13 with reprex v2.0.2}

CodePudding user response：

This one using fancy math

df |>
    pivot_longer(col2:col5, values_to = "val", names_to = "col") |>
    mutate(val = map_dbl(as.integer(val),
                         ~c(1, 1, 1, 2, 3, 3, 4, 5, 5)[.x %/% 10^trunc(log10(.x))  1])) |>
    filter(!is.na(val)) |>
    pivot_wider(values_from = val, names_from = col )


##>   # A tibble: 5 × 5
##>   col1   col2  col3  col4  col5
##>   <chr> <dbl> <dbl> <dbl> <dbl>
##> 1 s1        3     1     3     3
##> 2 s2       NA     3     4     5
##> 3 s3       NA    NA     5     3
##> 4 s4       NA    NA    NA     1
##> 5 s6        4     5    NA     1