I need to recode some data. Firstly, iImagine that the the original data looks something like this
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <chr> <chr> <chr> <chr>
s1 414234 244575 539645 436236
s2 NA 512342 644252 835325
s3 NA NA 816747 475295
s4 NA NA NA 125429
s5 NA NA NA NA
s6 617465 844526 NA 194262
which, secondly, is transformed into
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 4 2 5 4
s2 NA 5 6 8
s3 NA NA 8 4
s4 NA NA NA 1
s5 NA NA NA NA
s6 6 8 NA 1
because I am going to recode everything according to the first digit. When, thirdly, recoded (see recoding pattern in MWE
below) it should look like this
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 3 1 3 3
s2 NA 3 4 5
s3 NA NA 5 3
s4 NA NA NA 1
s5 NA NA NA NA
s6 4 5 NA 1
and, fourthly, entire rows should be removed if all columns except the first one is empty, that is
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 3 1 3 3
s2 NA 3 4 5
s3 NA NA 5 3
s4 NA NA NA 1
s6 4 5 NA 1
which is the ultimate data.
The first and second step were easily implemented but I struggle with the third and fourth step since I am new to R
(see MWE
below). For the third step, I tried to use mutate
over multiple columns but Error in UseMethod("mutate"): no applicable method for 'mutate' applied to an object of class "c('integer', 'numeric')"
appeared. The fourth step is easily implemented in Python
with thresh
but I am not sure if there is an equivalent in R
.
How is this possible? Also, I work with huge data, so time-efficient solutions would also be highly appreciated.
library(dplyr)
df <- data.frame(
col1 = c("s1", "s2", "s3", "s4", "s5", "s6"),
col2 = c("414234", NA, NA, NA, NA, "617465"),
col3 = c("244575", "512342", NA, NA, NA, "844526"),
col4 = c("539645", "644252", "816747", NA, NA, NA),
col5 = c("436236", "835325", "475295", "125429", NA, "194262")
)
n = ncol(df)
for (i in colnames(df[2:n])) {
df[, i] = strtoi(substr(df[, i], 1, 1))
}
for (i in colnames(df[2:n])) {
df[, i] %>% mutate(i=recode(i, "0": 1, "1": 1, "2": 1, "3": 2, "4": 3, "5": 3, "6": 4, "7": 5, "8": 5))
}
CodePudding user response:
Base R way:
# cut out just the numeric columns
df2 <- as.matrix(df[, -1])
# first digits
df2[] <- substr(df2, 1, 1)
mode(df2) <- 'numeric'
# recode
df2[] <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)[df2 1]
# write back into the original data frame
df[, -1] <- df2
# remove rows with NAs only
df <- df[apply(df[, -1], 1, \(x) !all(is.na(x))), ]
df
# V1 V2 V3 V4 V5
# 1 s1 3 1 3 3
# 2 s2 NA 3 4 5
# 3 s3 NA NA 5 3
# 4 s4 NA NA NA 1
# 6 s6 4 5 NA 1
As you can see, it is not necessary to do the operations column-wise as they can be performed en bloc, which will be more efficient.
CodePudding user response:
You can do this with a combination of tidyverse packages. We generally avoid for loops in R, unless we really need them. It's almost always preferable to vetorise.
library(dplyr)
library(stringr) # for str_sub
library(purrr) # for negate
mat = matrix(c( "s1", "s2", "s3", "s4", "s5", "s6",
"414234", NA, NA, NA, NA, "617465",
"244575", "512342", NA, NA, NA, "844526",
"539645", "644252", "816747", NA, NA, NA,
"436236", "835325", "475295", "125429", NA, "194262"),
nrow=6,
ncol=5
)
df <- as.data.frame(mat)
## Step 1: Extract first character of each element
df <- mutate(df, across(V2:V5, str_sub, 1, 1))
head(df)
#> V1 V2 V3 V4 V5
#> 1 s1 4 2 5 4
#> 2 s2 <NA> 5 6 8
#> 3 s3 <NA> <NA> 8 4
#> 4 s4 <NA> <NA> <NA> 1
#> 5 s5 <NA> <NA> <NA> <NA>
#> 6 s6 6 8 <NA> 1
## Step 3: Recode
df <- mutate(df,
across(V2:V5,
recode,
`0` = "1", `1` = "1", `2` = "1", `3` = "2",
`4` = "3", `5` = "3", `6` = "4", `7` = "5", `8` = "5"
))
## Step 2: convert all columns to numeric
df <- mutate(df, across(V2:V5, as.numeric))
head(df)
#> V1 V2 V3 V4 V5
#> 1 s1 3 1 3 3
#> 2 s2 NA 3 4 5
#> 3 s3 NA NA 5 3
#> 4 s4 NA NA NA 1
#> 5 s5 NA NA NA NA
#> 6 s6 4 5 NA 1
## Step 4: filter all rows where every value is numeric
## By purrr::negate()-ing is.na, we can select rows only rows where
## at least one value is not missing
df <- filter(df, if_any(V2:V5, negate(is.na)))
df
#> V1 V2 V3 V4 V5
#> 1 s1 3 1 3 3
#> 2 s2 NA 3 4 5
#> 3 s3 NA NA 5 3
#> 4 s4 NA NA NA 1
#> 5 s6 4 5 NA 1
Created on 2022-12-13 with reprex v2.0.2
CodePudding user response:
This one using fancy math
df |>
pivot_longer(col2:col5, values_to = "val", names_to = "col") |>
mutate(val = map_dbl(as.integer(val),
~c(1, 1, 1, 2, 3, 3, 4, 5, 5)[.x %/% 10^trunc(log10(.x)) 1])) |>
filter(!is.na(val)) |>
pivot_wider(values_from = val, names_from = col )
##> # A tibble: 5 × 5
##> col1 col2 col3 col4 col5
##> <chr> <dbl> <dbl> <dbl> <dbl>
##> 1 s1 3 1 3 3
##> 2 s2 NA 3 4 5
##> 3 s3 NA NA 5 3
##> 4 s4 NA NA NA 1
##> 5 s6 4 5 NA 1