How do I set multiple existing columns in a dataframe to factors?

Time:04-04

Suppose I have 10 columns (a through j), and columns b and g are currently numeric but should be factors. How can I convert these specific columns in place, rather than calling as.factor() on each one and adding the result back to my existing data frame (called dataInExample in this example)?

The libraries I am using are dplyr and tidyverse.

dataInExample$b <- as.factor(dataInExample$b) # rather than converting each column like this and adding it back
# dataInExample <- (some R code that converts columns b and g to factors in dataInExample)
# Would select() be used here? If so, how would I use it to convert these specific columns into factors without having to re-add them to my data frame?

My original table (all columns except b and g are numeric; b and g have two levels representing high/low and good/bad, respectively):

a b c d e f g h i j
1 1 2 1 1 1 1 1 8 1
1 2 1 5 1 1 2 6 1 1

CodePudding user response:

The tidyverse provides across(), which you can use to modify multiple variables at once. You can select the columns to change by a condition (e.g., all numeric variables), by name, or by position. Examples of each using the iris data are below.

library(tidyverse)

# 1. Convert all numeric columns to factor
numeric_to_factor = iris %>% 
  mutate(across(where(is.numeric), factor))

# 2. Convert columns by name to factor
factor_cols = c("Sepal.Length", "Sepal.Width",  "Petal.Length", "Petal.Width")
names_to_factor = iris %>% 
  mutate(across(all_of(factor_cols), factor))

# 3. Convert columns 1 to 4 (location) to factor
factor_cols = 1:4
location_to_factor = iris %>% 
  mutate(across(all_of(factor_cols), factor))
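
To confirm that a conversion like example 1 did what was intended, one option is to inspect the column classes afterwards. This is a minimal sketch, assuming dplyr 1.0.0 or later and the built-in iris data:

```r
library(dplyr)

# Same conversion as example 1, on the built-in iris data
numeric_to_factor <- iris %>%
  mutate(across(where(is.numeric), factor))

# Every column of the result should now report class "factor"
sapply(numeric_to_factor, class)

# mutate() returns a new data frame, so iris itself is unchanged
sapply(iris, class)
```

Because mutate() does not modify its input, the result has to be assigned to a name (as in the examples here) for the conversion to be kept.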

CodePudding user response:

This should get what you need:

# load library
library(tidyverse)

# replicate example data
dat <- tibble::tribble(
         ~a, ~b, ~c, ~d, ~e, ~f, ~g, ~h, ~i, ~j,
         1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 8L, 1L,
         1L, 2L, 1L, 5L, 1L, 1L, 2L, 6L, 1L, 1L
         )

# mutate across all of specified columns
dat |> 
  mutate(across(all_of(c("b","g")), as.factor))

Resulting in:

# A tibble: 2 × 10
      a b         c     d     e     f g         h     i     j
  <int> <fct> <int> <int> <int> <int> <fct> <int> <int> <int>
1     1 1         2     1     1     1 1         1     8     1
2     1 2         1     5     1     1 2         6     1     1
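
If the factors should carry the high/low and good/bad labels mentioned in the question, a sketch along these lines may help. The mapping of 1 and 2 to labels is an assumption, since the question does not say which code means which, and the data frame is rebuilt here only to keep the example self-contained:

```r
library(dplyr)

# Stand-in for the question's data frame (values copied from its table)
dataInExample <- tibble(
  a = c(1, 1), b = c(1, 2), c = c(2, 1), d = c(1, 5), e = c(1, 1),
  f = c(1, 1), g = c(1, 2), h = c(1, 6), i = c(8, 1), j = c(1, 1)
)

# Assign the result back so the change is kept.
# ASSUMPTION: 1 = "low"/"bad" and 2 = "high"/"good"; swap the labels
# if your data are coded the other way around.
dataInExample <- dataInExample %>%
  mutate(
    b = factor(b, levels = c(1, 2), labels = c("low", "high")),
    g = factor(g, levels = c(1, 2), labels = c("bad", "good"))
  )

# Inspect the resulting levels
levels(dataInExample$b)
levels(dataInExample$g)
```

Here each column gets its own factor() call because the labels differ. across() is the better fit when the same function applies to every selected column.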

CodePudding user response:

The first argument to across() is the set of columns to change, and the second argument is the function to apply to each selected column.

I noticed that some other answers used all_of() inside across(), which is not necessary in this case. all_of() is meant for selecting columns whose names are stored in a character vector, and it is strict: if any column named in all_of() is not present in the data frame, it throws an error. When you type the column names directly, c(b, g) is enough.

library(dplyr)

# dataInExample is the data frame from the question
dataInExample %>% mutate(across(c(b, g), as.factor))

# A tibble: 2 × 10
      a b         c     d     e     f g         h     i     j
  <int> <fct> <int> <int> <int> <int> <fct> <int> <int> <int>
1     1 1         2     1     1     1 1         1     8     1
2     1 2         1     5     1     1 2         6     1     1
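
The strictness of all_of() can be seen by comparing it with any_of() on a name that does not exist. This is a minimal sketch: the data frame dat, the vector wanted, and the missing column "z" are hypothetical names made up for illustration.

```r
library(dplyr)

# Hypothetical data with columns b and g but no column named "z"
dat <- tibble(a = c(1, 1), b = c(1, 2), g = c(1, 2))

# Column names stored in a character vector; "z" is not in dat
wanted <- c("b", "g", "z")

# any_of() skips names that are missing and converts the rest
lenient <- dat %>% mutate(across(any_of(wanted), as.factor))

# all_of() is strict: the same call fails because "z" is missing,
# so the error is caught here and replaced with a message
strict <- tryCatch(
  dat %>% mutate(across(all_of(wanted), as.factor)),
  error = function(e) "error: a requested column is missing"
)
```

Which helper to use depends on whether a missing column should stop the script (all_of()) or be ignored (any_of()).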