How to correct a mistake made in the levels of a factor variable?

Time:10-08

Let's say I have this data frame:

d = data.frame(x = c("1","1 2", "1 3", "2 3", "3", "4"))
d

and the variable x is stored as a factor:

d$x = as.factor(d$x)

However, I discovered an error in three of the levels that I entered.

So I want to replace these values and their levels as follows:

I want to replace 1 2 with 1

I want to replace 1 3 with 1

I want to replace 2 3 with 2

levels(d$x)

I tried to correct it with the following method:

d$x[which(d$x == "1 2")] <- "1"
d$x[which(d$x == "1 3")] <- "1"
d$x[which(d$x == "2 3")] <- "2"

It creates levels as follows:

1 1 1 2 3 4

What I want is the following levels:

1 2 3 4

What should I do to handle this problem? Thanks.

CodePudding user response:

Another option is turning back to character while modifying:

d$x <- as.character(d$x)
# drop the first space and everything after it, then rebuild the factor
d$x <- factor(sub(" .+", "", d$x))

d$x
# [1] 1 1 1 2 3 4
# Levels: 1 2 3 4

CodePudding user response:

You can use fct_collapse:

library(dplyr)
library(forcats)
d %>% 
  mutate(x = fct_collapse(x, 
                          "1" = c("1", "1 2", "1 3"),
                          "2" = "2 3"))  # "2" is not an existing level, so only "2 3" is listed
#>   x
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 3
#> 6 4

CodePudding user response:

How about this? Split the text on the space, then unnest the resulting lists to long format. This works even when many values are affected, but it assumes that a space marks the error, as in your example.

library(tidyverse)

d <-  data.frame(x = c("1","2", "3 4", "5", "6"))

d |>
  mutate(x = str_split(x, pattern = "\\s")) |>
  unnest_longer(x)
#> # A tibble: 6 x 1
#>   x    
#>   <chr>
#> 1 1    
#> 2 2    
#> 3 3    
#> 4 4    
#> 5 5    
#> 6 6

Edit based on comments: Here are two methods. One with tidyverse and one using base R.

library(tidyverse)
  
d <-  data.frame(x = c("1","2", "3 4", "5", "6"))

d |>
  mutate(x = str_remove(x, "\\s4$")) 
#>   x
#> 1 1
#> 2 2
#> 3 3
#> 4 5
#> 5 6

d$x[which(d$x == "3 4")] <- "3"
d
#>   x
#> 1 1
#> 2 2
#> 3 3
#> 4 5
#> 5 6

Another edit based on more info:

d = data.frame(x = c("1","1 2", "1 3", "2 3", "3", "4"))

# keep the leading digits and drop the space and trailing digits
d$x <- as.factor(gsub("(\\d+)\\s\\d+$", "\\1", d$x))

d
#>   x
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 3
#> 6 4

levels(d$x)
#> [1] "1" "2" "3" "4"

CodePudding user response:

There is also a dedicated function recode() in dplyr for this purpose:

library(dplyr)

## initial factor
x <- factor(c("1","1 2", "1 3", "2 3", "3", "4"))
levels(x)
#> [1] "1"   "1 2" "1 3" "2 3" "3"   "4"

## edited factor
recode(x, "1 2" = "1", "1 3" = "1", "2 3" = "2")
#> [1] 1 1 1 2 3 4
#> Levels: 1 2 3 4

P.S.: you should not edit your question in such a way that it invalidates (previously valid) answers.

CodePudding user response:

Copying from my answer to a recent question:

Under the hood, a factor is an integer vector with labels (levels). You can rename the labels alone without touching the underlying integer codes.

d = data.frame(x = factor(c("1","1 2", "1 3", "2 3", "3", "4")))
levels(d$x)
#> [1] "1"   "1 2" "1 3" "2 3" "3"   "4"

levels(d$x) <- c(1, 1, 1, 2, 3, 4)
levels(d$x)
#> [1] "1" "2" "3" "4"

d$x
#> [1] 1 1 1 2 3 4
#> Levels: 1 2 3 4
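
To see why assigning values directly (as in the question) does not tidy up the levels, it can help to look at the integer codes themselves. This is only a minimal base R sketch; the names f, g and h are hypothetical and just mirror the question's data:

```r
# Hypothetical factor mirroring the question's data
f <- factor(c("1", "1 2", "1 3", "2 3", "3", "4"))

# The stored data are integer codes that index into the level labels
codes <- as.integer(f)
codes
levels(f)

# "2" is not an existing level, so assigning it produces NA.
# Without suppressWarnings() R warns: invalid factor level, NA generated
g <- f
suppressWarnings(g[g == "2 3"] <- "2")
is.na(g[4])

# Assigning an existing level works, but the now-unused level stays around
h <- f
h[h == "1 2"] <- "1"
levels(h)              # still contains "1 2"
levels(droplevels(h))  # "1 2" is gone once unused levels are dropped
```

Renaming the levels, as shown above, avoids both problems because the codes are never touched.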

If you have more levels and don't want to risk a manual assignment, you can create a dictionary of replacement values:

d = data.frame(x = factor(c("1","1 2", "1 3", "2 3", "3", "4")))
dict <- setNames(
    gsub(' .$', '', levels(d$x)), # remove a trailing space plus the single character after it
    levels(d$x)
)
dict
#>   1 1 2 1 3 2 3   3   4 
#> "1" "1" "1" "2" "3" "4" 

You can then use the dictionary to replace existing level labels with new ones

levels(d$x) <- dict[levels(d$x)]
d$x
#> [1] 1 1 1 2 3 4
#> Levels: 1 2 3 4
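
As a possible alternative to building the dictionary, the levels replacement function for factors in base R also accepts a named list that maps each new label to the old levels it should absorb. The following is only a sketch, with f as a hypothetical stand-in for d$x. As far as I can tell, any old level left out of the list becomes NA, so every level is listed explicitly here:

```r
# Hypothetical stand-in for d$x from the question
f <- factor(c("1", "1 2", "1 3", "2 3", "3", "4"))

# Each list name is a new label; each element holds the old levels to merge into it
levels(f) <- list(
  "1" = c("1", "1 2", "1 3"),
  "2" = "2 3",
  "3" = "3",
  "4" = "4"
)

f
levels(f)
```

The list form is more verbose than the regex-based dictionary, but it spells out every merge, which may be easier to review when the corrections do not follow a single pattern.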