I have the following data frame in R. Is there a way to clean this column with dplyr so that all of the "y" or "yes" entries are displayed as "Yes" (and, similarly, all of the "nop" entries are displayed as "No")?
structure(list(has_elevator = c("Yes", "y", "y", "yes", "y",
"Yes", "yes", "y", "Yes", "yes", "yes", "Yes", "Yes", "y", "Yes",
"No", "Yes", "No", "y", "nop", "Yes", "yes", "Yes", "No", "Yes",
"y", "Yes", "yes", "nop", "yes", "Yes", "nop", "yes", "Yes",
"y", "y", "Yes", "no", "y", "Yes", "nop", "y", "y", "y", "No",
"no", "y", "y", "Yes", "no")), class = "data.frame", row.names = c(NA,
-50L))
CodePudding user response:
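Here is an alternative approach: we could use str_detect() with a regex() pattern whose ignore_case argument is set to TRUE, wrapped in an ifelse() call. Any value containing a "y" (in either case) becomes "Yes"; everything else becomes "No".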
library(dplyr)
library(stringr)
df %>%
  mutate(has_elevator = ifelse(str_detect(has_elevator, regex('y', ignore_case = TRUE)), "Yes", "No"))
has_elevator
1 Yes
2 Yes
3 Yes
4 Yes
5 Yes
6 Yes
7 Yes
8 Yes
9 Yes
10 Yes
11 Yes
12 Yes
13 Yes
14 Yes
15 Yes
16 No
17 Yes
18 No
19 Yes
20 No
21 Yes
22 Yes
23 Yes
24 No
25 Yes
26 Yes
27 Yes
28 Yes
29 No
30 Yes
31 Yes
32 No
33 Yes
34 Yes
35 Yes
36 Yes
37 Yes
38 No
39 Yes
40 Yes
41 No
42 Yes
43 Yes
44 Yes
45 No
46 No
47 Yes
48 Yes
49 Yes
50 No
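One caveat with this approach: the unanchored pattern 'y' labels any string that contains a "y" as "Yes" and everything else as "No". That is fine for the data in the question, but if the column could hold other values, anchored patterns inside case_when() leave unrecognized entries untouched. A minimal sketch on a few hypothetical values (the names x, yes_pat and no_pat are illustrative and not from the question):

```r
library(dplyr)
library(stringr)

# Hypothetical values; "unknown" and "maybe" do not occur in the question's data
x <- c("Yes", "y", "nop", "unknown", "maybe", NA)

# Unanchored match: "maybe" contains a "y", and "unknown" falls through to "No"
loose <- ifelse(str_detect(x, regex("y", ignore_case = TRUE)), "Yes", "No")

# Anchored patterns: only whole-string matches are recoded
yes_pat <- regex("^y(es)?$", ignore_case = TRUE)
no_pat  <- regex("^no(p)?$", ignore_case = TRUE)

strict <- case_when(
  str_detect(x, yes_pat) ~ "Yes",
  str_detect(x, no_pat)  ~ "No",
  TRUE                   ~ x   # leave anything else (including NA) as it is
)

print(loose)
print(strict)
```

Comparing the two printed vectors shows which entries the loose version would silently relabel.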
CodePudding user response:
You can use case_when() within mutate() to recode your variable. I also noticed that some values are "no" rather than "No", so I recoded those as well.
# Your example data
df <- structure(list(has_elevator = c("Yes", "y", "y", "yes", "y",
"Yes", "yes", "y", "Yes", "yes", "yes", "Yes", "Yes", "y", "Yes",
"No", "Yes", "No", "y", "nop", "Yes", "yes", "Yes", "No", "Yes",
"y", "Yes", "yes", "nop", "yes", "Yes", "nop", "yes", "Yes",
"y", "y", "Yes", "no", "y", "Yes", "nop", "y", "y", "y", "No",
"no", "y", "y", "Yes", "no")), class = "data.frame", row.names = c(NA,
-50L))
Using case_when()
library(dplyr)
# Recode the lower-case variants; leave all other values unchanged
df_new <- df %>% mutate(
  has_elevator = case_when(
    has_elevator %in% c("y", "yes") ~ "Yes",
    has_elevator %in% c("nop", "no") ~ "No",
    TRUE ~ has_elevator
  )
)
df_new$has_elevator %>% table()
#> .
#> No Yes
#> 11 39
Using recode()
library(dplyr)
df_new <- df %>% mutate(
  has_elevator = recode(has_elevator, y = "Yes", yes = "Yes", nop = "No", no = "No")
)
df_new$has_elevator %>% table()
#> .
#> No Yes
#> 11 39
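As far as I know, dplyr 1.1.0 and later mark recode() as superseded in favour of case_match(), which takes old ~ new pairs like case_when() but matches the values directly. A minimal sketch, assuming that dplyr version is installed; df_small is an illustrative stand-in built from the distinct values in the question:

```r
library(dplyr)

# Small stand-in for the has_elevator column (values taken from the question)
df_small <- data.frame(has_elevator = c("Yes", "y", "yes", "No", "no", "nop"))

df_small_new <- df_small %>%
  mutate(
    has_elevator = case_match(
      has_elevator,
      c("y", "yes")  ~ "Yes",
      c("nop", "no") ~ "No",
      .default = has_elevator  # keep values that are already clean
    )
  )

print(df_small_new$has_elevator)
```

The .default argument plays the role of the TRUE ~ has_elevator fallback in the case_when() version.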
Combining string substitution with either function
Instead of listing every lower-case variant, you can capitalize the first letter of each value with a regular expression, so that "yes" becomes "Yes" and "no" becomes "No" without being recoded explicitly. Only the values that differ from the target by more than case ("y" and "nop") still need to be recoded.
The capitalization step uses base R's sub(), so it doesn't require the stringr package.
df_new <- df %>% mutate(
  has_elevator = case_when(
    has_elevator %in% c("y") ~ "Yes",
    has_elevator %in% c("nop") ~ "No",
    TRUE ~ has_elevator
  ),
  # Capitalize the first character; \\U requires perl = TRUE
  has_elevator = sub('^(\\w)', '\\U\\1', has_elevator, perl = TRUE)
)
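A quick way to check what the capitalization step does by itself is to run it on the distinct values from the question. The sketch below uses only base R; vals and capitalized are illustrative names:

```r
# Distinct values as they appear in the question's column
vals <- c("Yes", "y", "yes", "No", "no", "nop")

# \\U upper-cases the captured first character; it only works with perl = TRUE
capitalized <- sub("^(\\w)", "\\U\\1", vals, perl = TRUE)

print(capitalized)
```

Capitalizing alone turns "y" into "Y" and "nop" into "Nop", so those two values still need an explicit recode, while "yes" and "no" are handled by the substitution.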