I have a data frame that looks like this:
names |
---|
MARY123L |
MARYL123.00 |
MARYNLO |
MARYNLA |
JOHN330 |
JOHNNLA |
JOHN123A |
JOHN123n456.00 |
GEORGEJ |
GEORGEJ |
GEORGEJ |
GEORGENLA |
I want to create a new column, var, that checks each element of the names column and returns a word according to these conditions:
- if the value ends with a letter, return "table";
- if the value ends with a number, return "chair";
- if the value ends with "NLA" or "NLO", return "clothing".
Ideally, I want the new data frame to look like this:
names | var |
---|---|
MARY123L | table |
MARYL123.00 | chair |
MARYNLO | clothing |
MARYNLA | clothing |
JOHN330 | chair |
JOHNNLA | clothing |
JOHN123A | table |
JOHN123n456.00 | chair |
GEORGEJ | table |
GEORGEJ | table |
GEORGEJ | table |
GEORGENLA | clothing |
How can I do this in R using dplyr?
library(tidyverse)
names = c("MARY123L","MARYL123.00","MARYNLO","MARYNLA",
"JOHN330","JOHNNLA","JOHN123A","JOHN123n456.00","GEORGEJ","GEORGEJ","GEORGEJ","GEORGENLA")
DATA = tibble(names);DATA
CodePudding user response:
Essentially, the $ (ends with) metacharacter is what you are looking for.
DATA |>
  mutate(
    var = case_when(
      grepl("NLA$|NLO$", names) ~ "clothing",
      grepl("[0-9]$", names) ~ "chair",
      grepl("[[:alpha:]]$", names) ~ "table",
      TRUE ~ "Something has gone wrong - this should never appear"
    )
  )
# A tibble: 12 x 2
# names var
# <chr> <chr>
# 1 MARY123L table
# 2 MARYL123.00 chair
# 3 MARYNLO clothing
# 4 MARYNLA clothing
# 5 JOHN330 chair
# 6 JOHNNLA clothing
# 7 JOHN123A table
# 8 JOHN123n456.00 chair
# 9 GEORGEJ table
# 10 GEORGEJ table
# 11 GEORGEJ table
# 12 GEORGENLA clothing
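Note that the order of the conditions matters: a name ending in NLA or NLO also ends in a letter, so the more specific test has to come before the general one. Here is a minimal sketch of what happens when the order is swapped. The two-row toy data frame and the column names right_order and wrong_order are made up for illustration:

```r
library(dplyr)

# Hypothetical two-row input, used only to illustrate condition ordering
toy <- data.frame(names = c("MARYNLA", "GEORGEJ"))

result <- toy |>
  mutate(
    # Specific pattern first: NLA/NLO is matched before the general letter test
    right_order = case_when(
      grepl("NLA$|NLO$", names)    ~ "clothing",
      grepl("[[:alpha:]]$", names) ~ "table"
    ),
    # General pattern first: every name ending in a letter becomes "table",
    # so the NLA/NLO branch is never reached
    wrong_order = case_when(
      grepl("[[:alpha:]]$", names) ~ "table",
      grepl("NLA$|NLO$", names)    ~ "clothing"
    )
  )

result$right_order
# [1] "clothing" "table"
result$wrong_order
# [1] "table" "table"
```

case_when() returns the value for the first condition that is TRUE in each row, which is why "clothing" is listed first in the answer above.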
Difference between [[:alpha:]]$ and [a-zA-Z]$
I see another, pretty similar answer was posted at the same time. It uses [a-zA-Z]$ rather than [[:alpha:]]$, and the two may give different results depending on your locale. For example:
accented_sometimes <- c(
"This line ends with a letter",
"But this line ends with é"
)
grepl("[[:alpha:]]$", accented_sometimes)
# [1] TRUE TRUE
grepl("[a-zA-Z]$", accented_sometimes)
# [1] TRUE FALSE
There can also be differences between \\d and [0-9] - see here for more. I suspect this depends heavily on which version of R you are using: I am using 4.1 on Windows, which does not have native Unicode (UTF-8) support, but any later version, or the same version on Linux/Mac, does.
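As a rough illustration of the \\d versus [0-9] point, here is a small sketch using stringr, which is built on the ICU regex engine. As far as I know, ICU's \d matches any Unicode decimal digit, while [0-9] is limited to the ASCII digits. The second test string is made up for the example, and base grepl() may behave differently depending on the engine and platform:

```r
library(stringr)

# "\u0663" is ARABIC-INDIC DIGIT THREE, written as an escape so the
# example does not depend on the encoding of the source file
x <- c("JOHN330", "JOHN\u0663")

digit_class <- str_detect(x, "\\d$")    # Unicode-aware digit class (ICU)
ascii_range <- str_detect(x, "[0-9]$")  # ASCII digits 0 to 9 only

digit_class
# [1] TRUE TRUE
ascii_range
# [1]  TRUE FALSE
```

For data like that in the question, which contains only ASCII characters, both patterns give the same result.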
CodePudding user response:
You can do:
library(tidyverse)
# the tibble built in the question has a single column called names
DATA |>
  mutate(var = case_when(str_detect(names, "NLA$|NLO$") ~ "clothing",
                         str_detect(names, "\\d+$") ~ "chair",
                         str_detect(names, "[a-zA-Z]$") ~ "table"))
which gives:
# A tibble: 12 × 2
names var
<chr> <chr>
1 MARY123L table
2 MARYL123.00 chair
3 MARYNLO clothing
4 MARYNLA clothing
5 JOHN330 chair
6 JOHNNLA clothing
7 JOHN123A table
8 JOHN123n456.00 chair
9 GEORGEJ table
10 GEORGEJ table
11 GEORGEJ table
12 GEORGENLA clothing