How can I create a new column in R that returns specific values according to the initial values?

I have a data frame that looks like this:

names
MARY123L
MARYL123.00
MARYNLO
MARYNLA
JOHN330
JOHNNLA
JOHN123A
JOHN123n456.00
GEORGEJ
GEORGEJ
GEORGEJ
GEORGENLA

I want to create a new column, var, that checks each element of the names column and returns a word according to these conditions:

if the value in names ends with a letter, return "table",
if the value in names ends with a number, return "chair",
and if the value in names ends with "NLA" or "NLO", return "clothing".

Ideally i want the new data frame to look like this:

names	var
MARY123L	table
MARYL123.00	chair
MARYNLO	clothing
MARYNLA	clothing
JOHN330	chair
JOHNNLA	clothing
JOHN123A	table
JOHN123n456.00	chair
GEORGEJ	table
GEORGEJ	table
GEORGEJ	table
GEORGENLA	clothing

How can I do this in R using dplyr?

library(tidyverse)
names <- c("MARY123L", "MARYL123.00", "MARYNLO", "MARYNLA",
           "JOHN330", "JOHNNLA", "JOHN123A", "JOHN123n456.00",
           "GEORGEJ", "GEORGEJ", "GEORGEJ", "GEORGENLA")
DATA <- tibble(names)
DATA

CodePudding user response：

Essentially, the `$` metacharacter, which anchors a match to the end of the string, is what you are looking for.

DATA  |>
    mutate(
        var = case_when(
            grepl("NLA$|NLO$", names) ~ "clothing",
            grepl("[0-9]$", names) ~ "chair", 
            grepl("[[:alpha:]]$", names) ~ "table",
            TRUE ~ "Something has gone wrong - this should never appear"
        )
    )

# A tibble: 12 x 2
#    names          var     
#    <chr>          <chr>
#  1 MARY123L       table
#  2 MARYL123.00    chair
#  3 MARYNLO        clothing
#  4 MARYNLA        clothing
#  5 JOHN330        chair
#  6 JOHNNLA        clothing
#  7 JOHN123A       table
#  8 JOHN123n456.00 chair
#  9 GEORGEJ        table
# 10 GEORGEJ        table
# 11 GEORGEJ        table
# 12 GEORGENLA      clothing
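
The order of the conditions matters here: "MARYNLA" also ends with a letter, so the "clothing" test has to come before the generic letter test. The following is a minimal base R sketch of that point, using nested `ifelse()` in place of `case_when()`. The vector `example_names` and the two result names are made up for illustration and are not part of the answer above.

```r
# A few of the question's values, enough to show the effect of rule order
example_names <- c("MARY123L", "MARYNLA", "JOHN330", "MARYL123.00")

# Specific rule first: the "NLA"/"NLO" suffix is tested before the generic letter rule
specific_first <- ifelse(grepl("NL[AO]$", example_names), "clothing",
                  ifelse(grepl("[0-9]$", example_names), "chair",
                  ifelse(grepl("[[:alpha:]]$", example_names), "table", NA_character_)))

# Generic rule first: "MARYNLA" ends with a letter, so it never reaches "clothing"
generic_first <- ifelse(grepl("[[:alpha:]]$", example_names), "table",
                 ifelse(grepl("[0-9]$", example_names), "chair",
                 ifelse(grepl("NL[AO]$", example_names), "clothing", NA_character_)))

# Compare the two orderings side by side
data.frame(example_names, specific_first, generic_first)
```

With the specific rule first, "MARYNLA" is labelled "clothing"; with the generic rule first it is labelled "table". `case_when()` behaves the same way, since it uses the first condition that is TRUE for each row.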

Difference between `[[:alpha:]]$` and `[a-zA-Z]$`

I see that another, quite similar answer was posted at the same time. It uses `[a-zA-Z]$` where I use `[[:alpha:]]$`, and the two can give different results depending on your locale. For example:

accented_sometimes  <- c(
    "This line ends with a letter", 
    "But this line ends with é"
)

grepl("[[:alpha:]]$", accented_sometimes)
# [1] TRUE TRUE
grepl("[a-zA-Z]$", accented_sometimes)
# [1]  TRUE FALSE

There can also be differences between `\\d` and `[0-9]` (see here for more). I suspect this depends heavily on which version of R you are using: I am using 4.1 on Windows, which does not have Unicode support, whereas any later version, or the same version on Linux/Mac, does.
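
One way to check how your own setup treats these patterns is to run them against a string ending in a non-ASCII digit. This is only a sketch: the test vector `digits` is invented for illustration, and the results for its second element are deliberately not shown, because they may vary with the R version, platform, locale and regex engine.

```r
# "\u0663" is ARABIC-INDIC DIGIT THREE, a digit outside the ASCII range 0-9
digits <- c("room 3", "room \u0663")

ascii_range <- grepl("[0-9]$", digits)             # explicit ASCII range
posix_class <- grepl("[[:digit:]]$", digits)       # POSIX class, may follow the locale
escape_tre  <- grepl("\\d$", digits)               # \d with the default (TRE) engine
escape_pcre <- grepl("\\d$", digits, perl = TRUE)  # \d with the PCRE engine

# Compare the four patterns side by side on your own machine
data.frame(digits, ascii_range, posix_class, escape_tre, escape_pcre)
```

All four patterns match "room 3". If the columns disagree on the second row, that tells you which notation is safest for your data.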

CodePudding user response：

You can do:

library(tidyverse)
DATA |> 
  # `value` is the column name in this answer's tibble; use `names` with the question's DATA
  mutate(var = case_when(str_detect(value, "NLA$|NLO$") ~ "clothing",
                         str_detect(value, "\\d+$") ~ "chair",
                         str_detect(value, "[a-zA-Z]$") ~ "table"))

which gives:

# A tibble: 12 × 2
   value          var     
   <chr>          <chr>   
 1 MARY123L       table   
 2 MARYL123.00    chair   
 3 MARYNLO        clothing
 4 MARYNLA        clothing
 5 JOHN330        chair   
 6 JOHNNLA        clothing
 7 JOHN123A       table   
 8 JOHN123n456.00 chair   
 9 GEORGEJ        table   
10 GEORGEJ        table   
11 GEORGEJ        table   
12 GEORGENLA      clothing
