Assign a value based on the first letter of each word in another column

I want to create a second column (firstletter) holding a numeric value that depends on the first letter of the word in the first column (catname). In the sample dataset, catname holds a list of cats' names, and I want to assign 1 to names starting with A, 2 to names starting with B, 3 to C, and so forth up to Z.

df <- data.frame(catname=c("Ave", "Ares", "Aze", "Bill", "Buz", "Chris", "Chase", "Charlie", "Coco"))

At the moment, I can only think of doing this using case_when() function, e.g.,

library(dplyr)
library(stringr) # str_starts
df %>% mutate(firstletter = case_when(str_starts(catname, "A") ~ 1,
                                      str_starts(catname, "B") ~ 2,
                                      str_starts(catname, "C") ~ 3))

The outcome I am hoping for is:

| catname  | firstletter    |
| -------- | -------------- |
| Ave      | 1              |
| Ares     | 1              |
| Aze      | 1              |
| Bill     | 2              |
| Buz      | 2              |
| Chris    | 3              |
| Chase    | 3              |
| Charlie  | 3              |
| Coco     | 3              |

I would appreciate any insight into other ways to approach this problem.

CodePudding user response：

You can also use data.table::rleid(), which numbers consecutive runs of identical values. This works when the names are already sorted by first letter:

library(data.table)
library(dplyr)
library(stringr) # str_sub

data.frame(catname=c("Ave", "Ares", "Aze", "Bill", "Buz", "Chris", "Chase", "Charlie", "Coco")) %>% 
  mutate(firstletter=rleid(str_sub(catname,1,1)))

Created on 2023-01-31 with reprex v2.0.2

  catname firstletter
1     Ave           1
2    Ares           1
3     Aze           1
4    Bill           2
5     Buz           2
6   Chris           3
7   Chase           3
8 Charlie           3
9    Coco           3
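
One caveat worth keeping in mind: rleid() starts a new id whenever the value changes, so it reproduces the A = 1, B = 2, ... scheme only when the names are grouped and ordered by first letter and no letter is skipped. A minimal sketch, assuming a made-up unsorted vector (`unsorted` is hypothetical and not part of the question's data):

```r
library(data.table)

# Hypothetical unsorted names, not from the question's data
unsorted <- c("Bill", "Ave", "Buz", "Coco")
first <- substr(unsorted, 1, 1)

# rleid() assigns a new id every time the value changes from the previous one
run_ids <- rleid(first)

# match() against LETTERS gives the alphabet position regardless of row order
alpha_ids <- match(first, LETTERS)

print(run_ids)    # 1 2 3 4
print(alpha_ids)  # 2 1 2 3
```

If the rows cannot be sorted first, a lookup against LETTERS is probably the safer choice.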

CodePudding user response：

There is also the new consecutive_id() in dplyr 1.1.0:

library(dplyr) # requires dplyr >= 1.1.0
df %>% 
  mutate(firstletter = consecutive_id(substr(catname, 1, 1)))

Also new is case_match(), which avoids the repetition of case_when() and allows more flexibility:

df %>% 
  mutate(firstletter = case_match(substr(catname, 1, 1),
                                  "A" ~ 1,
                                  "B" ~ 2,
                                  "C" ~ 3))

CodePudding user response：

You can take the first character and then match it against the built-in LETTERS vector if you want the values to always be 1...26, even when some letters are missing:

df %>% mutate(first=match(substr(catname,1,1), LETTERS))

If you only want numbers for the letters that actually occur, you can convert to a factor and take its integer codes:

df %>% mutate(first=as.numeric(factor(substr(catname,1,1))))
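
The difference between the two variants shows up once a letter is missing. A small sketch in base R, assuming a hypothetical set of names with a gap and one lowercase entry (the `toupper()` call is an addition for illustration, not part of the answer above):

```r
# Hypothetical names: no "B", and one name in lowercase
catname <- c("Ave", "Chris", "dora", "Zed")

# Normalise case so "d" is treated like "D" (added assumption)
first <- toupper(substr(catname, 1, 1))

# Position in the alphabet: gaps are preserved
alphabet_pos <- match(first, LETTERS)

# Rank among the letters actually present: gaps are closed
observed_rank <- as.numeric(factor(first))

print(alphabet_pos)   # 1 3 4 26
print(observed_rank)  # 1 2 3 4
```

Note that match() returns NA for a first character that is not in LETTERS, such as a digit.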

CodePudding user response：

Here is an alternative approach that uses which() to find the position of the first letter in LETTERS:

library(dplyr)

df %>% 
  rowwise() %>% 
  mutate(firstletter = which(LETTERS == substr(catname,1,1)))

  catname firstletter
  <chr>         <int>
1 Ave               1
2 Ares              1
3 Aze               1
4 Bill              2
5 Buz               2
6 Chris             3
7 Chase             3
8 Charlie           3
9 Coco              3

CodePudding user response：

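Put your new codes in a named vector and use match(). Note that match() returns the position of each first letter in the vector, not its name; here the positions happen to equal the names.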

> match(substr(df$catname,1,1),c("1"="A","2"="B","3"="C"))
[1] 1 1 1 2 2 3 3 3 3
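
If the codes are not simply 1, 2, 3, one option is to reverse the roles: name the vector by first letter and index into it. This is a sketch under assumptions; the `codes` values below are hypothetical and chosen only to show that arbitrary values work:

```r
df <- data.frame(catname = c("Ave", "Ares", "Aze", "Bill", "Buz",
                             "Chris", "Chase", "Charlie", "Coco"))

# Hypothetical lookup: names are first letters, values are the codes to assign
codes <- c(A = 10, B = 20, C = 30)

# Indexing by name returns the code for each first letter (NA if not listed)
df$firstletter <- unname(codes[substr(df$catname, 1, 1)])

print(df)
```

Indexing by name keeps the mapping explicit, so reordering the entries of `codes` does not change the result.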