I have a character variable in a data frame, and would like to create a new numeric variable in that data frame based on the existence of specific strings in that character variable.
I've figured out one way to do this, but was wondering if there was a quicker, or more elegant approach than mine here.
Here is my solution, using ifelse and substr:
First, the values of each observation of my character variable is a random series of numbers and letters (from 0-9, and A-Z). I will create an example variable below:
library(stringi)
set.seed(100)
my.df<-data.frame("V1"=sprintf("%s%s%s", stri_rand_strings(10, 5, c('[A-B]','[0-2]')),
stri_rand_strings(10, 4, c('[0-9]','[A-J]')), stri_rand_strings(10, 1, '[A-Z]')))
In my actual data, as I noted above, the first element of each observation can be any of the number 0-9, OR letters A-Z.
Now, what I've done is create a new variable [value] that takes on specific values (no pattern) depending upon what the first element of each observation in the V1 variable is. So, if the first element is the letter "A", the value of corresponding observation (row) in my new variable--V2--is 3, and if the first element of the V1 variable is "B" the value of the corresponding observation in my new variable V2 is 12, etc.
Here is how I have chosen to do this. It's unwieldy as my real data would need ifelse clauses for all numerals 0-9 and all letters A-Z, which would be about 35 lines or so. Here is my code:
my.df$value<-ifelse(substr(my.df$V1,1,1)=="A",3,
ifelse(substr(my.df$V1,1,1)=="B",12,
ifelse(substr(my.df$V1,1,1)=="0",44,
ifelse(substr(my.df$V1,1,1)=="1",6,27))))
This yields:
V1 value
'AABAA3122X' 1
'12110FCBCF' 6
'BBAAB5246J' 12
'20112JGEDL' 27
'BBBBA4426X' 12
'02210EDFJK' 44
'ABABB6687N' 1
'20120IAEFD' 27
'ABBBB9905A' 1
'12200HCDHU' 6
Is there a way to do this without using so many lines of code?
Thanks!
CodePudding user response:
Consider either a join with key/val data or a named vector
my.df$value <- with(my.df, setNames(c(3, 12, 44, 6),
c("A", "B", "0", "1"))[substr(V1, 1, 1)])
my.df$value[is.na(my.df$value)] <- "27"
-output
> my.df
V1 value
1 AABAA3122X 3
2 12110FCBCF 6
3 BBAAB5246J 12
4 20112JGEDL 27
5 BBBBA4426X 12
6 02210EDFJK 44
7 ABABB6687N 3
8 20120IAEFD 27
9 ABBBB9905A 3
10 12200HCDHU 6
CodePudding user response:
Two potential approaches based on: Canonical tidyverse method to update some values of a vector from a look-up table
library(tidyverse)
library(stringi)
library(data.table)
set.seed(100)
my.df <- data.frame("V1" = sprintf("%s%s%s",
stri_rand_strings(10, 5, c('[A-B]','[0-2]')),
stri_rand_strings(10, 4, c('[0-9]','[A-J]')),
stri_rand_strings(10, 1, '[A-Z]')))
df <- my.df %>%
mutate(Value = substr(V1, 1, 1))
unique(df$Value)
#> [1] "A" "1" "B" "2" "0"
lookup <- data.frame(old = c("A", 1, "B", 2, 0),
new = c(3, 6, 12, 27, 44))
for (i in seq_len(nrow(lookup))) {
df$Value[df$Value == lookup$old[i]] = lookup$new[i]
}
df
#> V1 Value
#> 1 AABAA3122X 3
#> 2 12110FCBCF 6
#> 3 BBAAB5246J 12
#> 4 20112JGEDL 27
#> 5 BBBBA4426X 12
#> 6 02210EDFJK 44
#> 7 ABABB6687N 3
#> 8 20120IAEFD 27
#> 9 ABBBB9905A 3
#> 10 12200HCDHU 6
# data.table method (fastest but doesn't retain original order of V1)
df <- my.df %>%
mutate(Value = substr(V1, 1, 1))
setDT(df)
setDT(lookup)
setkey(df, Value)
setkey(lookup, old)
df[lookup, Value:=new, on=.(Value=old)]
df
#> V1 Value
#> 1: 02210EDFJK 44
#> 2: 12110FCBCF 6
#> 3: 12200HCDHU 6
#> 4: 20112JGEDL 27
#> 5: 20120IAEFD 27
#> 6: AABAA3122X 3
#> 7: ABABB6687N 3
#> 8: ABBBB9905A 3
#> 9: BBAAB5246J 12
#> 10: BBBBA4426X 12
Created on 2021-10-06 by the reprex package (v2.0.1)