Home > Blockchain >  Creating new numeric variable in R based on specific strings from character variable
Creating new numeric variable in R based on specific strings from character variable

Time:10-06

I have a character variable in a data frame, and would like to create a new numeric variable in that data frame based on the existence of specific strings in that character variable.

I've figured out one way to do this, but was wondering if there was a quicker, or more elegant approach than mine here.

Here is my solution, using ifelse and substr:

First, the values of each observation of my character variable is a random series of numbers and letters (from 0-9, and A-Z). I will create an example variable below:

library(stringi)
set.seed(100)
my.df<-data.frame("V1"=sprintf("%s%s%s", stri_rand_strings(10, 5, c('[A-B]','[0-2]')),
    stri_rand_strings(10, 4, c('[0-9]','[A-J]')), stri_rand_strings(10, 1, '[A-Z]')))

In my actual data, as I noted above, the first element of each observation can be any of the number 0-9, OR letters A-Z.

Now, what I've done is create a new variable [value] that takes on specific values (no pattern) depending upon what the first element of each observation in the V1 variable is. So, if the first element is the letter "A", the value of corresponding observation (row) in my new variable--V2--is 3, and if the first element of the V1 variable is "B" the value of the corresponding observation in my new variable V2 is 12, etc.

Here is how I have chosen to do this. It's unwieldy as my real data would need ifelse clauses for all numerals 0-9 and all letters A-Z, which would be about 35 lines or so. Here is my code:

my.df$value<-ifelse(substr(my.df$V1,1,1)=="A",3,
                    ifelse(substr(my.df$V1,1,1)=="B",12,
                           ifelse(substr(my.df$V1,1,1)=="0",44,
                                           ifelse(substr(my.df$V1,1,1)=="1",6,27))))

This yields:

        V1 value
'AABAA3122X'     1
'12110FCBCF'     6
'BBAAB5246J'    12
'20112JGEDL'    27
'BBBBA4426X'    12
'02210EDFJK'    44
'ABABB6687N'     1
'20120IAEFD'    27
'ABBBB9905A'     1
'12200HCDHU'     6

Is there a way to do this without using so many lines of code?

Thanks!

CodePudding user response:

Consider either a join with key/val data or a named vector

my.df$value <- with(my.df, setNames(c(3, 12, 44, 6),
          c("A", "B", "0", "1"))[substr(V1, 1, 1)])
my.df$value[is.na(my.df$value)] <- "27"

-output

> my.df
           V1 value
1  AABAA3122X     3
2  12110FCBCF     6
3  BBAAB5246J    12
4  20112JGEDL    27
5  BBBBA4426X    12
6  02210EDFJK    44
7  ABABB6687N     3
8  20120IAEFD    27
9  ABBBB9905A     3
10 12200HCDHU     6

CodePudding user response:

Two potential approaches based on: Canonical tidyverse method to update some values of a vector from a look-up table

library(tidyverse)
library(stringi)
library(data.table)

set.seed(100)
my.df <- data.frame("V1" = sprintf("%s%s%s",
                                   stri_rand_strings(10, 5, c('[A-B]','[0-2]')),
                                   stri_rand_strings(10, 4, c('[0-9]','[A-J]')),
                                   stri_rand_strings(10, 1, '[A-Z]')))

df <- my.df %>%
  mutate(Value = substr(V1, 1, 1))

unique(df$Value)
#> [1] "A" "1" "B" "2" "0"
lookup <- data.frame(old = c("A", 1, "B", 2, 0),
                     new = c(3, 6, 12, 27, 44))

for (i in seq_len(nrow(lookup))) {
    df$Value[df$Value == lookup$old[i]] = lookup$new[i]
}
df
#>            V1 Value
#> 1  AABAA3122X     3
#> 2  12110FCBCF     6
#> 3  BBAAB5246J    12
#> 4  20112JGEDL    27
#> 5  BBBBA4426X    12
#> 6  02210EDFJK    44
#> 7  ABABB6687N     3
#> 8  20120IAEFD    27
#> 9  ABBBB9905A     3
#> 10 12200HCDHU     6


# data.table method (fastest but doesn't retain original order of V1)
df <- my.df %>%
  mutate(Value = substr(V1, 1, 1))

setDT(df)
setDT(lookup)
setkey(df, Value)
setkey(lookup, old)

df[lookup, Value:=new, on=.(Value=old)]
df
#>             V1 Value
#>  1: 02210EDFJK    44
#>  2: 12110FCBCF     6
#>  3: 12200HCDHU     6
#>  4: 20112JGEDL    27
#>  5: 20120IAEFD    27
#>  6: AABAA3122X     3
#>  7: ABABB6687N     3
#>  8: ABBBB9905A     3
#>  9: BBAAB5246J    12
#> 10: BBBBA4426X    12

Created on 2021-10-06 by the reprex package (v2.0.1)

  • Related