How to map a data.frame with codes to strings based on dictionary in R-CodePudding

mydata <- data.frame(id = 1:5,
                     item1 = c("", "", "1222", "1222", ""),
                     item2 = c("13", "", "", "", "382"))
> mydata
  id item1 item2
1  1          13
2  2            
3  3  1222      
4  4   987      
5  5         382

I have a dataset where it contains different codes. I want to map these codes to strings based on a dictionary

dictionary <- data.frame(code = c(1, 13, 382, 987, 1222),
                entry = c("ballet", "soccer", "basketball", "painting", "pottery"))
> dictionary
  code      entry
1    1     ballet
2   13     soccer
3  382 basketball
4  987   painting
5 1222    pottery

The desired output is a data.frame with strings:

  id   item1      item2
1  1             soccer
2  2                   
3  3 pottery           
4  4 pottery           
5  5         basketball

CodePudding user response：

mydata <- data.frame(id = 1:5,
                     item1 = c("", "", "1222", "1222", ""),
                     item2 = c("13", "", "", "", "382"))

dictionary <- data.frame(code = c(1, 13, 382, 987, 1222),
                entry = c("ballet", "soccer", "basketball", "painting", "pottery"))

mydata$item1 <- ifelse(mydata$item1 %in% dictionary$code, 
                           dictionary$entry[match( mydata$item1,dictionary$code)], "")

mydata$item2 <- ifelse(mydata$item2 %in% dictionary$code, 
                           dictionary$entry[match( mydata$item2,dictionary$code)], "")

mydata
#>   id   item1      item2
#> 1  1             soccer
#> 2  2                   
#> 3  3 pottery           
#> 4  4 pottery           
#> 5  5         basketball

^{Created on 2022-12-15 with reprex v2.0.2}

NOTE:

Of course, a canonical way to do this is to use factors (and NA's instead of empty strings):

factor(mydata$item1, dictionary$code, dictionary$entry)
#  <NA>    <NA>    pottery pottery <NA>

CodePudding user response：

May be a bit easier using a named vector as your dictionary. Using dplyr:

library(dplyr)

dictionary_vec <- setNames(dictionary$entry, dictionary$code)

mydata %>% 
  mutate(across(!id, ~ replace_na(dictionary_vec[.x], "")))

Or in base R:

dictionary_vec <- setNames(dictionary$entry, dictionary$code)

for (cn in colnames(mydata)[-1]) {
  mydata[[cn]] <- dictionary_vec[mydata[[cn]]]
}

mydata[is.na(mydata)] <- ""

Result from either approach:

  id   item1      item2
1  1             soccer
2  2                   
3  3 pottery           
4  4 pottery           
5  5         basketball

CodePudding user response：

This works pretty well.


mydata <- data.frame(id = 1:5,
                     item1 = c("", "", "1222", "1222", ""),
                     item2 = c("13", "", "", "", "382"))

dictionary <- data.frame(code = c(1, 13, 382, 987, 1222),
                         entry = c("ballet", "soccer", "basketball", "painting", "pottery"))

mydata$item1[mydata$item1 %in% dictionary$code]<-dictionary$entry[dictionary$code%in%mydata$item1]

mydata$item2[mydata$item2 %in% dictionary$code]<-dictionary$entry[dictionary$code%in%mydata$item2]

mydata

  id   item1      item2
1  1             soccer
2  2                   
3  3 pottery           
4  4 pottery           
5  5         basketball

CodePudding user response：

Elaborating on the "canonical way" using factors.

mydata[-1] <- lapply(mydata[-1], factor, levels=dictionary$code, labels=dictionary$entry) |> 
  lapply(droplevels)
mydata
#   id   item1      item2
# 1  1    <NA>     soccer
# 2  2    <NA>       <NA>
# 3  3 pottery       <NA>
# 4  4 pottery       <NA>
# 5  5    <NA> basketball

The |> lapply(droplevels) cares for having just the levels that appear in each column. If you want character just replace it with |> lapply(as.character)