How can I recode several values based on another data frame in several variables in R?-CodePudding

I have a data set with many columns (DATA_OLD) in which I want to exchange all values based on an allocation list with many entries (KEY).

Every value in DATA_OLD should be replaced by its counterpart (can be seen in KEY) to create DATA_NEW.

For simplicity, the example here contains a short KEY and DATA_OLD set. In reality, there are >2500 rows in KEY and >100 columns in DATA_OLD. Therefore, an approach that can be applied to the whole data set simultaneously without calling each colname of DATA_OLD is important.

KEY:

old	new
1	1
3	2
7	3
12	4
55	5

Following this example, every value "1" should be replaced with another value "1". Every value "3" should be replaced with value "2". Every value "7" should be replaced with value "3".

DATA_OLD (START):

var1	var2	var3
NA	3	NA
NA	55	NA
1	NA	NA
NA	NA	NA
3	NA	NA
55	NA	12

DATA_NEW (RESULT):

var1	var2	var3
NA	2	NA
NA	5	NA
1	NA	NA
NA	NA	NA
2	NA	NA
5	NA	4

Here reproducible data:

KEY<-structure(list(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4, 
5)), class = "data.frame", row.names = c(NA, -5L))

DATA_OLD<-structure(list(var1 = c(NA, NA, 1, NA, 3, 55), var2 = c(3, 
55, NA, NA, NA, NA), var3 = c(1, NA, NA, NA, NA, 12)), class = "data.frame", row.names = c(NA, -6L))

DATA_NEW<-structure(list(var1 = c(NA, NA, 1, NA, 2, 5), var2 = c(2, 
5, NA, NA, NA, NA), var3 = c(1, NA, NA, NA, NA, 4)), class = "data.frame", row.names = c(NA, -6L))

I have tried back and forth, and it appears that I am completely clueless. Help would be greatly apprecciated! The real data set is quite large...

CodePudding user response：

1) Base R Be careful here since some solutions have the side effect of converting the numeric columns to character or factor or the data frame to something else. A solution using match will generally work. The result of lapply is a list so convert back to data frame.

DATA_OLD |>
  lapply(function(x) with(KEY, new[match(x, old)])) |>
  as.data.frame()

DATA_NEW <- DATA_OLD
DATA_NEW[] <- lapply(DATA_OLD, function(x) with(KEY, new[match(x, old)]))

This last one is easy to convert to act only on some columns

DATA_NEW <- DATA_OLD
ix <- 1:2 # only convert these columns
DATA_NEW[ix] <- lapply(DATA_OLD[ix], function(x) with(KEY, new[match(x, old)]))

2) purrr Alternately use map_dfr which returns a data frame directly:

library(purrr)
map_dfr(DATA_OLD, ~ with(KEY, new[match(.x, old)]))

3) dplyr A dplyr solution using across is the following. If there were some non-numeric columns that should not be converted then replace everything() with where(is.numeric)

library(purrr)
DATA_OLD %>%
  mutate(across(everything(), ~ with(KEY, new[match(.x, old)])))

CodePudding user response：

The simplest way to implement a dictionary in R is a named array, where you can use the names as indices:

key <- setNames(KEY$new, KEY$old)
> key
 1  3  7 12 55 
 1  2  3  4  5

The only thing to be mindful of is that the indexing must by done by character, rather than integer:

> key[3]
7 
3  # WRONG! This is the 3rd item!
> key["3"]
3 
2  # RIGHT! This is the item named "3"

Then you can apply the transformation column-wise. This turns the data into a matrix, but you can simply turn it back.

as.data.frame(apply(DATA_OLD, 2, \(col) key[as.character(col)]))
  var1 var2 var3
1   NA    2    1
2   NA    5   NA
3    1   NA   NA
4   NA   NA   NA
5    2   NA   NA
6    5   NA    4