I have a dataframe that looks like the following:

ID V1 V2 V3 V4 V5
1  a  6  3  5   3
2  c  4  1  2   1
3  g  8  2  4   2
4  h  7  9  8   1
5  a  4  6  2   1
6  b  4  2  1   2
7  j  8  7  1   4

I need to create a new dummy variable and add it to this dataframe as column "V6". I need to do it based on a matrix from an external spreadsheet such as the following:

V1 1 2 3 4 5 6 7 8 9
a  1 1   1 1 1   
b  1     1 1 1 1 1 1
c                  1
d  1 1 1 1 1 1 1 1 1
g  1 1 
h    1             1
i          1 1     1
j  
k  1 1 1 1 1

In the above matrix, the V1 column holds the values of the V1 variable in the original dataframe, and the other columns correspond to possible values of the V5 variable. All the empty cells are blank in the spreadsheet. I need the new dummy variable, V6, to be 1 if the matrix has a 1 at the intersection of the row's V1 and V5 values, and 0 otherwise. The result would therefore be the following:

ID V1 V2 V3 V4 V5 V6
1  a  6  3  5   3  0
2  c  4  1  2   1  0 
3  g  8  2  4   2  1
4  h  7  9  8   1  0
5  a  4  6  2   1  1
6  b  4  2  1   2  0
7  j  8  7  1   4  0

ID 1 is a 0 for the V6 variable because its V1 is a and its V5 is 3, and in the matrix a and 3 intersect at a blank (treated as 0). Conversely, the third row generates a 1, because its V1 is g and its V5 is 2. That intersection in the matrix, g-2, is a 1, so V6 for that combination is a "hit", or a 1 in the dummy variable.

I recognize this is an odd method of dummy variable creation, but what is the most efficient way to use an externally created spreadsheet like this to create dummy variables based on the intersection of values? What would be a flexible way to code this, so that it could be adapted depending on whether the variables are character or numeric?

Answer:

I think it's best to approach this by pivoting/reshaping df2 (the 1s and blanks) to long format and joining it to df1 (the original data). Both are defined under Data at the end.

Note: it isn't clear whether the blanks in your df2 are empty strings or NA values. The code here assumes empty strings. If they are NA, replace nzchar(V6) with !is.na(V6); to cover both possibilities, use !V6 %in% c(NA, "").
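
Which of the two cases you have usually depends on how the spreadsheet was imported. As a rough sketch (the inline CSV text is a made-up stand-in for your file, and is_hit is a hypothetical helper name, not a function from any package), base R's read.csv turns blank cells into NA when it guesses a numeric column and keeps them as "" when told to read text. A small predicate can cover both:

```r
# Stand-in for a CSV export of the spreadsheet (assumption: made-up contents).
csv_text <- "V1,1,2\na,1,\nc,,\ng,1,1"

# Default type guessing: the marker columns become integer, blanks become NA.
as_num <- read.csv(text = csv_text, check.names = FALSE)

# Forcing character columns keeps blanks as empty strings.
as_text <- read.csv(text = csv_text, check.names = FALSE,
                    colClasses = "character")

# Hypothetical helper: TRUE for a marked cell, FALSE for NA, "" or whitespace.
is_hit <- function(x) {
  x <- trimws(as.character(x))
  !is.na(x) & nzchar(x)
}

is_hit(as_num[["2"]])
is_hit(as_text[["2"]])
```

Both calls should flag only the last row, so subset(is_hit(V6)) or filter(is_hit(V6)) could stand in for the nzchar(V6) step whichever way the file was read.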

base R (plus reshape2::melt; the |> pipe needs R 4.1 or later)

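# reshape the lookup to long format: one row per (V1, V5) cell
out <- reshape2::melt(df2, "V1", variable.name = "V5", value.name = "V6") |>
  # keep only the marked cells
  subset(nzchar(V6)) |>
  # all.y = TRUE keeps every row of df1, matched or not
  merge(df1, by = c("V1", "V5"), all.y = TRUE) |>
  # unmatched rows have NA in V6; the unary + turns TRUE/FALSE into 1/0
  transform(V6 = +(!is.na(V6)))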
out
#   V1 V5 V6 ID V2 V3 V4
# 1  a  1  1  5  4  6  2
# 2  a  3  0  1  6  3  5
# 3  b  2  0  6  4  2  1
# 4  c  1  0  2  4  1  2
# 5  g  2  1  3  8  2  4
# 6  h  1  0  4  7  9  8
# 7  j  4  0  7  8  7  1

The rows and columns are now out of order, but the original order is easy to restore:

out <- out[order(out$ID), c("ID", sort(setdiff(names(out), "ID")))]
out
#   ID V1 V2 V3 V4 V5 V6
# 2  1  a  6  3  5  3  0
# 4  2  c  4  1  2  1  0
# 5  3  g  8  2  4  2  1
# 6  4  h  7  9  8  1  0
# 1  5  a  4  6  2  1  1
# 3  6  b  4  2  1  2  0
# 7  7  j  8  7  1  4  0

dplyr/tidyr

library(dplyr)
library(tidyr) # pivot_longer
df2 %>%
  pivot_longer(-V1, names_to = "V5", values_to = "V6") %>%
  filter(nzchar(V6)) %>%
  # dplyr requires the join columns to have compatible types, but V5 here
  # comes from the column names of `df2`, which are always character
  mutate(V5 = as.integer(V5)) %>%
  left_join(df1, ., by = c("V1", "V5")) %>%
  # unmatched rows have NA in V6; the unary + turns TRUE/FALSE into 1/0
  mutate(V6 = +(!is.na(V6)))
#   ID V1 V2 V3 V4 V5 V6
# 1  1  a  6  3  5  3  0
# 2  2  c  4  1  2  1  0
# 3  3  g  8  2  4  2  1
# 4  4  h  7  9  8  1  0
# 5  5  a  4  6  2  1  1
# 6  6  b  4  2  1  2  0
# 7  7  j  8  7  1  4  0
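
If you would rather skip the reshape and join, the spreadsheet can also be treated as a lookup matrix and indexed directly. This is a different technique from the two above, sketched here on a trimmed, made-up copy of the data so that it runs on its own (it assumes blanks are empty strings). Converting V5 with as.character() before matching means the same code should work whether V5 is numeric or character, and (V1, V5) pairs that do not appear in the matrix come back as 0:

```r
# Trimmed stand-in data (assumption: not the full example from the question).
df1 <- data.frame(ID = 1:4, V1 = c("a", "c", "g", "j"), V5 = c(3L, 1L, 2L, 4L))
df2 <- data.frame(V1 = c("a", "c", "g"),
                  `1` = c("1", "", "1"),
                  `2` = c("1", "", "1"),
                  `3` = c("", "", ""),
                  check.names = FALSE, stringsAsFactors = FALSE)

# Character matrix of markers, with V1 values as row names.
lookup <- as.matrix(df2[-1])
rownames(lookup) <- df2$V1

# Row and column positions for every row of df1 (NA when there is no match).
i <- match(df1$V1, rownames(lookup))
j <- match(as.character(df1$V5), colnames(lookup))

# Indexing with a two-column matrix picks one cell per row of df1;
# index rows that contain NA yield NA.
cell <- lookup[cbind(i, j)]

# A hit is a non-missing, non-empty cell.
df1$V6 <- as.integer(!is.na(cell) & nzchar(cell))
df1
```

Because df1 is never merged, its row and column order is left untouched, so no re-sorting step is needed afterwards.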

Data

df1 <- structure(list(ID = 1:7, V1 = c("a", "c", "g", "h", "a", "b", "j"), V2 = c(6L, 4L, 8L, 7L, 4L, 4L, 8L), V3 = c(3L, 1L, 2L, 9L, 6L, 2L, 7L), V4 = c(5L, 2L, 4L, 8L, 2L, 1L, 1L), V5 = c(3L, 1L, 2L, 1L, 1L, 2L, 4L)), class = "data.frame", row.names = c(NA, -7L))
df2 <- structure(list(V1 = c("a", "b", "c", "d", "g", "h", "i", "j", "k"), "1" = c("1", "1", "", "1", "1", "", "", "", "1"), "2" = c("1", "", "", "1", "1", "1", "", "", "1"), "3" = c("", "", "", "1", "", "", "", "", "1"), "4" = c("1", "1", "", "1", "", "", "", "", "1"), "5" = c("1", "1", "", "1", "", "", "1", "", "1"), "6" = c("1", "1", "", "1", "", "", "1", "", ""), "7" = c("", "1", "", "1", "", "", "", "", ""), "8" = c("", "1", "", "1", "", "", "", "", ""), "9" = c("", "1", "1", "1", "", "1", "1", "",  "")), row.names = c(NA, -9L), class = "data.frame")