Join many to one: combine related characteristics

I have a dataframe where each row represents a spatial unit. The nbid* variables indicate which units are its neighbours. I would like to get the dum variable of each neighbour into the main dataframe. (Instead of spatial units it could be any kind of relation within a dataframe - business partners, relatives, related genes, etc.) Some simplified data look like this:

library(dplyr)

set.seed(999)
df_base <- data.frame(id = 1:100,
                 dum = sample(c(rep(0, 50), rep(1, 50)), 100),
                 nbid_1 = sample(1:100, 100),
                 nbid_2 = sample(1:100, 100),
                 nbid_3 = sample(1:100, 100)) %>% 
  mutate(nbid_1 = replace(nbid_1, sample(row_number(), size = ceiling(0.1 * n()), replace = FALSE), NA),
         nbid_2 = replace(nbid_2, sample(row_number(), size = ceiling(0.3 * n()), replace = FALSE), NA),
         nbid_3 = replace(nbid_3, sample(row_number(), size = ceiling(0.7 * n()), replace = FALSE), NA))

(In these simplified data, unlike in the real data, neighbours 1, 2 and 3 can coincide, but that does not matter for the question.)

My approach was to duplicate and then join the data, which would look like this:

df1 <- df_base
# lookup table: just each unit's id and its dummy, renamed to avoid a name clash
df2 <- df_base %>% 
  select(-c(nbid_1, nbid_2, nbid_3)) %>% 
  rename(nbdum = dum)

# one left join per neighbour column
df <- left_join(df1, df2, by = c("nbid_1" = "id")) %>% 
  rename(nbdum1 = nbdum) %>% 
  left_join(df2, by = c("nbid_2" = "id")) %>% 
  rename(nbdum2 = nbdum) %>% 
  left_join(df2, by = c("nbid_3" = "id")) %>% 
  rename(nbdum3 = nbdum)

df is the result I am looking for - from here I can create an overall neighbour dummy or a count. This approach is, however, neither elegant nor feasible for the real data, which has many more neighbours.
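For example, an overall neighbour dummy or a count could be built along these lines (a sketch; nb_count and nb_any are just illustrative names):

df %>%
  mutate(nb_count = rowSums(across(starts_with("nbdum")), na.rm = TRUE),  # how many neighbours have dum == 1
         nb_any   = as.integer(nb_count > 0))                             # 1 if any neighbour has dum == 1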

How can I solve this in a less clumsy way?

Thanks in advance for your ideas!!

CodePudding user response:

A key clue is that when you see var_1, var_2, ..., var_n, it suggests the data can be reshaped to a longer format. See tidyr::pivot_longer() or data.table::melt(); reshaping to long ("molten") data is discussed in many places.

For your example, we can pivot to long format and then join the df2 table back on. I am unsure whether the wide format is needed, but after the join we can pivot back to wide with pivot_wider().

library(dplyr)
library(tidyr)


df1 %>%
  select(!id) %>%
  # reshape the neighbour id columns to long format: one row per (unit, neighbour slot)
  pivot_longer(cols = starts_with("nbid"), names_prefix = "nbid_") %>%
  mutate(original_id = rep(1:100, each = 3)) %>%
  # attach the neighbour's dummy by matching its id
  left_join(df2, by = c("value" = "id")) %>%
  # back to wide: one row per unit, with the identifier column named explicitly
  pivot_wider(id_cols = original_id, values_from = c(value, nbdum))

#> # A tibble: 100 × 7
#>    original_id value_1 value_2 value_3 nbdum_1 nbdum_2 nbdum_3
#>          <int>   <int>   <int>   <int>   <dbl>   <dbl>   <dbl>
#>  1           1      25      90      23       0       0       1
#>  2           2      12      NA      NA       1      NA      NA
#>  3           3      11      40      47       0       0       0
#>  4           4      94      87      NA       0       1      NA
#>  5           5      46      77      NA       1       0      NA
#>  6           6      98      82      NA       1       0      NA
#>  7           7      43      NA      NA       1      NA      NA
#>  8           8      74      NA       7       0      NA       1
#>  9           9      57      NA      NA       1      NA      NA
#> 10          10      49      72      NA       0       0      NA
#> # … with 90 more rows

## compare to original

as_tibble(df)
#> # A tibble: 100 × 8
#>       id   dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
#>    <int> <dbl>  <int>  <int>  <int>  <dbl>  <dbl>  <dbl>
#>  1     1     0     25     90     23      0      0      1
#>  2     2     1     12     NA     NA      1     NA     NA
#>  3     3     1     11     40     47      0      0      0
#>  4     4     1     94     87     NA      0      1     NA
#>  5     5     0     46     77     NA      1      0     NA
#>  6     6     1     98     82     NA      1      0     NA
#>  7     7     1     43     NA     NA      1     NA     NA
#>  8     8     0     74     NA      7      0     NA      1
#>  9     9     0     57     NA     NA      1     NA     NA
#> 10    10     0     49     72     NA      0      0     NA
#> # … with 90 more rows
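If the hard-coded rep(1:100, each = 3) is a concern, a variant of the same idea (a sketch, using the same packages as above) keeps id through the pivot and uses it directly as the identifier:

df1 %>%
  pivot_longer(cols = starts_with("nbid"), names_prefix = "nbid_") %>%
  left_join(df2, by = c("value" = "id")) %>%
  pivot_wider(id_cols = c(id, dum), names_from = name,
              values_from = c(value, nbdum))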

CodePudding user response:

As you just seem to be indexing dum with your neighbor variables, you should be able to do:

library(dplyr)

df_base %>%
  # index dum by each neighbour id: dum[.x] is NA wherever the neighbour id is NA
  mutate(across(starts_with("nbid"), ~ dum[.x], .names = "nbdum{1:3}"))

     id dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
1     1   0     25     90     23      0      0      1
2     2   1     12     NA     NA      1     NA     NA
3     3   1     11     40     47      0      0      0
4     4   1     94     87     NA      0      1     NA
5     5   0     46     77     NA      1      0     NA
6     6   1     98     82     NA      1      0     NA
7     7   1     43     NA     NA      1     NA     NA
8     8   0     74     NA      7      0     NA      1
9     9   0     57     NA     NA      1     NA     NA
10   10   0     49     72     NA      0      0     NA
...
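The .names spec above hard-codes the suffixes 1:3; if the number of nbid_* columns varies, the suffix can be derived from each column name instead (a sketch; the sub() call simply strips the nbid_ prefix):

df_base %>%
  mutate(across(starts_with("nbid"), ~ dum[.x],
                .names = "nbdum{sub('nbid_', '', .col)}"))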

Or same idea in base R:

df_base[paste0("nbdum", 1:3)] <- sapply(df_base[startsWith(names(df_base), "nbid")], \(x) df_base$dum[x])   