I have a dataframe where each row represents a spatial unit. The nbid* variables indicate which unit is a neighbour. I would like to get the dum variable of the neighbour into the main dataframe. (Instead of spatial units it could be any kind of relations within a dataframe - business partners, relatives, related genes etc.) Some simplified data look like this:
seed(999)
df_base <- data.frame(id = seq(1:100),
dum= sample(c(rep(0,50), rep(1,50)),100),
nbid_1=sample(1:100,100),
nbid_2=sample(1:100,100),
nbid_3=sample(1:100,100)) %>%
mutate(nbid_1 = replace(nbid_1, sample(row_number(), size = ceiling(0.1 * n()), replace = FALSE), NA),
nbid_2 = replace(nbid_2, sample(row_number(), size = ceiling(0.3 * n()), replace = FALSE), NA),
nbid_3 = replace(nbid_3, sample(row_number(), size = ceiling(0.7 * n()), replace = FALSE), NA))
(In these simplified data and other than in the real data, neighbours 1,2 and 3 can be the same, but that does not matter for the question.)
My approach was to duplicate and then join the data, which would look like this:
df1 <- df_base
df2 <- df_base %>%
select(-c(nbid_1,nbid_2,nbid_3)) %>%
rename(nbdum=dum)
df <- left_join(df1,df2,by=c("nbid_1"="id")) %>%
rename(nbdum1=nbdum) %>%
left_join(.,df2,by=c("nbid_2"="id")) %>%
rename(nbdum2=nbdum) %>%
left_join(.,df2,by=c("nbid_3"="id")) %>%
rename(nbdum3=nbdum)
df is the result that I am looking for - from here I can create an overall neighbour dummy or a count. This approach is however neither elegant nor feasible to implement with the real data which has many more neighbours.
How can I solve this in a less clumsy way?
Thanks in advance for your ideas!!
CodePudding user response:
A key clue is that when you see var_1, var_2, ..., var_n
, it suggests that the data can be transformed to be longer. See pivot_longer()
or data.table::melt()
where molten
data is discussed frequently.
For your example, we can pivot and then join the df2
table back. I am unsure if the format is needed but after the join, we can pivot back to wide with pivot_wider()
.
library(dplyr)
library(tidyr)
df1 %>%
select(!id) %>%
pivot_longer(cols = starts_with("nbid"), names_prefix = "nbid_")%>%
mutate(original_id = rep(1:100, each = 3))%>%
left_join(df2, by = c("value" = "id"))%>%
pivot_wider(original_id, values_from = c(value, nbdum))
#> # A tibble: 100 × 7
#> original_id value_1 value_2 value_3 nbdum_1 nbdum_2 nbdum_3
#> <int> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 25 90 23 0 0 1
#> 2 2 12 NA NA 1 NA NA
#> 3 3 11 40 47 0 0 0
#> 4 4 94 87 NA 0 1 NA
#> 5 5 46 77 NA 1 0 NA
#> 6 6 98 82 NA 1 0 NA
#> 7 7 43 NA NA 1 NA NA
#> 8 8 74 NA 7 0 NA 1
#> 9 9 57 NA NA 1 NA NA
#> 10 10 49 72 NA 0 0 NA
#> # … with 90 more rows
## compare to original
as_tibble(df)
#> # A tibble: 100 × 8
#> id dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
#> <int> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 0 25 90 23 0 0 1
#> 2 2 1 12 NA NA 1 NA NA
#> 3 3 1 11 40 47 0 0 0
#> 4 4 1 94 87 NA 0 1 NA
#> 5 5 0 46 77 NA 1 0 NA
#> 6 6 1 98 82 NA 1 0 NA
#> 7 7 1 43 NA NA 1 NA NA
#> 8 8 0 74 NA 7 0 NA 1
#> 9 9 0 57 NA NA 1 NA NA
#> 10 10 0 49 72 NA 0 0 NA
#> # … with 90 more rows
CodePudding user response:
As you just seem to be indexing dum
with your neighbor variables you should be able to do:
library(dplyr)
df_base %>%
mutate(across(starts_with("nbid"), ~ dum[.x], .names = "nbdum_{1:3}"))
id dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
1 1 0 25 90 23 0 0 1
2 2 1 12 NA NA 1 NA NA
3 3 1 11 40 47 0 0 0
4 4 1 94 87 NA 0 1 NA
5 5 0 46 77 NA 1 0 NA
6 6 1 98 82 NA 1 0 NA
7 7 1 43 NA NA 1 NA NA
8 8 0 74 NA 7 0 NA 1
9 9 0 57 NA NA 1 NA NA
10 10 0 49 72 NA 0 0 NA
...
Or same idea in base R:
df_base[paste0("nbdum", 1:3)] <- sapply(df_base[startsWith(names(df_base), "nbid")], \(x) df_base$dum[x])