Match by minimum Euclidean distance in R-CodePudding

I have the following dataset:

X <- data.frame(PERMNO = c(10001,10002,10003,10001,10002,10003),
                Date = c("Nov 2021","Nov 2021","Nov 2021","Dec 2021","Dec 2021","Dec 2021"),
                     ME = c(100,95,110,110,115,108),
                     IVOL = c(1,1.1,0.8,0.7,1,2.1),
                     C = c(NA, 2, 3,NA, 4, 1.5))

For firm 10001, the C is missing. I want to fill C, each month, by matching C from other firms by using the firm with non-missing C that minimizes the euclidean distance of the ranked ME and the ranked IVOL with the missing firm:

X in my application has more PERMNOs and a longer time frame, and multiple firms may have C missing. My question is how to code this efficiently in R.

Taking the ranks is straight forward by using rank() and the Euclidean distance can be calculated using outer() if I am correct. However, I struggle with making the pairs of firm i and j and then selecting the minimum distance and subsequently match C from firm j to the missing C for firm i.

CodePudding user response：

A data.table solution using colMins from the Rfast package.

library(data.table)

X <- data.frame(PERMNO = c(10001,10002,10003,10001,10002,10003),
                Date = c("Nov 2021","Nov 2021","Nov 2021","Dec 2021","Dec 2021","Dec 2021"),
                ME = c(100,95,110,110,115,108),
                IVOL = c(1,1.1,0.8,0.7,1,2.1),
                C = c(NA, 2, 3, NA, 4, 1.5))

fFillNA <- function(C, ME, IVOL) {
  idxNA <- which(is.na(C))
  C[idxNA] <- C[-idxNA][Rfast::colMins(outer(ME[-idxNA], ME[idxNA], "-")^2   outer(IVOL[-idxNA], IVOL[idxNA], "-")^2)]
  C
}

setDT(X)[, C := if(anyNA(C)) fFillNA(C, ME, IVOL), by = "Date"]
X
#>    PERMNO     Date  ME IVOL   C
#> 1:  10001 Nov 2021 100  1.0 2.0
#> 2:  10002 Nov 2021  95  1.1 2.0
#> 3:  10003 Nov 2021 110  0.8 3.0
#> 4:  10001 Dec 2021 110  0.7 1.5
#> 5:  10002 Dec 2021 115  1.0 4.0
#> 6:  10003 Dec 2021 108  2.1 1.5

No need to take the square root to get the index of the minimum distance. Also, notice that because of relative size, ME impacts the distance calculation much more than IVOL, at least for the example dataset. Maybe consider normalizing ME and IVOL in the distance calculation.

CodePudding user response：

Maybe this helps:

library(tidyverse)

X <- data.frame(
  PERMNO = c(10001, 10002, 10003, 10001, 10002, 10003),
  Date = c("Nov 2021", "Nov 2021", "Nov 2021", "Dec 2021", "Dec 2021", "Dec 2021"),
  ME = c(100, 95, 110, 110, 115, 108),
  IVOL = c(1, 1.1, 0.8, 0.7, 1, 2.1),
  C = c(NA, 2, 3, NA, 4, 1.5)
) %>% as_tibble()
X
#> # A tibble: 6 x 5
#>   PERMNO Date        ME  IVOL     C
#>    <dbl> <chr>    <dbl> <dbl> <dbl>
#> 1  10001 Nov 2021   100   1    NA  
#> 2  10002 Nov 2021    95   1.1   2  
#> 3  10003 Nov 2021   110   0.8   3  
#> 4  10001 Dec 2021   110   0.7  NA  
#> 5  10002 Dec 2021   115   1     4  
#> 6  10003 Dec 2021   108   2.1   1.5

imputations <-
  X %>%
  rename_all(~ paste0(.x, ".1")) %>%
  expand_grid(X %>% rename_all(~ paste0(., ".2"))) %>%
  mutate(
    dist = sqrt((rank(ME.1) - rank(ME.2))**2   (rank(IVOL.1) - rank(IVOL.2))**2)
  ) %>%
  group_by(PERMNO.1) %>%
  filter(PERMNO.1 != PERMNO.2) %>%
  arrange(dist) %>%
  slice(1) %>%
  ungroup() %>%
  transmute(
    PERMNO = PERMNO.1,
    imputed.C = case_when(
      !is.na(C.1) ~ C.1,
      !is.na(C.2) ~ C.2
    )
  )
imputations
#> # A tibble: 3 x 2
#>   PERMNO imputed.C
#>    <dbl>     <dbl>
#> 1  10001         3
#> 2  10002         2
#> 3  10003         3

X %>%
  left_join(imputations) %>%
  mutate(C = ifelse(is.na(C), imputed.C, C)) %>%
  select(-imputed.C)
#> Joining, by = "PERMNO"
#> # A tibble: 6 x 5
#>   PERMNO Date        ME  IVOL     C
#>    <dbl> <chr>    <dbl> <dbl> <dbl>
#> 1  10001 Nov 2021   100   1     3  
#> 2  10002 Nov 2021    95   1.1   2  
#> 3  10003 Nov 2021   110   0.8   3  
#> 4  10001 Dec 2021   110   0.7   3  
#> 5  10002 Dec 2021   115   1     4  
#> 6  10003 Dec 2021   108   2.1   1.5

^{Created on 2022-02-18 by the reprex package (v2.0.0)}