Sequentually subtract each value in a vector from all values in another vector-CodePudding

I have a list of all CpG locations (base pair value) for a gene on a methylation array in one table (table a), and another table with the locations (base pair value) for 12 CpGs for the same gene not present on the array (table b). I am trying to work out for each probe in table_b, which probe in table_a is the closest in bp.

i.e. table_a

# A tibble: 88 x 2
   UCSC_RefGene_Name       pos
   <chr>                 <int>
 1 RXRA              137218280
 2 RXRA              137243592
 3 RXRA              137330570
 4 RXRA              137225311
 5 RXRA              137299436
 6 RXRA              137277819
 7 RXRA              137268074
 8 RXRA              137255666
 9 RXRA;RXRA         137284989
10 RXRA              137218286
# ... with 78 more rows

table_b

   CpG.position Human.genome.19.coordinates
1             1                   137215735
2             2                   137215739
3             3                   137215748
4             4                   137215772
5             5                   137215779
6             6                   137215867
7             7                   137215956
8             8                   137216015
9             9                   137216030
10           10                   137216034
11           11                   137216036
12           12                   137216064

My first step was to sequentially subtract the each value in A from the first row in B -

bibs <- function(table, value, column){
position <- sym(column)
smaps <- 
  table %>% 
  summarise(
    "cpg_pos" = table$CpG.position,
    "new_loc" = value - {{position}})
print(smaps)
}

posns <- table_a$positions

abso <- list()
for(i in seq_along(posns)){
  a <- bibs(table_b, posns[[i]], "Human.genome.19.coordinates")
  abso[[i]] <- a
}

This produces a list (abso) with 88 entries (1st entry below), so seemingly its only happened for the first value in table b.

   cpg_pos new_loc
1        1    2545
2        2    2541
3        3    2532
4        4    2508
5        5    2501
6        6    2413
7        7    2324
8        8    2265
9        9    2250
10      10    2246
11      11    2244
12      12    2216

I wonder if anyone can help with getting it to move sequentially through each value in B?

Thanks, Matt

CodePudding user response：

Joining is equivalent to filtering the cross-product. We can sort all combinations of rows from both tables to pick the one with the closest distance:

library(tidyverse)

# example data
genes <- tibble(gene = c("A", "A", "B"), gene_pos = c(1, 30, 50))
genes
#> # A tibble: 3 × 2
#>   gene  gene_pos
#>   <chr>    <dbl>
#> 1 A            1
#> 2 A           30
#> 3 B           50

cpgs <- tibble(cpg = seq(3), cpg_pos = c(48, 51, 31))
cpgs
#> # A tibble: 3 × 2
#>     cpg cpg_pos
#>   <int>   <dbl>
#> 1     1      48
#> 2     2      51
#> 3     3      31

cpgs %>%
  expand_grid(genes) %>%
  mutate(dist = abs(gene_pos - cpg_pos)) %>%
  group_by(cpg) %>%
  arrange(dist) %>%
  slice(1)
#> # A tibble: 3 × 5
#> # Groups:   cpg [3]
#>     cpg cpg_pos gene  gene_pos  dist
#>   <int>   <dbl> <chr>    <dbl> <dbl>
#> 1     1      48 B           50     2
#> 2     2      51 B           50     1
#> 3     3      31 A           30     1

^{Created on 2022-04-14 by the reprex package (v2.0.0)}

CPG number 1 is at position 48. The closest gene position is position 50 of gene B which is 2bp apart.