I have a list of all CpG locations (base pair value) for a gene on a methylation array in one table (table a), and another table with the locations (base pair value) for 12 CpGs for the same gene not present on the array (table b). I am trying to work out for each probe in table_b, which probe in table_a is the closest in bp.
i.e. table_a
# A tibble: 88 x 2
UCSC_RefGene_Name pos
<chr> <int>
1 RXRA 137218280
2 RXRA 137243592
3 RXRA 137330570
4 RXRA 137225311
5 RXRA 137299436
6 RXRA 137277819
7 RXRA 137268074
8 RXRA 137255666
9 RXRA;RXRA 137284989
10 RXRA 137218286
# ... with 78 more rows
table_b
CpG.position Human.genome.19.coordinates
1 1 137215735
2 2 137215739
3 3 137215748
4 4 137215772
5 5 137215779
6 6 137215867
7 7 137215956
8 8 137216015
9 9 137216030
10 10 137216034
11 11 137216036
12 12 137216064
My first step was to sequentially subtract the each value in A from the first row in B -
bibs <- function(table, value, column){
position <- sym(column)
smaps <-
table %>%
summarise(
"cpg_pos" = table$CpG.position,
"new_loc" = value - {{position}})
print(smaps)
}
posns <- table_a$positions
abso <- list()
for(i in seq_along(posns)){
a <- bibs(table_b, posns[[i]], "Human.genome.19.coordinates")
abso[[i]] <- a
}
This produces a list (abso) with 88 entries (1st entry below), so seemingly its only happened for the first value in table b.
cpg_pos new_loc
1 1 2545
2 2 2541
3 3 2532
4 4 2508
5 5 2501
6 6 2413
7 7 2324
8 8 2265
9 9 2250
10 10 2246
11 11 2244
12 12 2216
I wonder if anyone can help with getting it to move sequentially through each value in B?
Thanks, Matt
CodePudding user response:
Joining is equivalent to filtering the cross-product. We can sort all combinations of rows from both tables to pick the one with the closest distance:
library(tidyverse)
# example data
genes <- tibble(gene = c("A", "A", "B"), gene_pos = c(1, 30, 50))
genes
#> # A tibble: 3 × 2
#> gene gene_pos
#> <chr> <dbl>
#> 1 A 1
#> 2 A 30
#> 3 B 50
cpgs <- tibble(cpg = seq(3), cpg_pos = c(48, 51, 31))
cpgs
#> # A tibble: 3 × 2
#> cpg cpg_pos
#> <int> <dbl>
#> 1 1 48
#> 2 2 51
#> 3 3 31
cpgs %>%
expand_grid(genes) %>%
mutate(dist = abs(gene_pos - cpg_pos)) %>%
group_by(cpg) %>%
arrange(dist) %>%
slice(1)
#> # A tibble: 3 × 5
#> # Groups: cpg [3]
#> cpg cpg_pos gene gene_pos dist
#> <int> <dbl> <chr> <dbl> <dbl>
#> 1 1 48 B 50 2
#> 2 2 51 B 50 1
#> 3 3 31 A 30 1
Created on 2022-04-14 by the reprex package (v2.0.0)
CPG number 1 is at position 48. The closest gene position is position 50 of gene B which is 2bp apart.