Using the foverlaps
function from the data.table
package I get overlapping regions (it shows only 25 lines but it's more than 50 thousand) and I would like to group the overlapping regions for each id taking into account the following criteria:
If they have the same ID and overlapping regions belonging to the same or different group, then:
- group them all, 2) extend the range (i.e. start = min(overlapping item set) and end=max(overlapping item set)), and 3) place the name of the group of the maximum score. For example, given the data set:
dt <- data.table::data.table(
ID=c("1015_4_1_1","1015_4_1_1","1015_4_1_1","103335_0_1_2","103335_0_1_2",
"103335_0_1_2","11099_0_1_1","11099_0_1_1","11099_0_1_1","11099_0_1_1","11099_0_1_1",
"11702_0_1_1","11702_0_1_1","11702_0_1_1","11702_0_1_5","11702_0_1_5","11702_0_1_5",
"140331_0_1_1","140331_0_1_1","140331_0_1_1","14115_0_1_7","14115_0_1_7",
"14115_0_1_7","14115_0_1_8","14115_0_1_8"),
start=c(193,219,269,149,149,163,51,85,314,331,410,6193,6269,6278,6161,6238,6246,303,304,316,1525,1526,1546,1542,1543),
end=c(307,273,399,222,235,230,158,128,401,428,507,6355,6337,6356,6323,6305,6324,432,396,406,1603,1688,1612,1620,1705),
group=c("R7","R5","R5","R4","R5","R6","R7","R5","R4","R5","R5","R5","R6","R4","R5","R6","R4","R5","R4","R6","R4","R5","R6","R4","R5"),
score=c(394,291,409,296,319,271,318,252,292,329,252,524,326,360,464,340,335,515,506,386,332,501,307,308,443)
)
The expected result is:
# 1015_4_1_1 193 399 R5 409
# 103335_0_1_2 149 235 R5 319
# 11099_0_1_1 51 158 R7 318
# 11099_0_1_1 314 507 R5 329
# 11702_0_1_1 6193 6356 R5 524
# 11702_0_1_5 6161 6324 R5 464
# 140331_0_1_1 303 432 R5 515
# 14115_0_1_7 1525 1705 R5 501
note that for each ID there may be subgroups of regions that do not overlap each other, for example in "11099_0_1_1" rows 7 and 8 are grouped in one subgroup and the rest in another subgroup.
I have no experience with GenomicRanges
or IRanges
, and read in another comment that data.table
is usually faster. So, since I was expecting a lot of overlapping regions, I started with foverlaps from data.table
, but I don't know how to proceed. I hope you can help me, and thank you very much in advance
CodePudding user response:
If your group is the full ID
, then you could do:
dt <- dt[
,IDy := cumsum(fcoalesce( (start > (shift(cummax(end), type = 'lag') 1L)), 0L)), by = ID][
, .(start = min(start), end = max(end),
group = group[which.max(score)],
score = max(score)),
by = .(ID, IDy)][, IDy := NULL]
Output (additional 443
score added as it represents the 14115_0_1_8
):
ID start end group score
1: 1015_4_1_1 193 399 R5 409
2: 103335_0_1_2 149 235 R5 319
3: 11099_0_1_1 51 158 R7 318
4: 11099_0_1_1 314 507 R5 329
5: 11702_0_1_1 6193 6356 R5 524
6: 11702_0_1_5 6161 6324 R5 464
7: 140331_0_1_1 303 432 R5 515
8: 14115_0_1_7 1525 1688 R5 501
9: 14115_0_1_8 1541 1705 R5 443
In case your ID
group are actually only the numbers before the underscore, then:
library(data.table)
dt <- dt[, IDx := sub('_.*', '', ID)][
, IDy := cumsum(fcoalesce( (start > (shift(cummax(end), type = 'lag') 1L)), 0L)), by = IDx][
, .(ID = ID[which.max(score)],
start = min(start), end = max(end),
group = group[which.max(score)],
score = max(score)),
by = .(IDx, IDy)
][, c('IDx', 'IDy') := NULL]
Output (lacks 464
from your example):
dt
ID start end group score
1: 1015_4_1_1 193 399 R5 409
2: 103335_0_1_2 149 235 R5 319
3: 11099_0_1_1 51 158 R7 318
4: 11099_0_1_1 314 507 R5 329
5: 11702_0_1_1 6161 6356 R5 524
6: 140331_0_1_1 303 432 R5 515
7: 14115_0_1_7 1525 1705 R5 501
The above assumes that your start
variable is already ordered from lowest to highest. If this is not the case, just do the setorder(dt, start)
before executing the above code.