I have one data frame (Table_A) with 3.4 million rows and 33 columns. I have another data frame with 384 rows and 3 columns (Table_B). (This is for one participant, I should have 40 at the end)
Table_A
Col1 |
---|
100 |
143 |
178 |
245 |
265 |
Table_B
start | stop | name |
---|---|---|
101 | 144 | Name1 |
154 | 254 | Name2 |
What I want to do is subset Table A by Col1, by start and stop columns in Table B and give each subset row a name. To return
Table_A adapted
Col1 | name |
---|---|
143 | Name1 |
178 | Name2 |
245 | Name2 |
I have tried
df_sub <- subset(Table_A, (Col1 >= (Table_B$start)) & (Col1 <= (Table_B$stop))```
names <- Table_B$name[(Table_A$Col1 >= (Table_B$start)) & (Table_A$Col1 <= (Table_B$Col2))]
df_out <- cbind(df_sub, names)
However, df_sub seems to only return one/two rows per subset and there should be ~187 in half (192) and ~375 in the other half. Whereas names returns 2million rows.
I tried
(Table_A$Col1 >= (Table_B$Col1)) & (Table_A$Col1 <= (Table_B$Col2))
and this returns a list of False up to 384 and NA after
CodePudding user response:
Example using a for loop.
Say we have the following dataframes
#dt, corresponds to table a
dt <- structure(list(col1 = c(100, 143, 178, 245, 265)), class = "data.frame", row.names = c(NA,
-5L))
#dt2, corresponds to table b
structure(list(start = c(101, 154), stop = c(144, 254), name = c("Name1",
"Name2")), class = "data.frame", row.names = c(NA, -2L))
We could loop through the rows of table B and add the names to table A if col1 falls within the range. Using data.table
:
setDT(dt) #table A to data.table
setDT(dt2) #table B to data.table
#add column 'name'
dt[, name := ""]
#loop through table B
for(i in 1:nrow(dt2)){
dt[col1 >= dt2[i,]$start & col1 <= dt2[i,]$stop, name := paste0(name, dt2[i,]$name)]
}
Output
dt
col1 name
1: 100
2: 143 Name1
3: 178 Name2
4: 245 Name2
5: 265
Note I used a paste0
when adding the name. Otherwise, in the loop the names would be overwritten if the range in table B is not unique. Now, if the values fall within two ranges in table B, the name
column will get two names pasted after eachother.
CodePudding user response:
this is a job for a data.table non-equi join
library(data.table)
# if tablea and tableb are not already data.table
setDT(tablea);setDT(tableb)
# non-equi join
tablea[tableb, name := i.name, on = .(Col1 >= start, Col1 <= stop)]
# Col1 name
# 1: 100 <NA>
# 2: 143 Name1
# 3: 178 Name2
# 4: 245 Name2
# 5: 265 <NA>
sample data
tablea <- fread("Col1
100
143
178
245
265")
tableb <- fread("start stop name
101 144 Name1
154 254 Name2")