Home > OS >  How to subset data frame with another data frame's columns and give name to subsets R
How to subset data frame with another data frame's columns and give name to subsets R

Time:06-22

I have one data frame (Table_A) with 3.4 million rows and 33 columns. I have another data frame with 384 rows and 3 columns (Table_B). (This is for one participant, I should have 40 at the end)

Table_A

Col1
100
143
178
245
265

Table_B

start stop name
101 144 Name1
154 254 Name2

What I want to do is subset Table A by Col1, by start and stop columns in Table B and give each subset row a name. To return

Table_A adapted

Col1 name
143 Name1
178 Name2
245 Name2

I have tried

df_sub <- subset(Table_A, (Col1 >= (Table_B$start)) & (Col1 <= (Table_B$stop))```
names <- Table_B$name[(Table_A$Col1 >= (Table_B$start)) & (Table_A$Col1 <= (Table_B$Col2))]

df_out <- cbind(df_sub, names)

However, df_sub seems to only return one/two rows per subset and there should be ~187 in half (192) and ~375 in the other half. Whereas names returns 2million rows.

I tried

(Table_A$Col1 >= (Table_B$Col1)) & (Table_A$Col1 <= (Table_B$Col2))

and this returns a list of False up to 384 and NA after

CodePudding user response:

Example using a for loop.

Say we have the following dataframes

#dt, corresponds to table a
dt <- structure(list(col1 = c(100, 143, 178, 245, 265)), class = "data.frame", row.names = c(NA, 
-5L))

#dt2, corresponds to table b
structure(list(start = c(101, 154), stop = c(144, 254), name = c("Name1", 
"Name2")), class = "data.frame", row.names = c(NA, -2L))

We could loop through the rows of table B and add the names to table A if col1 falls within the range. Using data.table:

setDT(dt) #table A to data.table
setDT(dt2) #table B to data.table

#add column 'name'
dt[, name := ""]

#loop through table B
for(i in 1:nrow(dt2)){
  dt[col1 >= dt2[i,]$start & col1 <= dt2[i,]$stop, name := paste0(name, dt2[i,]$name)]
}

Output

dt
  col1  name
1:  100      
2:  143 Name1
3:  178 Name2
4:  245 Name2
5:  265 

Note I used a paste0 when adding the name. Otherwise, in the loop the names would be overwritten if the range in table B is not unique. Now, if the values fall within two ranges in table B, the name column will get two names pasted after eachother.

CodePudding user response:

this is a job for a data.table non-equi join

library(data.table)
# if tablea and tableb are not already data.table
setDT(tablea);setDT(tableb)

# non-equi join
tablea[tableb, name := i.name, on = .(Col1 >= start, Col1 <= stop)]
#    Col1  name
# 1:  100  <NA>
# 2:  143 Name1
# 3:  178 Name2
# 4:  245 Name2
# 5:  265  <NA>

sample data

tablea <- fread("Col1
                100
                143
                178
                245
                265")
tableb <- fread("start  stop    name
                101     144     Name1
                154     254     Name2")
  • Related