I have two data frames in R, and I need to calculate values based on a subset of the columns in each of them. This is geolocated data, so I've been using geosphere for some of the calculations. These are the two data frames: dat1 contains a list of lat/long coordinates for random points on the surface of the Earth, and dat2 is a set of objects on Earth, each with a lat/long location and a size and speed associated with it.
library(geosphere)
set.seed(1)

dat1 <- data.frame( long = runif( n = 10, min = -180, max = 180 ),
                    lat  = runif( n = 10, min = -90,  max = 90 ) )

dat2 <- data.frame( long  = runif( n = 10, min = -180, max = 180 ),
                    lat   = runif( n = 10, min = -90,  max = 90 ),
                    size  = runif( n = 10, min = 0,    max = 1500 ),
                    speed = rnorm( n = 10, mean = 100, sd = 30 ) )
I need to calculate the distance from each point in dat1 to all of the objects in dat2, while also keeping the size and speed data. I've been doing this by creating a list of the dat1 locations:
list.dat1 <- split( dat1, 1:nrow( dat1 ) )
and then filling it in with a double for loop (bad form in R, I know), which works okay:
for( i in 1:length( list.dat1 ) ){
  for( j in 1:nrow( dat2 ) ) {
    ## 2 x 2 matrix of the two locations: row 1 = dat2 object, row 2 = dat1 point; columns are long, lat
    two.points = matrix( c( dat2[ j, 'long' ], dat2[ j, 'lat' ],
                            dat1[ i, 'long' ], dat1[ i, 'lat' ] ),
                         nrow = 2, ncol = 2, byrow = T )
    ## now add the data from the objects
    list.dat1[[ i ]][ j, 3 ] = dat2[ j, 'long' ]                   # add object long
    list.dat1[[ i ]][ j, 4 ] = dat2[ j, 'lat' ]                    # add object lat
    ## calculate distance
    list.dat1[[ i ]][ j, 5 ] = distGeo( two.points )[ 1 ] / 1000   # distance in kilometers
    list.dat1[[ i ]][ j, 6 ] = dat2[ j, 'size' ]                   # add size of object
    list.dat1[[ i ]][ j, 7 ] = dat2[ j, 'speed' ]                  # add speed of object
  }
}
Then I just rename the columns:
for( i in 1:length( list.dat1 ) ) {
  colnames( list.dat1[[ i ]] ) = c( "point.long", "point.lat",
                                    "object.long", "object.lat",
                                    "distance.to.object", "object.size",
                                    "object.speed" )
}
This is horribly inefficient, as dat1 has 1,000 rows and dat2 has anywhere from 100 to 100,000 rows.
I was thinking I could make list.dat1 a list of data frames, each with nrow(dat2) rows, but I'm not sure how to accomplish that. Could I then simply cbind the dat2 data into each element of list.dat1 using lapply, and finally do the distGeo() calculation on each row of each list element? I'm still learning how to use R lists and the apply() family of functions efficiently, so any help on making this more efficient would be greatly appreciated!
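For what it's worth, something along these lines is roughly what I'm imagining, though I'm not sure it's correct or idiomatic (just a sketch):
list.dat1 <- lapply( seq_len( nrow( dat1 ) ), function( i ) {
  ## cbind recycles the single dat1 point across all rows of dat2
  out <- cbind( point.long = dat1$long[ i ],
                point.lat  = dat1$lat[ i ],
                dat2 )
  ## distGeo() is vectorised, so it handles all of dat2 in one call (result in meters)
  out$distance.to.object <- distGeo( cbind( out$point.long, out$point.lat ),
                                     cbind( out$long, out$lat ) ) / 1000   # kilometers
  ## long/lat/size/speed come from dat2 and could be renamed to object.* afterwards
  out
} )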
CodePudding user response:
You can create a new data frame containing all combinations of the points in dat1 and dat2 with tidyr::crossing() and then call distGeo() on the resulting data frame.
library(dplyr)
library(tidyr)

dat1 %>%
  # rename dat1's long/lat to distinguish them from dat2's columns
  rename(dat1_long = long, dat1_lat = lat) %>%
  # create all combinations of dat1 points and dat2 objects
  tidyr::crossing(dat2) %>%
  # compute the distance (in meters) between each pair of points
  mutate(dist = distGeo(p1 = cbind(dat1_long, dat1_lat),
                        p2 = cbind(long, lat)))
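If you still want one data frame per dat1 point (like your list.dat1), one option is to add a row id before crossing and split on it afterwards. A minimal sketch, where d1_id and dist_km are just names introduced here:
crossed <- dat1 %>%
  mutate(d1_id = row_number()) %>%               # helper id, one per dat1 point
  rename(dat1_long = long, dat1_lat = lat) %>%
  tidyr::crossing(dat2) %>%
  mutate(dist_km = distGeo(p1 = cbind(dat1_long, dat1_lat),
                           p2 = cbind(long, lat)) / 1000)   # kilometers, as in your loop
list.dat1 <- split(crossed, crossed$d1_id)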
CodePudding user response:
Here is a data.table approach which might help with your large datasets:
library(data.table)
library(magrittr)   # for the %>% pipe used below

setDT(dat1)[, `:=`(d1_id = .I, id = 1)]   # add a row id and a constant join key
setDT(dat2)[, `:=`(d2_id = .I, id = 1)]

dat1[dat2, on = .(id), allow.cartesian = TRUE] %>%   # cross join: every dat1 row paired with every dat2 row
  .[, dist := distGeo(cbind(long, lat),              # long/lat are dat1's coordinates
                      cbind(i.long, i.lat))] %>%     # i.long/i.lat are dat2's; distGeo() returns meters
  .[, .(d1_id, long, lat, d2_id, size, speed, dist)]
Output (first six rows):
   d1_id       long         lat d2_id      size    speed     dist
1:     1  -84.41688 -52.9245765     1 1231.4194 98.31614 11610004
2:     2  -46.03540 -58.2197845     1 1231.4194 98.31614 12903703
3:     3   26.22721  33.6641124     1 1231.4194 98.31614 13868897
4:     4  146.95480 -20.8613307     1 1231.4194 98.31614  2208091
5:     5 -107.39450  48.5714556     1 1231.4194 98.31614 10727196
6:     6  143.42029  -0.4141364     1 1231.4194 98.31614  1487321
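Two small follow-ups: dist is in meters, so divide it by 1000 if you want kilometers as in your loop, and the id = 1 column exists only to force the cartesian join, so you may want to drop it from dat1 and dat2 afterwards:
## remove the helper join key that was added by reference above
dat1[, id := NULL]
dat2[, id := NULL]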