I have a data frame of individual animals with a unique ID, the lat/long where they were found, and the date they were found. The database has frequent returns of the same individual, and I have over 2000 individuals. I want to add a column to my data frame that calculates the Euclidean distance between each animal's current location and its previous location, and a second column that tells me which calculation number I'm on for each individual. The data frame is already sorted by sequential date. I'm trying to solve this in R.
Event | ID | Lat | Long |
---|---|---|---|
1 | 1 | 31.89 | -80.98 |
2 | 2 | 31.54 | -80.12 |
3 | 1 | 31.45 | -81.92 |
4 | 1 | 31.64 | -81.82 |
5 | 2 | 31.23 | -80.98 |
Add columns so that it looks like this:
Event | ID | Lat | Long | Dist. | Calculation # |
---|---|---|---|---|---|
1 | 1 | 31.89 | -80.98 | - | 0 |
2 | 2 | 31.54 | -80.12 | - | 0 |
3 | 1 | 31.45 | -81.92 | Distance between event 1 & 3 | 1 |
4 | 1 | 31.64 | -81.82 | Distance between event 3 & 4 | 2 |
5 | 2 | 31.23 | -80.98 | Distance between event 2 & 5 | 1 |
Is there a faster way to do this than a for loop? I'm stuck on where to start. I know I can use a distance function from a geospatial package once I have the unique IDs sorted, but I'm having trouble iterating through my data.
CodePudding user response:
Here is one option which leans on the `sf` and `dplyr` packages. The function `sf::st_distance` calculates distances between pairs of points, and `dplyr::lag` can be used to look "one row behind". You will want to confirm your coordinate system, which I guessed here is WGS84 (EPSG:4326).
library(dplyr)
library(sf)
dat <- read.table(text = " Event ID Lat Long
1 1 31.89 -80.98
2 2 31.54 -80.12
3 1 31.45 -81.92
4 1 31.64 -81.82
5 2 31.23 -80.98", header = TRUE)
# convert to an sf object; coordinates are (Long, Lat) in WGS84/EPSG:4326
dat_sf <- st_as_sf(dat, coords = c('Long', 'Lat'), crs = 4326)
dat_sf %>%
  arrange(ID) %>%                # put each individual's sightings together
  group_by(ID) %>%
  mutate(distance = as.numeric(st_distance(geometry, lag(geometry), by_element = TRUE)),
         calculation = row_number() - 1)  # 0 for the first sighting, then 1, 2, ...
#> Simple feature collection with 5 features and 4 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -81.92 ymin: 31.23 xmax: -80.12 ymax: 31.89
#> Geodetic CRS: WGS 84
#> # A tibble: 5 x 5
#> # Groups: ID [2]
#> Event ID geometry distance calculation
#> * <int> <int> <POINT [°]> <dbl> <dbl>
#> 1 1 1 (-80.98 31.89) NA 0
#> 2 3 1 (-81.92 31.45) 101524. 1
#> 3 4 1 (-81.82 31.64) 23155. 2
#> 4 2 2 (-80.12 31.54) NA 0
#> 5 5 2 (-80.98 31.23) 88615. 1
Created on 2022-11-14 by the reprex package (v2.0.0)
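If you then want the result back as a plain data frame in the original event order, one option (a follow-up sketch, not part of the output above) is to keep the coordinate columns by passing `remove = FALSE` to `st_as_sf` and to drop the geometry at the end:
# keep Long/Lat as ordinary columns alongside the geometry
dat_sf <- st_as_sf(dat, coords = c('Long', 'Lat'), crs = 4326, remove = FALSE)

dat_out <- dat_sf %>%
  arrange(ID) %>%
  group_by(ID) %>%
  mutate(distance = as.numeric(st_distance(geometry, lag(geometry), by_element = TRUE)),
         calculation = row_number() - 1) %>%
  ungroup() %>%
  arrange(Event) %>%     # restore the original sequential-date order
  st_drop_geometry()     # back to a regular data frame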
CodePudding user response:
Try this:
- load the `geosphere` library
- create demo data
- get all unique IDs and sort the data frame by ID and event
- append the last known coordinates of each animal to each row
- apply the distance function to each row
library(geosphere)  # distHaversine()
library(dplyr)      # arrange(), mutate()
library(magrittr)   # %>% and %<>% pipes
df <- data.frame(
event = seq(5),
id = c(1, 2, 1, 1, 2),
lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)
keys <- df$id %>% unique()
df %<>% dplyr::arrange(id, event)

# for each individual, copy the previous sighting's coordinates onto each row
df <- keys %>% lapply(
  function(key){
    tmp <- df[df$id == key, ]
    tmp$last_lat <- tmp$lat
    tmp$last_long <- tmp$long
    if (nrow(tmp) > 1) {  # guard: individuals seen only once have no previous location
      tmp[2:nrow(tmp), ]$last_lat  <- tmp[1:(nrow(tmp) - 1), ]$lat
      tmp[2:nrow(tmp), ]$last_long <- tmp[1:(nrow(tmp) - 1), ]$long
    }
    tmp
  }
) %>% do.call(rbind, .)
# haversine (great-circle) distance in metres; each animal's first sighting
# gets 0 because its last_* columns equal its own coordinates
df %<>% mutate(dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat)))
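The question also asks for a calculation counter per individual, which the code above does not add. Here is a minimal sketch using base R's `ave()` (assuming `df` is still sorted by `id` and `event`); it also blanks the placeholder 0 distance on each animal's first sighting to match the desired output:
df$calculation <- ave(df$event, df$id, FUN = seq_along) - 1  # 0, 1, 2, ... per individual
df$dist[df$calculation == 0] <- NA                           # no previous location yet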
Since you said you need speed, below is the same code as above but run in parallel:
library(tictoc)
library(parallel)
tic()
clust <- makeCluster(detectCores() - 1)  # one worker per core, leaving one free
df <- data.frame(
event = seq(5),
id = c(1, 2, 1, 1, 2),
lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)
keys <- df$id %>% unique
df %<>% dplyr::arrange(id, event)
clusterExport(clust, "df")              # copy df to each worker
clusterEvalQ(clust, library(magrittr))  # load magrittr on each worker
df <- keys %>% parLapply(
  clust, .,
  function(key){
    tmp <- df[df$id == key, ]
    tmp$last_lat <- tmp$lat
    tmp$last_long <- tmp$long
    if (nrow(tmp) > 1) {  # guard: individuals seen only once have no previous location
      tmp[2:nrow(tmp), ]$last_lat  <- tmp[1:(nrow(tmp) - 1), ]$lat
      tmp[2:nrow(tmp), ]$last_long <- tmp[1:(nrow(tmp) - 1), ]$long
    }
    tmp
  }
) %>% do.call(rbind, .)
df %<>% mutate(dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat)))
stopCluster(clust)  # shut the workers down
toc()
Above, `tictoc` just records the execution time. I created a cluster with your number of CPU cores minus one and changed the `lapply` call to `parLapply`.
The second version will be slower than the first on a small dataset (due to the overhead of setting up the parallel computation), but on a large dataset it will be much faster.
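For reference, the same shift-and-measure logic can be written with no per-ID split at all by combining `dplyr::lag` with `geosphere::distHaversine` inside a grouped `mutate`. This is only a sketch, but for a couple of thousand individuals it may be fast enough that the parallel version is unnecessary:
library(dplyr)
library(geosphere)

df <- df %>%
  arrange(id, event) %>%
  group_by(id) %>%
  mutate(dist = distHaversine(cbind(long, lat),
                              cbind(lag(long), lag(lat))),  # NA on each first sighting
         calculation = row_number() - 1) %>%
  ungroup()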