How to add a column to my data frame that calculates the distance between lat/long points between the current and previous location


I have a data frame of individual animals with a unique ID, the lat/long where each animal was found, and the date it was found. The database has frequent returns of the same individual, and there are over 2,000 individuals. I want to add a column that calculates the Euclidean distance between each animal's current and previous locations, and a second column that tells me which calculation number I'm on for each individual. The data frame is already sorted by date. I'm trying to solve this in R.

Event  ID  Lat    Long
1      1   31.89  -80.98
2      2   31.54  -80.12
3      1   31.45  -81.92
4      1   31.64  -81.82
5      2   31.23  -80.98

Add the columns so that it now looks like this:

Event  ID  Lat    Long    Dist.                          Calculation #
1      1   31.89  -80.98  -                              0
2      2   31.54  -80.12  -                              0
3      1   31.45  -81.92  distance between events 1 & 3  1
4      1   31.64  -81.82  distance between events 3 & 4  2
5      2   31.23  -80.98  distance between events 2 & 5  1

Is there a faster way to do this than a for loop? I'm stuck on where to start. I know I can use a distance function from a geospatial package once I have the unique IDs sorted out, but I'm having trouble iterating through my data.

CodePudding user response:

Here is one option that leans on the sf and dplyr packages. The function sf::st_distance() calculates distances between pairs of points, and dplyr::lag() can be used to look "one row behind" within each group. You will want to confirm your coordinate reference system; WGS 84 (EPSG:4326) is assumed here.

library(dplyr)
library(sf)

# Example data
dat <- read.table(text = "Event ID Lat Long
1 1 31.89 -80.98
2 2 31.54 -80.12
3 1 31.45 -81.92
4 1 31.64 -81.82
5 2 31.23 -80.98", header = TRUE)

# Convert to an sf object, assuming WGS 84 (EPSG:4326) coordinates
dat_sf <- st_as_sf(dat, coords = c('Long', 'Lat'), crs = 4326)


# For each animal: distance (in metres) from the previous sighting,
# plus a running calculation number (0 for the first sighting)
dat_sf %>%
  arrange(ID) %>%
  group_by(ID) %>%
  mutate(distance = as.numeric(st_distance(geometry, lag(geometry), by_element = TRUE)),
         calculation = row_number() - 1)
#> Simple feature collection with 5 features and 4 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -81.92 ymin: 31.23 xmax: -80.12 ymax: 31.89
#> Geodetic CRS:  WGS 84
#> # A tibble: 5 x 5
#> # Groups:   ID [2]
#>   Event    ID       geometry distance calculation
#> * <int> <int>    <POINT [°]>    <dbl>       <dbl>
#> 1     1     1 (-80.98 31.89)      NA            0
#> 2     3     1 (-81.92 31.45)  101524.           1
#> 3     4     1 (-81.82 31.64)   23155.           2
#> 4     2     2 (-80.12 31.54)      NA            0
#> 5     5     2 (-80.98 31.23)   88615.           1

Created on 2022-11-14 by the reprex package (v2.0.0)
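
If you would rather end up with a plain data frame, and with distances in kilometres rather than the metres st_distance() returns for a geodetic CRS, here is a minimal follow-up sketch (assuming the dat_sf object built above):

# Same pipeline, then rescale to km and drop the geometry column
dat_sf %>%
  arrange(ID) %>%
  group_by(ID) %>%
  mutate(distance_km = as.numeric(st_distance(geometry, lag(geometry), by_element = TRUE)) / 1000,
         calculation = row_number() - 1) %>%
  ungroup() %>%
  st_drop_geometry()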

CodePudding user response:

Try this:

  1. load the geosphere library
  2. create demo data
  3. get all unique IDs and sort the data frame by ID and event
  4. append the last known coordinates of each animal to each row
  5. apply the distance function to each row

library(geosphere)
library(dplyr)
library(magrittr)  # for %<>%

df <- data.frame(
    event = seq(5),
    id = c(1, 2, 1, 1, 2),
    lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
    long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)

keys <- df$id %>% unique()
df %<>% arrange(id, event)

# For each animal, copy the previous sighting's coordinates into
# last_lat/last_long; the first sighting keeps its own coordinates,
# so its distance works out to 0
df <- keys %>% lapply(
    function(key) {
        tmp <- df[df$id == key, ]
        tmp$last_lat <- tmp$lat
        tmp$last_long <- tmp$long
        if (nrow(tmp) > 1) {
            tmp$last_lat[-1] <- tmp$lat[-nrow(tmp)]
            tmp$last_long[-1] <- tmp$long[-nrow(tmp)]
        }
        tmp
    }
) %>% do.call(rbind, .)

df %<>% mutate(dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat)))
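
As an aside, the lapply() step can be skipped entirely. Here is a hedged sketch of the same last-coordinates construction using a grouped dplyr::lag(), filling each animal's first sighting with its own coordinates to match the output above:

# Alternative: build last_lat/last_long with a grouped lag()
df_alt <- df %>%
    group_by(id) %>%
    mutate(last_lat  = coalesce(lag(lat), lat),
           last_long = coalesce(lag(long), long),
           dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat))) %>%
    ungroup()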

Since you said you need speed, below is the same code as above but run in parallel:

library(tictoc)
library(parallel)
library(geosphere)
library(dplyr)
library(magrittr)  # for %<>%

tic()
clust <- makeCluster(detectCores() - 1)

df <- data.frame(
    event = seq(5),
    id = c(1, 2, 1, 1, 2),
    lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
    long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)
keys <- df$id %>% unique()
df %<>% arrange(id, event)

# Ship the data frame to the workers, then process each animal in parallel
clusterExport(clust, "df")
df <- keys %>% parLapply(
    clust, .,
    function(key) {
        tmp <- df[df$id == key, ]
        tmp$last_lat <- tmp$lat
        tmp$last_long <- tmp$long
        if (nrow(tmp) > 1) {
            tmp$last_lat[-1] <- tmp$lat[-nrow(tmp)]
            tmp$last_long[-1] <- tmp$long[-nrow(tmp)]
        }
        tmp
    }
) %>% do.call(rbind, .)

df %<>% mutate(dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat)))

stopCluster(clust)
toc()

Above, tictoc just records the execution time. I created a cluster with the number of CPU cores minus one and changed the lapply() call to parLapply(). The second version will be slower than the first on a small dataset (because of the overhead of setting up the parallel computation), but on a large dataset it will be much faster.
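
To see that crossover yourself, here is a hedged sketch that replaces the five-row demo with a larger synthetic dataset (sized loosely after the question's 2,000+ individuals) before running either version:

# Hypothetical timing data: 2000 animals with 20 sightings each
n_ids <- 2000
n_obs <- 20
df <- data.frame(
    event = seq_len(n_ids * n_obs),
    id    = rep(seq_len(n_ids), each = n_obs),
    lat   = runif(n_ids * n_obs, 31, 32),
    long  = runif(n_ids * n_obs, -82, -80)
)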
