Home > Software engineering >  Spatial point distance analysis by group in R
Spatial point distance analysis by group in R

Time:09-17

I have a dataset which looks like this, though much larger

   ###   ##Fake data for stack exdb <- data.frame(zone =
 c(1,1,1,2,2,2),   site = c("study", "collect", "collect", "study",
 "collect", "collect"),   x = c(53.307726, 53.310660, 53.307089,
 53.313831, 53.319087, 53.318792),   y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815))

I need to run a point analysis between the STUDY site and the COLLECT site to see the distance in metres. The problem is that I have many different ZONES or groups that are all independent (i.e the distance from a point in zone 1 is irrelevant to a point in zone 2).

For this reason I need to do two things,

the point analysis, which computes the distance between the one study site per zone and the multiple collect sites in meters,

and then write a FOREACH or a LOOP function which calculates this distance for every group in the data set.

an optimal output would look like

exdb <- data.frame(zone = c(1,1,1,2,2,2),
  site = c("study", "collect", "collect", "study", "collect", "collect"),
  x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
  y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815),
  dist = c(0, 10.3, 30.4, 0, 12.5, 11.2))

Where the study site in each zone is always 0, as it is the distance from this site, and the distance to each collect site is ONLY CALCULATED TO THE STUDY SITE IN EACH UNIQUE ZONE.

Thank you very much.

Kil

CodePudding user response:

Simple Base R version, no other packages required.

Starting with exdb as above.

First add a new column called dist with the value "study" because the plan is to self-merge on zone and site=="study":

> exdb$dist = "study"

Self-Merge, keeping only the coordinate columns:

> MM = merge(exdb, exdb,
    by.x=c("zone","site"),
    by.y=c("zone","dist"))[,c("x.x","y.x","x.y","y.y")]

Use distGeo to overwrite the dist column. Keeps it neat and tidy:

> exdb$dist = distGeo(MM[,2:1],MM[,4:3])
> exdb
  zone    site        x         y     dist
1    1   study 53.30773 -6.222291   0.0000
2    1 collect 53.31066 -6.217151 473.2943
3    1 collect 53.30709 -6.215080 485.8806
4    2   study 53.31383 -6.214152   0.0000
5    2 collect 53.31909 -6.218723 659.5238
6    2 collect 53.31879 -6.215815 563.1349

Returns same answer as @wimpel but with no additional dependencies and in fewer lines of code.

CodePudding user response:

Maybe something like this?

Assuming x and y are latitude and longitude, we can use the haversine function to get the distance in meters after pivoting the table to have both points in a row between which the distance is being calculated from (in meters):

library(tidyverse)
library(pracma)
#> 
#> Attaching package: 'pracma'
#> The following object is masked from 'package:purrr':
#> 
#>     cross

data <- data.frame(zone = c(1, 1, 1, 2, 2, 2), site = c(
  "study", "collect", "collect", "study",
  "collect", "collect"
), x = c(
  53.307726, 53.310660, 53.307089,
  53.313831, 53.319087, 53.318792
), y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815))

data %>%
  pivot_wider(names_from = site, values_from = c(x, y)) %>%
  unnest(y_collect, y_study, x_collect, x_study) %>%
  mutate(
    dist = list(x_study, y_study, x_collect, y_collect) %>% pmap_dbl(~haversine(c(..1, ..2), c(..3, ..4)) * 1000)
  )
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates

#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
#> Warning: unnest() has a new interface. See ?unnest for details.
#> Try `df %>% unnest(c(y_collect, y_study, x_collect, x_study))`, with `mutate()` if needed
#> # A tibble: 4 x 6
#>    zone x_study x_collect y_study y_collect  dist
#>   <dbl>   <dbl>     <dbl>   <dbl>     <dbl> <dbl>
#> 1     1    53.3      53.3   -6.22     -6.22  472.
#> 2     1    53.3      53.3   -6.22     -6.22  484.
#> 3     2    53.3      53.3   -6.21     -6.22  659.
#> 4     2    53.3      53.3   -6.21     -6.22  563.

Created on 2021-09-13 by the reprex package (v2.0.1)

CodePudding user response:

I'm still learning the spatial side but does this work?

library(sf)
library(tidyverse)

exdb %>%
  arrange(zone, desc(site)) %>% #ensure study is first
  st_as_sf(coords = c("x", "y"), crs = 4326) %>%
  group_by(zone) %>%
  mutate(
    study_coord = geometry[1],
    dist = st_distance(geometry, study_coord, by_element = T),
  )

CodePudding user response:

I believe this should work.. But I could not reproduce your distances in the desired output.

library(data.table)
library(purrr) # Or tidyverse
library(geosphere)
# Make your data a data.table
setDT(mydata)
# Split to a list based on zone and site
L <- split(mydata, by = c("zone", "site"), flatten = FALSE)
# Loop over list
L <- lapply(L, function(zone) {
  #get reference point to take dustance from
  point.study <- c(zone$study$y,zone$study$x)
  zone$study$dist <- 0
  # Calculate distance
  zone$collect$dist <- unlist(purrr::pmap( list(a = zone$collect$y, 
                                                b = zone$collect$x ), 
                                           ~(geosphere::distGeo( point.study, c(..1, ..2)))))
  return(zone)
})
# Rowbind the results together
data.table::rbindlist(lapply(L, data.table::rbindlist))
#    zone    site        x         y     dist
# 1:    1   study 53.30773 -6.222291   0.0000
# 2:    1 collect 53.31066 -6.217151 473.2943
# 3:    1 collect 53.30709 -6.215080 485.8806
# 4:    2   study 53.31383 -6.214152   0.0000
# 5:    2 collect 53.31909 -6.218723 659.5238
# 6:    2 collect 53.31879 -6.215815 563.1349
  • Related