I have a dataframe consisting of a series of timestamps with lat-lon point locations from animal GPS tracking data, grouped into separate trips made by each animal. For each timestamped lat-lon, I also have the distance of the point from the animal's home colony (in km).
I would like to classify each point by whether it occurred before or after the animal reached its maximum distance from its home colony.
The aim is to have a column in the dataframe stating whether the timestamped lat-lon occurs during the outward section of the animal's trip (defined as all points before the animal reached its maximum distance from the colony) or the return section (all points that occurred after the animal reached its maximum distance from the colony and before it returned to the colony).
Here is example data from 2 trips, together with my desired output: the table below, with the addition of the 'Loc_Class' (location classification) column, where MAX = the maximum distance from the colony, OUT = points falling before the animal reaches that MAX, and RET = points after the animal has reached the maximum distance from the colony and is returning to it.
Trip_ID | Timestamp | LON | LAT | Colony_lat | Colony_lon | Dist_to_Colony | Loc_Class |
---|---|---|---|---|---|---|---|
A | 18/01/2022 14:00 | -2.81698 | -69.831474 | -71.89 | 5.159 | 369.9948202 | MAX |
A | 18/01/2022 14:30 | -2.750411 | -69.811873 | -71.89 | 5.159 | 369.5644383 | RET |
A | 18/01/2022 15:00 | -2.736943 | -69.811022 | -71.89 | 5.159 | 369.2463158 | RET |
A | 18/01/2022 15:30 | -2.645026 | -69.804136 | -71.89 | 5.159 | 367.1665826 | RET |
A | 18/01/2022 16:00 | -2.56825 | -69.833432 | -71.89 | 5.159 | 362.7877481 | RET |
B | 18/01/2022 21:30 | -3.046828 | -69.784849 | -71.89 | 5.159 | 380.0350746 | OUT |
B | 18/01/2022 22:00 | -3.080154 | -69.765688 | -71.89 | 5.159 | 382.4142364 | OUT |
B | 19/01/2022 00:30 | -3.025742 | -69.634483 | -71.89 | 5.159 | 390.8078861 | MAX |
B | 19/01/2022 01:00 | -2.898522 | -69.672147 | -71.89 | 5.159 | 384.3511473 | RET |
B | 19/01/2022 01:30 | -2.907463 | -69.769916 | -71.89 | 5.159 | 377.173593 | RET |
library(tidyverse)
library(dplyr)
library(geosphere)
#load dataframe
df <- read.csv("Tracking_Data.csv")
#Great circle (geodesic) - add the great circle distance between the timestamped location and the animals' colony
df_2 <- df %>% mutate(dist_to_colony = distGeo(cbind(LON, LAT), cbind(Colony_lon, Colony_lat)))
#change distance from colony from m to km
df_2 <- df_2 %>% mutate(dist_to_colony = dist_to_colony/1000)
#find the maximum distance from the colony for each animal's trip
Max_dist_colony <- df_2 %>% group_by(Trip_ID) %>% summarise(max_dist_to_colony = max(dist_to_colony))
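#If the row at which the maximum occurs is needed (not just the value), one option
#is slice_max(); this is only a sketch, assuming the Trip_ID and dist_to_colony
#column names used above
Max_dist_row <- df_2 %>%
  group_by(Trip_ID) %>%
  slice_max(dist_to_colony, n = 1, with_ties = FALSE) %>%
  ungroup()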
#so now I need to classify each point using the 'Timestamp' and 'Dist_to_Colony' columns and make a 'Loc_Class' column:
#example df
| Trip_ID | Timestamp | LON | LAT | Colony_lat | Colony_lon | Dist_to_Colony |
|---|---|---|---|---|---|---|
| A | 18/01/2022 14:00 | -2.81698 | -69.831474 | -71.89 | 5.159 | 369.9948202 |
| A | 18/01/2022 14:30 | -2.750411 | -69.811873 | -71.89 | 5.159 | 369.5644383 |
| A | 18/01/2022 15:00 | -2.736943 | -69.811022 | -71.89 | 5.159 | 369.2463158 |
| A | 18/01/2022 15:30 | -2.645026 | -69.804136 | -71.89 | 5.159 | 367.1665826 |
| A | 18/01/2022 16:00 | -2.56825 | -69.833432 | -71.89 | 5.159 | 362.7877481 |
| B | 18/01/2022 21:30 | -3.046828 | -69.784849 | -71.89 | 5.159 | 380.0350746 |
| B | 18/01/2022 22:00 | -3.080154 | -69.765688 | -71.89 | 5.159 | 382.4142364 |
| B | 19/01/2022 00:30 | -3.025742 | -69.634483 | -71.89 | 5.159 | 390.8078861 |
| B | 19/01/2022 01:00 | -2.898522 | -69.672147 | -71.89 | 5.159 | 384.3511473 |
| B | 19/01/2022 01:30 | -2.907463 | -69.769916 | -71.89 | 5.159 | 377.173593 |
CodePudding user response:
Something like this?
comp3 <- function(vec, val, out = -1:1) ifelse(abs(vec - val) < 1e-9, out[2], ifelse(vec < val, out[1], out[3]))
quux %>%
group_by(Trip_ID) %>%
mutate(Direction = comp3(row_number(), which.max(Dist_to_Colony), c("OUT", "MAX", "RET"))) %>%
ungroup()
# # A tibble: 10 x 9
# Trip_ID Timestamp LON LAT Colony_lat Colony_lon Dist_to_Colony Loc_Class Direction
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
# 1 A 18/01/2022 14:00 -2.82 -69.8 -71.9 5.16 370. MAX MAX
# 2 A 18/01/2022 14:30 -2.75 -69.8 -71.9 5.16 370. RET RET
# 3 A 18/01/2022 15:00 -2.74 -69.8 -71.9 5.16 369. RET RET
# 4 A 18/01/2022 15:30 -2.65 -69.8 -71.9 5.16 367. RET RET
# 5 A 18/01/2022 16:00 -2.57 -69.8 -71.9 5.16 363. RET RET
# 6 B 18/01/2022 21:30 -3.05 -69.8 -71.9 5.16 380. OUT OUT
# 7 B 18/01/2022 22:00 -3.08 -69.8 -71.9 5.16 382. OUT OUT
# 8 B 19/01/2022 00:30 -3.03 -69.6 -71.9 5.16 391. MAX MAX
# 9 B 19/01/2022 01:00 -2.90 -69.7 -71.9 5.16 384. RET RET
# 10 B 19/01/2022 01:30 -2.91 -69.8 -71.9 5.16 377. RET RET
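As an aside, an equivalent way to do the per-trip classification without the helper is to compare row_number() against which.max() inside case_when(); the following is only a sketch of that alternative (the column name Loc_Class2 is made up to avoid clobbering the existing Loc_Class column). The comp3 helper itself is explained below.
quux %>%
  group_by(Trip_ID) %>%
  mutate(Loc_Class2 = case_when(
    row_number() == which.max(Dist_to_Colony) ~ "MAX",
    row_number() <  which.max(Dist_to_Colony) ~ "OUT",
    TRUE                                      ~ "RET"
  )) %>%
  ungroup()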
The comp3 function is really just a ternary-result comparison function: instead of something like (vec > val), which returns just 0 (false) and 1 (true), this gives a third result when the two are equal. For example,
comp3(1:5, 4)
# [1] -1 -1 -1 0 1
The extension to that is the out= argument, which allows the user to specify what the three values should be instead of -1:1. (If you want to shorten the dplyr code, feel free to hard-code the default value of out= to be your string vector.)
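For example, a hard-coded variant of the helper might look like this (just a sketch of that suggestion, not part of the original code):
comp3 <- function(vec, val, out = c("OUT", "MAX", "RET")) {
  ifelse(abs(vec - val) < 1e-9, out[2], ifelse(vec < val, out[1], out[3]))
}
# the mutate() call then shortens to:
# mutate(Direction = comp3(row_number(), which.max(Dist_to_Colony)))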
Another note: the use of abs(vec - val) < 1e-9 is another step towards generalizing it: if given floating-point (numeric) values, we might be subject to problems with strict floating-point equality for high-precision numbers (c.f. Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). In this case it's a little overkill, but it will not return a different value. (And since you talk of a table with 4000 or so locations, the "overhead" of doing this one extra step will likely not be human-apparent.)
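To illustrate the floating-point issue, a quick demonstration (not from the original answer):
0.1 + 0.2 == 0.3                # FALSE: strict equality trips over floating-point representation error
abs((0.1 + 0.2) - 0.3) < 1e-9   # TRUE: the tolerance-based comparison used in comp3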