I have a data frame with dates and magnitudes. For every case where the dates are within 0.6 years from each other, I want to keep the date with the highest absolute magnitude and discard the other.
- This includes cases where multiple dates are all within 0.6 years from each other. Like
c(2014.2, 2014.4, 2014.5)
which should give `c(2014.4) if that year had the highest absolute magnitude. - For cases where multiple years could be chained using this criterion (like
c(2016.3, 2016.7, 2017.2)
, where 2016.3 and 2017.2 are not within 0.6 years from each other), I want to treat the dates that are closest to one another as a pair and consider the extra date in the criterion as a next candidate for another pair, (so the output will read like thisc(2016.3,
2016.7,2017.2)
if 2016.3 had the highest absolute magnitude).
data:
set.seed(1)
library(dplyr)
mydf <- data.frame(date = c(2014.25, 2014.41, 2014.53, 2016.3,
2016.7,2017.2,2018.5, 2019.35, 2019.8),
magnitude = round(rnorm(9, mean=0, sd=0.4),4))
mydf <- mydf %>% mutate(absmag = abs(magnitude))
mydf
> mydf
date magnitude absmag
1 2014.25 -0.1222 0.1222
2 2014.41 0.6047 0.6047
3 2014.53 0.1559 0.1559
4 2016.30 -0.2485 0.2485
5 2016.70 -0.8859 0.8859
6 2017.20 0.4500 0.4500
7 2018.50 -0.0180 0.0180
8 2019.35 -0.0065 0.0065
9 2019.80 0.3775 0.3775
Desired output:
date magnitude absmag
1 2014.41 0.6047 0.6047
2 2016.70 -0.8859 0.8859
3 2017.20 0.4500 0.4500
4 2018.50 -0.0180 0.0180
5 2019.80 0.3775 0.3775
The things I tried so far failed to incoropate the requirements in the bullet points.
This solution can only handle pairs of two items within 0.6 years from each other:
whichAreClose <- function(your.number, x, threshold = 0.6){
x[which(abs(x - your.number) != 0 & abs(x - your.number) < thresh)]}
out1 <- sapply(mydf$date,
FUN = whichAreClose,
x = mydf$date) %>%
unlist() %>%
split(., cut(seq_along(.), 2, labels = FALSE)) %>%
lapply(
., function(i){
mydf %>%
filter(date %in% i) %>%
slice_min(absmag)}) %>%
bind_rows(.) %>%
anti_join(mydf, .)
> out1
date magnitude absmag
1 2014.41 0.6047 0.6047
2 2014.53 0.1559 0.1559
3 2016.30 -0.2485 0.2485
4 2016.70 -0.8859 0.8859
5 2017.20 0.4500 0.4500
6 2018.50 -0.0180 0.0180
7 2019.80 0.3775 0.3775
and this solution cannot distinguish different pairs at all:
out2 <- mydf %>%
mutate(prevdist = abs(date - lag(date)),
nextdist = abs(date - lead(date)),
ispair = case_when(prevdist < 0.6 ~ 'yes',
nextdist < 0.6 ~ 'yes',
TRUE ~ 'no')) %>%
filter(ispair == 'yes') %>%
slice_min(absmag) %>%
anti_join(mydf, .)
> out2
date magnitude absmag
1 2014.25 -0.5883 0.5883
2 2014.41 -0.1913 0.1913
3 2014.53 0.1672 0.1672
4 2016.30 0.5435 0.5435
5 2017.20 0.1551 0.1551
6 2018.50 -0.0215 0.0215
7 2019.35 -0.5508 0.5508
8 2019.80 -0.1660 0.1660
P.S.: feel free to edit the title. I struggled to come up with a good one myself.
CodePudding user response:
You can use hclust
with the complete
method.
mydf |>
mutate(k = {
k <- dist(date, method = "manhattan") |>
hclust(method = "complete") |>
cutree(h = .6)
})
#> date magnitude absmag k
#>1 2014.25 -0.2506 0.2506 1
#>2 2014.41 0.0735 0.0735 1
#>3 2014.53 -0.3343 0.3343 1
#>4 2016.30 0.6381 0.6381 2
#>5 2016.70 0.1318 0.1318 2
#>6 2017.20 -0.3282 0.3282 3
#>7 2018.50 0.1950 0.1950 4
#>8 2019.35 0.2953 0.2953 5
#>9 2019.80 0.2303 0.2303 5
mydf |>
mutate(k = {
k <- dist(date, method = "manhattan") |>
hclust(method = "complete") |>
cutree(h = .6)
}) |>
group_by(k) |>
filter(absmag == max(absmag)) |>
as.data.frame()
#> date magnitude absmag k
#> 1 2014.53 -0.3343 0.3343 1
#> 2 2016.30 0.6381 0.6381 2
#> 3 2017.20 -0.3282 0.3282 3
#> 4 2018.50 0.1950 0.1950 4
#> 5 2019.35 0.2953 0.2953 5
CodePudding user response:
With cumsum
and purrr::accumulate
library(tidyverse)
mydf %>%
group_by(cum = cumsum(accumulate(c(0, diff(date)), ~ifelse(.x .y <= 0.6, .x .y, 0)) == 0)) %>%
slice(1)
date magnitude absmag cum
1 2014.25 -0.2506 0.2506 1
2 2016.30 0.6381 0.6381 2
3 2017.20 -0.3282 0.3282 3
4 2018.50 0.1950 0.1950 4
5 2019.35 0.2953 0.2953 5