R keep rows with maximum value of one column when multiple rows have values close to each other in a-CodePudding

I have a data frame with dates and magnitudes. For every case where the dates are within 0.6 years from each other, I want to keep the date with the highest absolute magnitude and discard the other.

This includes cases where multiple dates are all within 0.6 years from each other. Like c(2014.2, 2014.4, 2014.5) which should give `c(2014.4) if that year had the highest absolute magnitude.
For cases where multiple years could be chained using this criterion (like c(2016.3, 2016.7, 2017.2), where 2016.3 and 2017.2 are not within 0.6 years from each other), I want to treat the dates that are closest to one another as a pair and consider the extra date in the criterion as a next candidate for another pair, (so the output will read like this c(2016.3, ~~2016.7~~, 2017.2) if 2016.3 had the highest absolute magnitude).

data:

set.seed(1)
library(dplyr)
mydf <- data.frame(date = c(2014.25, 2014.41, 2014.53, 2016.3, 
                            2016.7,2017.2,2018.5, 2019.35, 2019.8),
                   magnitude = round(rnorm(9, mean=0, sd=0.4),4))
mydf <- mydf %>% mutate(absmag = abs(magnitude))
mydf
> mydf
     date magnitude absmag
1 2014.25   -0.1222 0.1222
2 2014.41    0.6047 0.6047
3 2014.53    0.1559 0.1559
4 2016.30   -0.2485 0.2485
5 2016.70   -0.8859 0.8859
6 2017.20    0.4500 0.4500
7 2018.50   -0.0180 0.0180
8 2019.35   -0.0065 0.0065
9 2019.80    0.3775 0.3775

Desired output:

     date magnitude absmag
1 2014.41    0.6047 0.6047
2 2016.70   -0.8859 0.8859
3 2017.20    0.4500 0.4500
4 2018.50   -0.0180 0.0180
5 2019.80    0.3775 0.3775

The things I tried so far failed to incoropate the requirements in the bullet points.

This solution can only handle pairs of two items within 0.6 years from each other:

whichAreClose <- function(your.number, x, threshold = 0.6){
  x[which(abs(x - your.number) != 0 & abs(x - your.number) < thresh)]}
out1 <- sapply(mydf$date, 
                FUN = whichAreClose, 
                x = mydf$date) %>% 
  unlist() %>% 
  split(., cut(seq_along(.), 2, labels = FALSE)) %>% 
  lapply(
    ., function(i){
      mydf %>% 
        filter(date %in% i) %>% 
        slice_min(absmag)}) %>% 
  bind_rows(.) %>% 
  anti_join(mydf, .)

> out1
     date magnitude absmag
1 2014.41    0.6047 0.6047
2 2014.53    0.1559 0.1559
3 2016.30   -0.2485 0.2485
4 2016.70   -0.8859 0.8859
5 2017.20    0.4500 0.4500
6 2018.50   -0.0180 0.0180
7 2019.80    0.3775 0.3775

and this solution cannot distinguish different pairs at all:

out2 <- mydf %>% 
  mutate(prevdist = abs(date - lag(date)),
         nextdist = abs(date - lead(date)),
         ispair = case_when(prevdist < 0.6 ~ 'yes',
                            nextdist < 0.6 ~ 'yes',
                            TRUE ~ 'no')) %>% 
  filter(ispair == 'yes') %>% 
  slice_min(absmag) %>% 
  anti_join(mydf, .)

> out2
     date magnitude absmag
1 2014.25   -0.5883 0.5883
2 2014.41   -0.1913 0.1913
3 2014.53    0.1672 0.1672
4 2016.30    0.5435 0.5435
5 2017.20    0.1551 0.1551
6 2018.50   -0.0215 0.0215
7 2019.35   -0.5508 0.5508
8 2019.80   -0.1660 0.1660

P.S.: feel free to edit the title. I struggled to come up with a good one myself.

CodePudding user response：

You can use hclust with the complete method.

mydf |>
  mutate(k = {
    k <- dist(date, method = "manhattan") |>
      hclust(method = "complete") |>
      cutree(h = .6)
  })

#>     date magnitude absmag k
#>1 2014.25   -0.2506 0.2506 1
#>2 2014.41    0.0735 0.0735 1
#>3 2014.53   -0.3343 0.3343 1
#>4 2016.30    0.6381 0.6381 2
#>5 2016.70    0.1318 0.1318 2
#>6 2017.20   -0.3282 0.3282 3
#>7 2018.50    0.1950 0.1950 4
#>8 2019.35    0.2953 0.2953 5
#>9 2019.80    0.2303 0.2303 5

mydf |>
  mutate(k = {
    k <- dist(date, method = "manhattan") |>
      hclust(method = "complete") |>
      cutree(h = .6)
  }) |>
  group_by(k) |>
  filter(absmag == max(absmag)) |>
  as.data.frame()

#>      date magnitude absmag k
#> 1 2014.53   -0.3343 0.3343 1
#> 2 2016.30    0.6381 0.6381 2
#> 3 2017.20   -0.3282 0.3282 3
#> 4 2018.50    0.1950 0.1950 4
#> 5 2019.35    0.2953 0.2953 5

CodePudding user response：

With cumsum and purrr::accumulate

library(tidyverse)

mydf %>% 
  group_by(cum = cumsum(accumulate(c(0, diff(date)), ~ifelse(.x   .y <= 0.6, .x   .y, 0)) == 0)) %>% 
  slice(1)

     date magnitude absmag    cum
1 2014.25   -0.2506 0.2506      1
2 2016.30    0.6381 0.6381      2
3 2017.20   -0.3282 0.3282      3
4 2018.50    0.1950 0.1950      4
5 2019.35    0.2953 0.2953      5