Home > Net >  R keep rows with maximum value of one column when multiple rows have values close to each other in a
R keep rows with maximum value of one column when multiple rows have values close to each other in a

Time:03-15

I have a data frame with dates and magnitudes. For every case where the dates are within 0.6 years from each other, I want to keep the date with the highest absolute magnitude and discard the other.

  • This includes cases where multiple dates are all within 0.6 years from each other. Like c(2014.2, 2014.4, 2014.5) which should give `c(2014.4) if that year had the highest absolute magnitude.
  • For cases where multiple years could be chained using this criterion (like c(2016.3, 2016.7, 2017.2), where 2016.3 and 2017.2 are not within 0.6 years from each other), I want to treat the dates that are closest to one another as a pair and consider the extra date in the criterion as a next candidate for another pair, (so the output will read like this c(2016.3, 2016.7, 2017.2) if 2016.3 had the highest absolute magnitude).

data:

set.seed(1)
library(dplyr)
mydf <- data.frame(date = c(2014.25, 2014.41, 2014.53, 2016.3, 
                            2016.7,2017.2,2018.5, 2019.35, 2019.8),
                   magnitude = round(rnorm(9, mean=0, sd=0.4),4))
mydf <- mydf %>% mutate(absmag = abs(magnitude))
mydf
> mydf
     date magnitude absmag
1 2014.25   -0.1222 0.1222
2 2014.41    0.6047 0.6047
3 2014.53    0.1559 0.1559
4 2016.30   -0.2485 0.2485
5 2016.70   -0.8859 0.8859
6 2017.20    0.4500 0.4500
7 2018.50   -0.0180 0.0180
8 2019.35   -0.0065 0.0065
9 2019.80    0.3775 0.3775

Desired output:

     date magnitude absmag
1 2014.41    0.6047 0.6047
2 2016.70   -0.8859 0.8859
3 2017.20    0.4500 0.4500
4 2018.50   -0.0180 0.0180
5 2019.80    0.3775 0.3775

The things I tried so far failed to incoropate the requirements in the bullet points.

This solution can only handle pairs of two items within 0.6 years from each other:

whichAreClose <- function(your.number, x, threshold = 0.6){
  x[which(abs(x - your.number) != 0 & abs(x - your.number) < thresh)]}
out1 <- sapply(mydf$date, 
                FUN = whichAreClose, 
                x = mydf$date) %>% 
  unlist() %>% 
  split(., cut(seq_along(.), 2, labels = FALSE)) %>% 
  lapply(
    ., function(i){
      mydf %>% 
        filter(date %in% i) %>% 
        slice_min(absmag)}) %>% 
  bind_rows(.) %>% 
  anti_join(mydf, .)

> out1
     date magnitude absmag
1 2014.41    0.6047 0.6047
2 2014.53    0.1559 0.1559
3 2016.30   -0.2485 0.2485
4 2016.70   -0.8859 0.8859
5 2017.20    0.4500 0.4500
6 2018.50   -0.0180 0.0180
7 2019.80    0.3775 0.3775

and this solution cannot distinguish different pairs at all:

out2 <- mydf %>% 
  mutate(prevdist = abs(date - lag(date)),
         nextdist = abs(date - lead(date)),
         ispair = case_when(prevdist < 0.6 ~ 'yes',
                            nextdist < 0.6 ~ 'yes',
                            TRUE ~ 'no')) %>% 
  filter(ispair == 'yes') %>% 
  slice_min(absmag) %>% 
  anti_join(mydf, .)

> out2
     date magnitude absmag
1 2014.25   -0.5883 0.5883
2 2014.41   -0.1913 0.1913
3 2014.53    0.1672 0.1672
4 2016.30    0.5435 0.5435
5 2017.20    0.1551 0.1551
6 2018.50   -0.0215 0.0215
7 2019.35   -0.5508 0.5508
8 2019.80   -0.1660 0.1660

P.S.: feel free to edit the title. I struggled to come up with a good one myself.

CodePudding user response:

You can use hclust with the complete method.

mydf |>
  mutate(k = {
    k <- dist(date, method = "manhattan") |>
      hclust(method = "complete") |>
      cutree(h = .6)
  })

#>     date magnitude absmag k
#>1 2014.25   -0.2506 0.2506 1
#>2 2014.41    0.0735 0.0735 1
#>3 2014.53   -0.3343 0.3343 1
#>4 2016.30    0.6381 0.6381 2
#>5 2016.70    0.1318 0.1318 2
#>6 2017.20   -0.3282 0.3282 3
#>7 2018.50    0.1950 0.1950 4
#>8 2019.35    0.2953 0.2953 5
#>9 2019.80    0.2303 0.2303 5
mydf |>
  mutate(k = {
    k <- dist(date, method = "manhattan") |>
      hclust(method = "complete") |>
      cutree(h = .6)
  }) |>
  group_by(k) |>
  filter(absmag == max(absmag)) |>
  as.data.frame()

#>      date magnitude absmag k
#> 1 2014.53   -0.3343 0.3343 1
#> 2 2016.30    0.6381 0.6381 2
#> 3 2017.20   -0.3282 0.3282 3
#> 4 2018.50    0.1950 0.1950 4
#> 5 2019.35    0.2953 0.2953 5

CodePudding user response:

With cumsum and purrr::accumulate

library(tidyverse)

mydf %>% 
  group_by(cum = cumsum(accumulate(c(0, diff(date)), ~ifelse(.x   .y <= 0.6, .x   .y, 0)) == 0)) %>% 
  slice(1)

     date magnitude absmag    cum
1 2014.25   -0.2506 0.2506      1
2 2016.30    0.6381 0.6381      2
3 2017.20   -0.3282 0.3282      3
4 2018.50    0.1950 0.1950      4
5 2019.35    0.2953 0.2953      5
  •  Tags:  
  • r
  • Related