Home > Software engineering >  How to calculate difference of measures between groups considering the date
How to calculate difference of measures between groups considering the date

Time:12-02

I'm struggling on how can I calculate the difference between the first and last value, arranged by date, by groups. Here is a toy example:

  test1 = data.frame(my_groups = c("A", "A", "A", "B", "B", "B", "C", "C", "C",  "A", "A", "A"),
                     measure = c(10, 20, 5, 64, 2, 62 ,2, 5, 4, 6, 7, 105),
                     #distance = c(),
                     time= as.Date(c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-05-2020", "20-06-2021", 
                             "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "19-01-2021"), format = "%d-%m-%Y"))
  
  # test1 %>% arrange(my_groups, time)
  #    my_groups measure       time
  # 1          A       5 2020-09-19
  # 2          A      10 2020-09-20
  # 3          A      20 2020-09-25
  # 4          A       6 2021-01-15
  # 5          A       7 2021-01-15
  # 6          A     105 2021-01-19
  # 7          B      64 2020-05-20
  # 8          B       2 2020-05-20
  # 9          B      62 2021-06-20
  # 10         C       2 2021-01-11
  # 11         C       5 2021-01-13
  # 12         C       1 2021-01-13

#desired result
  #    my_groups    diff            
  # 1          A     100 (105 - 5)
  # 2          B       2 (64 - 62)
  # 3          C       1 (1 - 2)

The equation inside the brackets in desired result is just to show where the diff came from.

Any hint on how can I do that?

CodePudding user response:

Your sample data in data.frame does not match your console output, so results will be different.

Two methods, depending on a few factors.

  1. Assuming that order is externally controlled,

    test1 %>%
      group_by(my_groups) %>%
      slice(c(1, n())) %>%
      summarize(diff = diff(measure))
    # # A tibble: 3 x 2
    #   my_groups  diff
    #   <chr>     <dbl>
    # 1 A            95
    # 2 B            -2
    # 3 C             2
    

    or just

    test1 %>%
      group_by(my_groups) %>%
      summarize(diff = measure[n()] - measure[1])
    

    The advantage of this is that it counters an issue with approach 2 below (ties in which.max): if you control the ordering yourself, you are guaranteed to use the first/last values you need.

    NOTE that for this portion, I assume that the order of data you gave us in your sample data is relevant. I'm assuming that there is some way to guarantee that your results are found. With your latest comment, we can arrange before the summarization and get closer to your desired results with

    test1 %>%
      arrange(time, -measure) %>%      # this is the "external" sorting I mentioned, so we don't need which.min/.max
      group_by(my_groups) %>%
      summarize(diff = measure[n()] - measure[1])
    # # A tibble: 3 x 2
    #   my_groups  diff
    #   <chr>     <dbl>
    # 1 A           100
    # 2 B            -2
    # 3 C             2
    
  2. Without pre-sorting, we can use which.min and which.max. The problem with this is that when ties occur, it may not choose the one that you want.

    test1 %>%
      group_by(my_groups) %>%
      summarize(diff = measure[which.max(time)] - measure[which.min(time)])
    # # A tibble: 3 x 2
    #   my_groups  diff
    #   <chr>     <dbl>
    # 1 A           100
    # 2 B            -2
    # 3 C             3
    

CodePudding user response:

test1 %>%
    dplyr::group_by(my_groups) %>%
    dplyr::mutate(
        first = min(time), last = max(time),
    ) %>% 
    dplyr::select(-time, -measure) %>%
    dplyr::distinct() %>%
    dplyr::mutate(diff = first - last) %>%
    dplyr::select(-first, -last)
  • Related