How to summarise grouped value increases

Time:11-21

I have this type of data:

df <- data.frame(
  Utt = c(rep("oh", 10), rep("ah", 10)),
  name = rep(LETTERS[1:2], 10),
  value = c(0.5,2,2,2,2,1,0,1,3.5,1,
            2.2,2.3,1.9,0.1,0.3,1.8,3,4,3.5,2)
)

I need to know whether, within each group of Utt and name, there are runs of consecutive value increases, and how large each of these increases is in total.

EDIT: I've cobbled together this code, which produces the right result but seems convoluted:

library(data.table)  # rleid()
library(dplyr)
library(stringr)     # str_c()

df %>%
  # order by name:
  arrange(name) %>%
  group_by(name, Utt) %>%
  mutate(
    # keep the value only where it is larger than the previous one:
    is_increase = ifelse(lag(value) < value, value, NA),
    # size of that increase:
    diff = is_increase - lag(value)) %>%
  # give each run of contiguous increases its own group:
  group_by(name, Utt, grp = rleid(!is.na(diff))) %>%
  # sum the contiguous values:
  summarise(increase_size = sum(diff, na.rm = TRUE)) %>%
  # remove 0 values:
  filter(increase_size != 0) %>%
  # put same-group increase_sizes in the same row:
  summarise(increase_size = str_c(increase_size, collapse = ', '))
# A tibble: 3 x 3
# Groups:   name [2]
  name  Utt   increase_size
  <chr> <chr> <chr>        
1 A     ah    3.2          
2 A     oh    1.5, 3.5     
3 B     ah    3.9 

NOTE: Ideally, the expected outcome would be:

1 A     ah    3.2          
2 A     oh    1.5, 3.5     
3 B     ah    3.9 
4 B     oh     NA 

Is there a better (i.e., more concise, more clever) dplyr solution?

CodePudding user response:

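This function returns the total size of each run of consecutive increases in a vector, or NA if the vector never increases. It relies on lag() and lead() from dplyr.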

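f <- function(x) {
    # positions whose value is larger than the previous one (dplyr::lag)
    ind <- which(x > lag(x))
    if (length(ind) == 0) {
        return(NA_character_)
    }
    # last position of each run of consecutive increases
    ind2 <- ind[which(lead(ind, default = max(ind) + 2) - ind > 1)]
    # position just before each run starts
    ind1 <- ind[which(ind - lag(ind, default = min(ind) - 2) > 1)] - 1
    return(paste0(x[ind2] - x[ind1], collapse = ", "))
}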

And use the function in summarise:

df %>% group_by(name, Utt) %>% summarise(increase = f(value))
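
The run logic can be cross-checked on a single group's values without dplyr. The following is only a minimal base R sketch, not the code from this answer: the helper name `rising_runs` is hypothetical, and it assumes the vector contains no missing values.

```r
# Hypothetical helper: total size of each run of consecutive increases in x.
# Assumes x has no NA values.
rising_runs <- function(x) {
  d <- diff(x)                            # step-to-step changes
  if (!any(d > 0)) return(NA_character_)  # the vector never increases
  runs <- rle(d > 0)                      # runs of rising / non-rising steps
  run_id <- rep(seq_along(runs$lengths), runs$lengths)
  totals <- vapply(split(d, run_id), sum, numeric(1))
  toString(unname(totals[runs$values]))   # keep only the rising runs
}

rising_runs(c(0.5, 2, 2, 0, 3.5))  # values for name A, Utt "oh": "1.5, 3.5"
rising_runs(c(2, 2, 1, 1, 1))      # values for name B, Utt "oh": NA
```

The two calls use the values of the A/oh and B/oh groups from the question, so the results can be compared with the expected output there.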

CodePudding user response:

Using tidyverse, my solution was similar to yours. One possible modification is to subset the positive differences before summing, instead of filtering rows afterwards. This keeps all combinations of name and Utt and allows an NA for increase_size at the end. Since the column increase_size is of character type, you can convert an empty string to NA.

library(data.table)
library(tidyverse)

df %>%
  arrange(name) %>%
  group_by(name, Utt) %>%
  mutate(diff = c(0, diff(value))) %>%
  group_by(grp = rleid(diff < 0), .add = TRUE) %>%
  summarise(increase_size = sum(diff[diff > 0], na.rm = TRUE)) %>%
  group_by(name, Utt) %>%
  summarise(increase_size = toString(increase_size[increase_size > 0])) %>%
  mutate(increase_size = na_if(increase_size, ""))

Output

  name  Utt   increase_size
  <chr> <chr> <chr>        
1 A     ah    3.2          
2 A     oh    1.5, 3.5     
3 B     ah    3.9          
4 B     oh    NA  
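
The final conversion to NA matters because collapsing an empty selection yields an empty string rather than a missing value. A minimal base R sketch of that behaviour follows. The helper `collapse_or_na` is hypothetical and only imitates the toString() plus na_if() step for a single group.

```r
# Hypothetical helper: collapse the positive totals of one group,
# or return NA when there are none.
collapse_or_na <- function(totals) {
  s <- toString(totals[totals > 0])  # "" when nothing is positive
  if (identical(s, "")) NA_character_ else s
}

toString(numeric(0))            # ""
collapse_or_na(c(0, 0))         # NA, as for name B, Utt "oh"
collapse_or_na(c(1.5, 0, 3.5))  # "1.5, 3.5"
```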