Home > database >  R: How Do I Find the Second Ranked Value in A Summarized Dataframe
R: How Do I Find the Second Ranked Value in A Summarized Dataframe

Time:11-16

Various R functions make it easy to use group_by and summarize to pull a value from a grouped variable. So in the resulting dataframe, I can use group_by and summarise to create, for example, a new column that contains the maximum or minimum value of a variable within each group. Meaning, with this data:

name, value
foo, 100
foo, 200
foo, 300
bar, 400
bar, 500
bar, 600

I can easily get the max or min for each value of name:

group_by(name) %>% summarize(maxValue = max(value)

But suppose I want the second ranked value for each name? Meaning suppose I want my result to be

name maxValue midValue
foo 300 200
bar 600 500

In other words, how do I fill in the blank in this:

df %>% group_by(name) %>% 
summarize(maxValue = max(value), 
  secondValue = _________)

Thanks, from an r newbie, for any help!

CodePudding user response:

library(dplyr)

df %>% 
  group_by(name) %>% 
  arrange(value) %>% 
  summarise(maxValue = max(value), 
            midValue = value[2])

Result:

# A tibble: 2 × 3
  name  maxValue midValue
  <chr>    <int>    <int>
1 Bar        600      500
2 Foo        300      200

CodePudding user response:

This should do:

df %>% group_by(name) %>% arrange(desc(value)) %>% slice(2)

Code:

a = 'name value
Foo 100
Foo 200
Foo 300
Bar 400
Bar 500
Bar 600'


df = read.table(text = a, header = T)
df %>% group_by(name) %>% arrange(desc(value)) %>% slice(2)

Output:

# A tibble: 2 × 2
# Groups:   name [2]
  name  value
  <fct> <int>
1 Bar     500
2 Foo     200

CodePudding user response:

library(dplyr)

df <- data.frame(
  name = c("Foo", "Foo", "Foo", "Bar", "Bar", "Bar"),
  value = c(100, 200, 300, 400, 500, 600)
)

df %>% 
  group_by(name) %>% 
  summarize(secondValue = sort(value, decreasing = TRUE)[2])

CodePudding user response:

All we need to do is pull out the 2 maximum values, and select the second largest one.

Note that the code below will handle ties (not important in your example but could be important with other data).

df %>% 
  group_by(name) %>% 
  summarise(maxvalue = sortN(value, 1),
            midvalue = sortN(value, 2)[2])
# A tibble: 2 × 3
  name  maxvalue midvalue
  <chr>    <int>    <int>
1 Bar        600      500
2 Foo        300      200

The function sortN(x, n, type = "max") is defined below. It pulls out the n max/min values of x. This is heavily based on the post in my comment. It is not necessary to define a whole function for this problem (as the other answers show), but I find this function useful for a range of problems, so it's nice to have.

sortN <- function(x, n, type = "max") {

  # GR 1 - If type is not "max" or "min", error
  if (! type %in% c("max", "min")) {
    stop("type must be max or min.")
  }

  # GR 2 - If n >= length(unique(x)), return whole vector
  if (n >= length(unique(x))){
    return(unique(x))
  }

  # Change based on whether the user wants min or max
  type <- switch(type, min = FALSE, max = TRUE)

  if (type) {
  x <- unique(x)
  partial <- length(x) - n   1
  out <- x[x >= sort(x, partial = partial)[partial]]
  sort(out, decreasing = TRUE)
  } else {
  out <- -sortN(x = -x, n = n, type = "max")
  sort(out, decreasing = FALSE)
  }
}
  •  Tags:  
  • r
  • Related