Function that obtains the nth largest group (or element) of a dataframe

Time:12-31

I'm trying to write a function that returns the nth largest group (or a single element, if only one exists) of a data frame. It sorts the unique values of a column and then uses the nth value as a row filter on the data frame. I'm new to R and in a data science postgraduate program. My function doesn't work: it returns only the column names. The hard-coded version, however, does work. Both results are posted below.

For background, I'm using a car database. I want to return not just the nth largest value but all rows of the data frame whose price equals the nth largest price in the price column. The hard-coded version works exactly as I want; the function does not.

nth_largest_group <- function(data_frame, column, n) {
  target_col <- data_frame$column
  uni_sorted <- unique(sort(target_col, T))[n]
  nth_max <- data_frame[target_col == uni_sorted,]
  return(nth_max)
}

# [1] price        brand        model        year         title_status mileage      color        vin         
# [9] lot          state        country      condition   
# <0 rows> (or 0-length row.names)

target_col <- US_Car_df$price
uni_sorted <- unique(sort(target_col, T))[1]   
nth_max <- US_Car_df[target_col == uni_sorted, ]

print(nth_max)

# price         brand    model year  title_status mileage  color                 vin       lot   state country
# 503 84900 mercedes-benz sl-class 2017 clean vehicle   25302 silver   wddjk7ea3hf044968 167607883 florida     usa
# condition
# 503 2 days left

Data

# dput(head(US_Car_df))
US_Car_df <- structure(list(
  price = c(6300L, 2899L, 5350L, 25000L, 27700L, 5700L),
  brand = c("toyota", "ford", "dodge", "ford", "chevrolet", "dodge"),
  model = c("cruiser", "se", "mpv", "door", "1500", "mpv"),
  year = c(2008L, 2011L, 2018L, 2014L, 2018L, 2018L),
  title_status = c("clean vehicle", "clean vehicle", "clean vehicle",
                   "clean vehicle", "clean vehicle", "clean vehicle"),
  mileage = c(274117L, 190552L, 39590L, 64146L, 6654L, 45561L),
  color = c("black", "silver", "silver", "blue", "red", "white"),
  vin = c("  jtezu11f88k007763", "  2fmdk3gc4bbb02217", "  3c4pdcgg5jt346413",
          "  1ftfw1et4efc23745", "  3gcpcrec2jg473991", "  2c4rdgeg9jr237989"),
  lot = c(159348797L, 166951262L, 167655728L, 167753855L, 167763266L, 167655771L),
  state = c("new jersey", "tennessee", "georgia", "virginia", "florida", "texas"),
  country = c(" usa", " usa", " usa", " usa", " usa", " usa"),
  condition = c("10 days left", "6 days left", "2 days left", "22 hours left",
                "22 hours left", "2 days left")
), row.names = c(NA, 6L), class = "data.frame")

CodePudding user response:

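Inside functions, use data[[i]] rather than data$i. The $ operator does not evaluate its argument, so data_frame$column looks for a column literally named "column" and returns NULL. With that one change, your function works: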

nth_largest_group <- function(data_frame, column, n) {
  target_col <- data_frame[[column]]
  uni_sorted <- unique(sort(target_col, T))[n]   
  nth_max <- data_frame[target_col == uni_sorted, ]
  return(nth_max)
}

nth_largest_group(US_Car_df, 'price', 1)
# price     brand model year  title_status mileage
# 5 27700 chevrolet  1500 2018 clean vehicle    6654
# color                 vin       lot   state country
# 5   red   3gcpcrec2jg473991 167763266 florida     usa
# condition
# 5 22 hours left

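However, if you only need the rows for the single largest value (n = 1), which.max is designed for this and is much simpler. Note that it returns only the first row at the maximum, so tied rows are not included.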

Since

US_Car_df[which.max(US_Car_df$price), ]

already gives your "hard code" result for n = 1, your function reduces to:

nth_largest_group1 <- function(data_frame, column) {
  return(data_frame[which.max(data_frame[[column]]), ])
}

nth_largest_group1(US_Car_df, 'price')
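
To illustrate the difference in how ties are handled, here is a minimal sketch using the built-in mtcars data (the same dataset used in the other answer). The variable names top_first and top_all are made up for the demo.

```r
# which.max returns the index of the FIRST maximum only
top_first <- mtcars[which.max(mtcars$cyl), ]

# Comparing against max() keeps every row tied at the maximum
top_all <- mtcars[mtcars$cyl == max(mtcars$cyl), ]

nrow(top_first)  # 1
nrow(top_all)    # 14
```

So if several cars share the top price and you want all of them, the == comparison from the original function is the safer choice.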

Note: I recommend working through some of the basic R functions to get an overview. The short R reference card is a useful starting point.

CodePudding user response:

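Your code doesn't work because, inside the function, data_frame$column does not use the value of column. The $ operator takes the name literally, so R looks for a column called "column" instead of price, finds none, and returns NULL.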
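
A small sketch of what happens, using a made-up three-row data frame (df and its values are purely illustrative, not from the car data):

```r
# Small illustrative data frame (made up for this demo)
df <- data.frame(price = c(10, 30, 20))
column <- "price"

df$column      # NULL: `$` looks for a column literally named "column"
df[[column]]   # 10 30 20: `[[` evaluates `column` to "price" first

# Comparing NULL with a value gives logical(0), which selects zero rows.
# That is why the original function printed only the column names.
nrow(df[df$column == 30, , drop = FALSE])  # 0
```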

You've got two ways to fix this:

  1. Subset the data frame with a string: data_frame[[column]]. The function must then be called with "price" (as a string).
  2. Subset the data frame with a dplyr verb: dplyr::pull(data_frame, {{column}}). The column name is then passed unquoted.

Examples:


nth_largest_group <- function(data_frame, column, n) {
  
  target_col <- data_frame[[column]]
  uni_sorted <- unique(sort(target_col, TRUE))[[n]]
  
  data_frame[target_col == uni_sorted, ]
  
}


> nth_largest_group(mtcars, "cyl", 2)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

nth_largest_group <- function(data_frame, column, n) {

  # {{...}} is telling R to interpret column as a variable in data_frame
  target_col <- dplyr::pull(data_frame, {{column}})
  uni_sorted <- unique(sort(target_col, TRUE))[[n]]
  
  data_frame[target_col == uni_sorted, ]
  
}


> nth_largest_group(mtcars, cyl, 2)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

Note the difference in how the functions are called. cyl is quoted in the first example, but not the second.
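
One thing neither version handles is an n larger than the number of distinct values, or NA entries in the column. With [[n]] an out-of-range n raises a subscript error, and NA comparisons can produce rows filled with NA. Below is a hedged sketch of a more defensive variant of the string-based version. The name nth_largest_group_safe and the choice to return an empty data frame for an out-of-range n are my own assumptions, not part of the original question.

```r
# Hypothetical defensive variant; name and edge-case behavior are assumptions
nth_largest_group_safe <- function(data_frame, column, n) {
  target_col <- data_frame[[column]]
  if (is.null(target_col)) {
    stop("column not found: ", column)
  }

  # sort() drops NA by default, so only real values are ranked
  uni_sorted <- sort(unique(target_col), decreasing = TRUE)

  # Out-of-range n: return zero rows instead of an error or NA rows
  if (n < 1 || n > length(uni_sorted)) {
    return(data_frame[0, , drop = FALSE])
  }

  # which() discards NA comparisons, so no all-NA rows are returned
  data_frame[which(target_col == uni_sorted[n]), , drop = FALSE]
}

nrow(nth_largest_group_safe(mtcars, "cyl", 2))  # 7
nrow(nth_largest_group_safe(mtcars, "cyl", 4))  # 0 (only 3 distinct values)
```

Whether an empty result or an error is the better response to a bad n depends on how the function is used, so treat that part as a design choice rather than a rule.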
