I'm trying to write a function that returns the nth largest group (or a single row if only one exists) of a data frame: it sorts the unique values of a column and then uses the nth value to select rows from the data frame. I'm new to R and I'm in a data science postgraduate program. My function doesn't work; it returns only the column names. However, the hard-coded version does work. I've posted my results below.
Just for background, I'm using a car database. I'm trying to return not just the nth largest value, but all rows of the data frame that correspond to the nth largest price in the price column. The hard-coded version works exactly how I want it to, but the function does not.
nth_largest_group <- function(data_frame, column, n) {
  target_col <- data_frame$column
  uni_sorted <- unique(sort(target_col, T))[n]
  nth_max <- data_frame[target_col == uni_sorted,]
  return(nth_max)
}
# [1] price brand model year title_status mileage color vin
# [9] lot state country condition
# <0 rows> (or 0-length row.names)
target_col <- US_Car_df$price
uni_sorted <- unique(sort(target_col, T))[1]
nth_max <- US_Car_df[target_col == uni_sorted, ]
print(nth_max)
# price brand model year title_status mileage color vin lot state country
# 503 84900 mercedes-benz sl-class 2017 clean vehicle 25302 silver wddjk7ea3hf044968 167607883 florida usa
# condition
# 503 2 days left
Data
# dput(head(US_Car_df))
US_Car_df <- structure(list(price = c(6300L, 2899L, 5350L, 25000L, 27700L,
5700L), brand = c("toyota", "ford", "dodge", "ford", "chevrolet",
"dodge"), model = c("cruiser", "se", "mpv", "door", "1500", "mpv"
), year = c(2008L, 2011L, 2018L, 2014L, 2018L, 2018L), title_status = c("clean vehicle",
"clean vehicle", "clean vehicle", "clean vehicle", "clean vehicle",
"clean vehicle"), mileage = c(274117L, 190552L, 39590L, 64146L,
6654L, 45561L), color = c("black", "silver", "silver", "blue",
"red", "white"), vin = c(" jtezu11f88k007763", " 2fmdk3gc4bbb02217",
" 3c4pdcgg5jt346413", " 1ftfw1et4efc23745", " 3gcpcrec2jg473991",
" 2c4rdgeg9jr237989"), lot = c(159348797L, 166951262L, 167655728L,
167753855L, 167763266L, 167655771L), state = c("new jersey",
"tennessee", "georgia", "virginia", "florida", "texas"), country = c(" usa",
" usa", " usa", " usa", " usa", " usa"), condition = c("10 days left",
"6 days left", "2 days left", "22 hours left", "22 hours left",
"2 days left")), row.names = c(NA, 6L), class = "data.frame")
CodePudding user response:
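In functions, it's better to use data[[i]] notation than data$i. The $ operator takes the name after it literally, so data_frame$column looks for a column called column and returns NULL. If you change that, your function works fine!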
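nth_largest_group <- function(data_frame, column, n) {
  # [[ ]] looks up the column named by the string stored in `column`
  target_col <- data_frame[[column]]
  # nth largest distinct value
  uni_sorted <- unique(sort(target_col, decreasing = TRUE))[n]
  nth_max <- data_frame[target_col == uni_sorted, ]
  return(nth_max)
}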
nth_largest_group(US_Car_df, 'price', 1)
# price brand model year title_status mileage
# 5 27700 chevrolet 1500 2018 clean vehicle 6654
# color vin lot state country
# 5 red 3gcpcrec2jg473991 167763266 florida usa
# condition
# 5 22 hours left
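To see why the original version returned zero rows, here is a minimal sketch on a toy data frame (the names toy and column are made up for illustration and are not part of the question's data). It assumes nothing beyond base R: $ looks for a column literally named column, while [[ ]] uses the string stored in the variable.

```r
# Toy data frame; `toy` and `column` are illustrative names only
toy <- data.frame(price = c(10, 30, 20))
column <- "price"

by_dollar  <- toy$column     # looks for a column literally named "column"
by_bracket <- toy[[column]]  # looks up the column named by the value of `column`

is.null(by_dollar)           # TRUE: there is no column called "column"
by_bracket                   # the price column

# Comparing NULL with a value gives logical(0), which selects zero rows
nrow(toy[by_dollar == 30, ])
```

That empty selection is why only the column names were printed.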
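However, if you only need the single row with the largest value, which.max is designed for this and makes things much simpler. Be aware that which.max returns only the index of the first maximum, so tied rows are dropped and there is no n argument.
Since
US_Car_df[which.max(US_Car_df$price), ]
already gives your "hard code" result (when the maximum is unique), the n = 1 case of your function reduces to: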
nth_largest_group1 <- function(data_frame, column) {
  return(data_frame[which.max(data_frame[[column]]), ])
}
nth_largest_group1(US_Car_df, 'price')
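As a hedged illustration of how which.max treats ties, the sketch below uses the built-in mtcars data set rather than the car data from the question; first_only and all_ties are names chosen here for the example. It contrasts which.max with a plain comparison against max().

```r
# mtcars ships with R; its cyl column has many rows tied at the maximum
first_only <- mtcars[which.max(mtcars$cyl), ]          # first maximum only
all_ties   <- mtcars[mtcars$cyl == max(mtcars$cyl), ]  # every row at the maximum

nrow(first_only)  # 1
nrow(all_ties)    # 14
```

If all rows sharing the largest value are wanted, the comparison form (or the nth_largest_group function) is the safer choice.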
Note: I would recommend working through some of the basic functions in R to get an overview. Here is the short and useful R reference card.
CodePudding user response:
Your code is not working because, inside the function, data_frame$column does not substitute the value of column. The $ operator looks for a column literally named column, so nothing is found even when column is set to price.
You've got two ways to fix this:
- Subset the data frame with a string: data_frame[[column]]. The function must then be called with "price" (as a string).
- Subset the data frame with a dplyr verb: dplyr::pull(data_frame, {{column}}).
Examples:
nth_largest_group <- function(data_frame, column, n) {
  target_col <- data_frame[[column]]
  uni_sorted <- unique(sort(target_col, TRUE))[[n]]
  data_frame[target_col == uni_sorted, ]
}
> nth_largest_group(mtcars, "cyl", 2)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
nth_largest_group <- function(data_frame, column, n) {
  # {{...}} is telling R to interpret column as a variable in data_frame
  target_col <- dplyr::pull(data_frame, {{column}})
  uni_sorted <- unique(sort(target_col, TRUE))[[n]]
  data_frame[target_col == uni_sorted, ]
}
> nth_largest_group(mtcars, cyl, 2)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Note the difference in how the functions are called: cyl is quoted in the first example, but not the second.
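One thing neither version handles is an n larger than the number of distinct values, or a misspelled column name. The following is only a sketch of how such checks might look in base R; nth_largest_group_checked is a hypothetical name, not something from the answers above.

```r
# Hypothetical variant with input checks; not part of the answers above
nth_largest_group_checked <- function(data_frame, column, n) {
  stopifnot(is.data.frame(data_frame), column %in% names(data_frame))
  # Distinct values, largest first (sort() drops NA by default)
  distinct_desc <- sort(unique(data_frame[[column]]), decreasing = TRUE)
  if (n < 1 || n > length(distinct_desc)) {
    stop("n must be between 1 and ", length(distinct_desc))
  }
  # which() skips NA comparisons, so rows with a missing value are not returned
  data_frame[which(data_frame[[column]] == distinct_desc[n]), ]
}

res <- nth_largest_group_checked(mtcars, "cyl", 2)  # the seven 6-cylinder cars
```

Calling it with n = 4 on mtcars$cyl, which has only three distinct values, stops with an error instead of returning a confusing result.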