Elegant way to correctly index a Dataframe after Subsetting

Time:02-08

I have the following dummy dataframe with two columns:

set.seed(666)
df = data.frame(group = rep(c(-1,1), times = 5, each = 1),
                values = runif(10))

   group     values
1     -1 0.77436849
2      1 0.19722419
3     -1 0.97801384
4      1 0.20132735
5     -1 0.36124443
6      1 0.74261194
7     -1 0.97872844
8      1 0.49811371
9     -1 0.01331584
10     1 0.25994613

I want to find the row index, in the original dataframe, of the maximum value within group 1. Applying which.max() after subsetting the dataframe returns the position within the subset rather than the original row:

which.max(df[df[, 1] == 1, 2]) # returns 3, the position within the subset
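
To make the mismatch concrete, here is a minimal sketch of how the subset position can be mapped back to the original row. The data frame is rebuilt from the values printed above so that it does not depend on the random number generator; the helper names `rows_in_group`, `pos_in_subset` and `orig_row` are illustrative only.

```r
# Rebuild the question's data from its printed values.
df <- data.frame(
  group  = rep(c(-1, 1), times = 5),
  values = c(0.77436849, 0.19722419, 0.97801384, 0.20132735, 0.36124443,
             0.74261194, 0.97872844, 0.49811371, 0.01331584, 0.25994613)
)

# Original row numbers of the rows belonging to group 1: 2 4 6 8 10
rows_in_group <- which(df$group == 1)

# Position of the maximum *inside the subset*: 3
pos_in_subset <- which.max(df$values[rows_in_group])

# Map the subset position back to the original row number: 6
orig_row <- rows_in_group[pos_in_subset]
print(orig_row)
```

Because `which()` keeps the original positions, indexing its result with the subset position recovers the row number without going through row names.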

As a workaround I did the following:

sdf = df[df[,1] == 1,] # subset of dataframe that keeps row names
rownames(sdf[which.max(sdf[,2]),]) # returns "6" (as a character string)

This does return the correct row index from the original dataframe. However, I feel there must be an easier, more elegant solution, but I can't think of one myself. Any ideas?

I feel really stupid for asking this question but it seems I'm overlooking something very simple.

CodePudding user response:

Use match() to create an integer vector with 1 for the group of interest and NA for all other groups, then use it as a mask on values:

which.max(match(df$group, 1) * df$values)

match(df$group, 1) (this works for any group value, not just 1) returns a vector containing only NA or 1. Multiplying it by df$values yields either NA or the original value. which.max() then selects the maximum of the remaining values, because it discards NA values when locating the maximum.
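
The intermediate vectors can be inspected step by step. The sketch below rebuilds the data from the printed values and also shows an equivalent masking with `replace()`, which is not part of the answer above but may read more directly; the names `mask`, `masked`, `idx_match` and `idx_replace` are illustrative.

```r
# Rebuild the question's data from its printed values.
df <- data.frame(
  group  = rep(c(-1, 1), times = 5),
  values = c(0.77436849, 0.19722419, 0.97801384, 0.20132735, 0.36124443,
             0.74261194, 0.97872844, 0.49811371, 0.01331584, 0.25994613)
)

# match() returns the position of each element in the lookup table `1`,
# so the result is 1 for group 1 and NA for everything else.
mask <- match(df$group, 1)

# Multiplying by the mask keeps values of group 1 and turns the rest into NA.
masked <- mask * df$values

# which.max() skips NA, so the index refers to the original rows.
idx_match <- which.max(masked)

# Alternative sketch: blank out the other groups explicitly.
idx_replace <- which.max(replace(df$values, df$group != 1, NA))

print(c(idx_match, idx_replace))
```

Both variants keep the vector at its original length, which is why the returned index is already the original row number.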

CodePudding user response:

This is the tidyverse way:

library(tidyverse)
set.seed(666)
df = data.frame(group = rep(c(-1,1), times = 5, each = 1),
                values = runif(10))

df %>%
  mutate(row_number = row_number()) %>%
  group_by(group) %>%
  # sort by values in descending order
  arrange(-values) %>%
  # pick the first row of each group, i.e. its maximum
  slice(1) %>%
  select(group, row_number)
#> # A tibble: 2 x 2
#> # Groups:   group [2]
#>   group row_number
#>   <dbl>      <int>
#> 1    -1          7
#> 2     1          6

Created on 2022-02-08 by the reprex package (v2.0.1)
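
For comparison, a per-group result like the one above can probably also be obtained in base R by splitting the row numbers by group. This is only a sketch under the assumption that a named vector is an acceptable output format; `rows_by_group` and `max_rows` are illustrative names.

```r
# Rebuild the question's data from its printed values.
df <- data.frame(
  group  = rep(c(-1, 1), times = 5),
  values = c(0.77436849, 0.19722419, 0.97801384, 0.20132735, 0.36124443,
             0.74261194, 0.97872844, 0.49811371, 0.01331584, 0.25994613)
)

# Split the original row numbers (not the values) by group.
rows_by_group <- split(seq_len(nrow(df)), df$group)

# For each group, pick the original row number that holds its maximum value.
max_rows <- vapply(
  rows_by_group,
  function(i) i[which.max(df$values[i])],
  integer(1)
)

print(max_rows)
```

The result is named by group, so `max_rows[["1"]]` gives the row for group 1.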

CodePudding user response:

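Here is a base R approach, although it is not very elegant either. Note that it matches on the value alone, so it would also return rows from other groups that happen to hold the same value.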

which(df[, 2] == max(df[which(df[, 1] == 1), 2]))
[1] 6
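
A small hypothetical data frame (`ties`, not from the question) illustrates what happens when another group contains the same value as the maximum of group 1, and how adding the group condition to the comparison avoids it. This is a sketch, not a claim about how the answer was meant to be used.

```r
# Hypothetical data: row 3 (group -1) has the same value as the maximum of group 1.
ties <- data.frame(
  group  = c(-1, 1, -1, 1),
  values = c(0.9, 0.5, 0.5, 0.2)
)

group_max <- max(ties$values[ties$group == 1])

# Matching on the value alone also picks up row 3 from the other group.
value_only <- which(ties$values == group_max)

# Restricting the comparison to group 1 returns only row 2.
with_group <- which(ties$group == 1 & ties$values == group_max)

print(value_only)
print(with_group)
```

With the question's data the two expressions agree, since no other row shares the maximum of group 1.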
