Home > Software design >  Subsetting a dataframe with dplyr by first a categorial variable and the classwise frequency of a gi
Subsetting a dataframe with dplyr by first a categorial variable and the classwise frequency of a gi

Time:09-25

In my project I want to subset a dataframe with using dplyr by keeping only the rows of say variable1-class that has the most entries witinh a given value range in variable2.
As this may sound quite ambiguous here, but my dataframe is quite big I'll give you an example by the mtcars dataframe.

At the image below you see the dataframe sorted by the variable carb. Now, I want (if possible with dplyr),

  1. to filter out the TWO carb-classes, that have the most according drat-values between 4.0 and 4.3.
  2. From the two winner-classes not all rows should be kept, but only the rows within the mentionned drat-range from 4.0 to 4.3. And subsequently:
  3. If the second place should be shared by two or more carb-classes (as in the example), or the first place should be shared by three or more carb-classes, the 'winning' classes should specified by random (if possible repeatable random, analogous to set.seed)

In the end, the subset should just consist either of the 4 red marked or the 4 blue marked rows (three rows out of carb-class 1 'as winner' and one row from one of the classes 2 or 4, respectively, according to random)

I do have a basic understanding of subsetting with using dplyr. But my problem in this case is to check/compare values which have to be derived from whole, classwise 'sub-entities', - as a filtering-condition.

May anybody help here?

enter image description here

CodePudding user response:

One idea is to use the rank() function, which has an option to settle ties randomly.

library(dplyr, warn.conflicts = FALSE)
set.seed(415)

# subset drat
mt_drat <- mtcars %>%
  filter(between(drat, 4, 4.3)) 

# rank by number of rows, randomly decide ties
mt_rank <- mt_drat %>%
  group_by(carb) %>%
  summarize(n = n()) %>%
  mutate(rank = rank(-n, ties.method = 'random')) %>%
  filter(rank %in% 1:2)
  
# filter to winners
mt_drat %>%
  filter(carb %in% mt_rank$carb)
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4

Created on 2021-09-23 by the reprex package (v2.0.0)

  • Related