Home > Software engineering >  Dataframe subsetting with conditional values
Dataframe subsetting with conditional values

Time:05-03

I have some data containing different tree species and their dimensions that I want to subset, leaving only those rows where the tree's dimensions are greater than the 75th percentile. The 75th percentile will vary by species, and so I want to filter / subset the data frame so that the correct 75th percentile is used in the subset for each species.

Here's some dummy data

## GENERATE DUMMY DATA

sample_data <- data.frame(SPECIES = rep(c('OAK', 'ELM', 'BEECH', 'ASH'), each = 50),
                          SIZE = runif(200, min = 15, max = 200)) 

## COMPUTE 75TH PERCENTILES FOR EACH SPECIES
library(dplyr)

data <- data.frame (sample_data %>%
                             group_by(SPECIES) %>%
                             summarise(SIZE = quantile(SIZE, probs = .75)))

data  

  SPECIES     SIZE
1     ASH 141.2837
2   BEECH 152.4670
3     ELM 154.6553
4     OAK 121.5114

Now I want to subset the original data frame (sample_data) so that I'm left with a new data frame that only shows me the ASH trees that are >141.3, the BEECH trees that are >152.5, the ELM trees that are > 154.7 and the OAK trees that are >121.5

Happy to accept a dplyr or data.table or other solution

Thanks

CodePudding user response:

Using data.table,

require(data.table)
#> Loading required package: data.table
df <- data.table(SPECIES = rep(c('OAK', 'ELM', 'BEECH', 'ASH'), each = 50),
                          SIZE = runif(200, min = 15, max = 200)) 

df[,criteria := quantile(SIZE, probs=.75),by=SPECIES][SIZE>criteria,.(SPECIES, SIZE)]

Output

#>     SPECIES     SIZE
#>  1:     OAK 166.6108
#>  2:     OAK 152.5978
#>  3:     OAK 165.3719
#>  4:     OAK 169.5894
#>  5:     OAK 145.0316
#>  6:     OAK 144.9761
#>  7:     OAK 193.5518
#>  8:     OAK 167.0893
#>  9:     OAK 171.8388
#> 10:     OAK 163.6997
#> 11:     OAK 162.8079
#> 12:     OAK 154.3202
#> 13:     OAK 155.4055
#> 14:     ELM 182.4347
#> 15:     ELM 158.9178
#> 16:     ELM 160.9994
#> 17:     ELM 172.3293
#> 18:     ELM 160.3153
#> 19:     ELM 171.6101
#> 20:     ELM 197.8908
#> 21:     ELM 191.5175
#> 22:     ELM 187.9439
#> 23:     ELM 195.3205
#> 24:     ELM 186.9787
#> 25:     ELM 185.5459
#> 26:     ELM 176.5530
#> 27:   BEECH 142.0979
#> 28:   BEECH 163.7029
#> 29:   BEECH 196.2091
#> 30:   BEECH 193.7850
#> 31:   BEECH 184.2935
#> 32:   BEECH 180.5350
#> 33:   BEECH 168.1937
#> 34:   BEECH 198.0463
#> 35:   BEECH 166.3282
#> 36:   BEECH 175.8253
#> 37:   BEECH 174.0137
#> 38:   BEECH 142.5474
#> 39:   BEECH 158.7505
#> 40:     ASH 192.6718
#> 41:     ASH 161.2425
#> 42:     ASH 159.5650
#> 43:     ASH 173.2908
#> 44:     ASH 198.7314
#> 45:     ASH 172.9008
#> 46:     ASH 197.5794
#> 47:     ASH 173.4665
#> 48:     ASH 161.8918
#> 49:     ASH 198.9472
#> 50:     ASH 161.1769
#> 51:     ASH 178.5557
#> 52:     ASH 162.5603
#>     SPECIES     SIZE

Created on 2022-05-02 by the reprex package (v2.0.1)

CodePudding user response:

One option is to left_join, then we can filter to keep only the rows that are greater than the 75th percentile.

library(dplyr)

sample_data %>% 
  left_join(., data, by = "SPECIES") %>% 
  filter(SIZE.x > SIZE.y) %>% 
  select(-SIZE.y) %>% 
  rename(SIZE = SIZE.x)

Output

   SPECIES     SIZE
1      OAK 183.1905
2      OAK 194.0178
3      OAK 190.2304
4      OAK 177.6368
5      OAK 176.5812
6      OAK 188.6490
7      OAK 180.5927
8      OAK 183.7877
9      OAK 179.2605
10     OAK 190.0034
11     OAK 185.7922
12     OAK 187.4172
13     OAK 174.8787
14     ELM 166.7489
15     ELM 176.6458
16     ELM 142.0045
17     ELM 187.6565
18     ELM 149.7347
19     ELM 191.5863
20     ELM 198.2069
21     ELM 190.9030
22     ELM 160.5611
23     ELM 175.1305
24     ELM 186.8030
25     ELM 151.5687
26     ELM 182.5735
27   BEECH 194.4072
28   BEECH 190.6091
29   BEECH 193.4944
30   BEECH 193.5393
31   BEECH 197.1501
32   BEECH 174.7058
33   BEECH 192.5512
34   BEECH 199.9723
35   BEECH 175.9333
36   BEECH 185.4285
37   BEECH 185.3669
38   BEECH 172.7221
39   BEECH 175.0001
40     ASH 199.2849
41     ASH 174.2263
42     ASH 150.8999
43     ASH 170.8359
44     ASH 148.0282
45     ASH 159.4717
46     ASH 188.6389
47     ASH 197.0871
48     ASH 151.5592
49     ASH 187.1625
50     ASH 196.6947
51     ASH 198.6819
52     ASH 185.4842

CodePudding user response:

Here is a way that avoids pre-computing the percentiles.
Group by SPECIES and filter by the percentiles in one go.

## GENERATE DUMMY DATA

set.seed(2022)
sample_data <- data.frame(SPECIES = rep(c('OAK', 'ELM', 'BEECH', 'ASH'), each = 50),
                          SIZE = runif(200, min = 15, max = 200)) 

## Compute 75th percentiles for each species
## and filter if the SIZE is greater
suppressPackageStartupMessages(library(dplyr))

sample_data %>%
  group_by(SPECIES) %>%
  filter(SIZE > quantile(SIZE, probs = 0.75))
#> # A tibble: 52 × 2
#> # Groups:   SPECIES [4]
#>    SPECIES  SIZE
#>    <chr>   <dbl>
#>  1 OAK      166.
#>  2 OAK      158.
#>  3 OAK      172.
#>  4 OAK      171.
#>  5 OAK      188.
#>  6 OAK      200.
#>  7 OAK      158.
#>  8 OAK      172.
#>  9 OAK      164.
#> 10 OAK      157.
#> # … with 42 more rows

Created on 2022-05-02 by the reprex package (v2.0.1)

  • Related