Home > Back-end >  Is there a way to select the first row within a group with different conditions in dplyr?
Is there a way to select the first row within a group with different conditions in dplyr?

Time:01-31

I want to select a row for each group created by variable a. It should be the row with the highest value for variable c, but if variable b is TRUE, then the row with b = TRUE and maximum c within that group should be selected.

I have the following code:

set.seed(42)
a <- rep(1:3, each = 3)
b <- sample(c(0,1), size = 9, replace = T) 
c <- sample(1:9, size = 9, replace = F)

df <- data.frame(a = a, 
                 b = b,
                 c = c)

df %>% group_by(a) %>% filter(b == 1) %>% 
       arrange(desc(c), .by_group = T) %>% 
       summarise_all(function(x) x[1]) -> df1

df %>% group_by(a) %>% filter(all(b != 1)) %>% 
       arrange(desc(c), .by_group = T) %>% 
       summarise_all(function(x) x[1]) -> df2 

df3 <- rbind(df1, df2)

This works, but I wonder if there is a simpler way to achieve the same.

CodePudding user response:

You could filter the values for groups and then do your summarize.

df %>% 
  group_by(a) %>% 
  filter(all(b==0) | b==1) %>% 
  summarize(b = first(b), c = max(c))
#       a     b     c
#   <int> <dbl> <int>
# 1     1     0     8
# 2     2     1     5
# 3     3     1     9

So we only keep the values per group if b==1 or if all b==0

CodePudding user response:

We can do it with ifelse inside summarise and without the need to filter b values.

set.seed(42)
a <- rep(1:3, each = 3)
b <- sample(c(0,1), size = 9, replace = T) 
cc <- sample(1:9, size = 9, replace = F)

df <- data.frame(a = a, 
                 b = b,
                 cc = cc)

df |> 
  group_by(a) |> 
  summarise(b = max(b),teste = ifelse(any(b == 1), max(cc[b == 1]), max(cc)) )

Also, never name something c in R.

CodePudding user response:

library(data.table)
setDT(df)
# select the maximum c value, grouped by a and b
# then negative order by b (so rows with b == 1 get on top), 
# and select the first row of each a-group
df[df[, .I[c == max(c)], by = .(a,b)]$V1][order(a,-b), .SD[1], by = a]

CodePudding user response:

library(dplyr)

df %>% group_by(a) %>% 
  arrange(desc(b),desc(c), .by_group = T) %>% 
  slice_head(n = 1) %>% 
  ungroup()
#> # A tibble: 3 × 3
#>       a     b     c
#>   <int> <dbl> <int>
#> 1     1     0     8
#> 2     2     1     5
#> 3     3     1     9

Input data:

set.seed(42)
a <- rep(1:3, each = 3)
b <- sample(c(0,1), size = 9, replace = T) 
c <- sample(1:9, size = 9, replace = F)

df <- data.frame(a = a, 
                 b = b,
                 c = c)
df
#>   a b c
#> 1 1 0 8
#> 2 1 0 7
#> 3 1 0 4
#> 4 2 0 1
#> 5 2 1 5
#> 6 2 1 2
#> 7 3 1 9
#> 8 3 1 3
#> 9 3 0 6

Created on 2023-01-30 with reprex v2.0.2

  • Related