How to use R summarise with multiple numeric and text-based conditional subsets-CodePudding

I have a table containing two rows for each ID.

table <- tibble(
  id = c(1,1,2,2,3,3,4,4,5,5),
  row1 = c(2,5,2,5,1,3,2,5,3,2),
  row2 = c("foo", "other foo", "bar", "bar", "bar", "bar other", "other", "foo", "other", "other")
)
> table
# A tibble: 10 × 3
      id  row1 row2 
   <dbl> <dbl> <chr>
 1     1     2 foo  
 2     1     5 other foo
 3     2     2 bar  
 4     2     5 bar  
 5     3     1 bar
 6     3     3 bar other
 7     4     2 other
 8     4     4 foo  
 9     5     3 other
10     5     2 other

I would like to resolve the table to a single row for each ID based on three rules in succession:

If, for each ID, there is one row in which row1 is 5 or more, then choose the row where row1 is less than 5.
Else if, for each ID, there is a row in which row2 contains the word 'other', choose the row where row2 does not contain the word 'other'
Otherwise, for each ID, pick the first row.

I feel there must be a more straightforward way of doing this. This is my attempt so far, but I've can't work out how to resolve the NA to return 'bar'.

table %>%
  group_by(id) %>%
  summarise(
    row1 = ifelse(max(row1) >= 5,
            first(row1[row1 < 5]),
            ifelse(
              grep("other", row2),
              ifelse(
                !is.na(first(row1[grep("other", row2, invert = T)])),
                first(row1[grep("other", row2, invert = T)]),
                first(row1)),
              first(row1))
    ),
    row2 = ifelse(
      max(row1) >= 5,
      first(row2[row1 < 5]),
      ifelse(
        grep("other", row2),
        ifelse(
          !is.na(first(row2[grep("other", row2, invert = T)])),
          first(row2[grep("other", row2, invert = T)]),
          first(row2)),
        first(row2)
    )
)
)

# A tibble: 5 × 3
     id  row1 row2 
  <dbl> <dbl> <chr>
1     1     2 foo  
2     2     2 NA   
3     3     1 bar  
4     4     2 foo  
5     5     3 other

Desired output:

id	row1	row2
1	2	foo
2	2	bar
3	1	bar
4	2	other
5	3	other

Many thanks for your help.

CodePudding user response：

Here is how we can do it:

library(dplyr)
library(tidyr)
library(stringr)

table %>%
  group_by(id) %>% 
  separate_rows(row2) %>%
  mutate(x = ifelse(row1>=5, min(row1),NA),
         y = ifelse(str_detect(row2, 'other'), !str_detect(row2, 'other'), NA)) %>% 
  slice(1) %>% 
  select(-c(x, y))

    id  row1 row2 
  <dbl> <dbl> <chr>
1     1     2 foo  
2     2     2 bar  
3     3     1 bar  
4     4     2 other
5     5     3 other

CodePudding user response：

A dplyr solution:

table %>%
  group_by(id) %>%
  filter(row1 < 5 | n_distinct(row1 < 5) == 1) %>%
  filter(!grepl("other", row2) | n_distinct(grepl("other", row2)) == 1) %>%
  slice(1) %>% ungroup()

# # A tibble: 5 × 3
#      id  row1 row2 
#   <dbl> <dbl> <chr>
# 1     1     2 foo  
# 2     2     2 bar  
# 3     3     1 bar  
# 4     4     2 other
# 5     5     3 other

n_distinct(...) == 1 is used to determine if a condition is all TRUE or all FALSE.

CodePudding user response：

Here is one solution, that leverages this small function, f() using tidyverse or data.table

f <- function(r1,r2) {
  if(sum(r1>=5)==1) return(list("row1" =r1[r1<5], "row2"=r2[r1<5]))
  if(sum(grepl("other",r2))==1) return(list("row1" = r1[!grepl("other",r2)], "row2"=r2[!grepl("other",r2)]))
  list("row1"=r1[1],"row2"=r2[1])
}

Usage

library(tidyverse)

table %>% 
  group_by(id) %>%
  summarize(n=list(f(row1,row2))) %>%
  unnest_wider(n)

library(data.table)

setDT(table)[, f(row1,row2), by=id]

Output:

      id  row1   row2
   <num> <num> <char>
1:     1     2    foo
2:     2     2    bar
3:     3     1    bar
4:     4     2  other
5:     5     3  other

CodePudding user response：

table %>%
  group_by(id) %>%
  subset(
    case_when(
      any(row1 >= 5) ~ row1 < 5,
      any(grepl("other", row2)) ~ !grepl("other", row2),
      T ~ T
    )
  ) %>%
  filter(row_number() == 1) %>%
  ungroup()

This answer takes advantage of dplyr's grouping abilities to check for any() within each group, so it gets easy to know if a certain condition happens within a group.

It also uses case_when() to check for a series of conditions in a prioritized order, implementing what would be a series of if/else's.

Finally, since in whatever case we would like only the first row that matches the criteria, it uses the function row_number() to check whether we're on the first row within the group, in order to select it.

Output is:

# A tibble: 5 x 3
     id  row1 row2     
  <dbl> <dbl> <chr>    
1     1     2 foo      
2     2     2 bar      
3     3     1 bar other
4     4     2 other    
5     5     3 other    
>