Home > Software design >  Working with Tidyverse and the base R function ´ Which´
Working with Tidyverse and the base R function ´ Which´

Time:09-29

Hello Lovely people of SO I hope you are having a blast, I have been having so much fun working with the Base R function "Which" to deal with sequences but on a larger scale I will soon be using tidyverse to do some data wrangling and I cant seem to find a way to incorporate this functions with tidyverse. Let me please explain what I want to do:

I have the following dataset

STUDENT EVENT
James GEO_EXAM
James PIZZA_PARTY
Kylie MATH_EXAM
Kylie PIZZA_PARTY
Laura MATH_EXAM
Mark MATH_EXAM
Mark PIZZA_PARTY

I want to be able to sort the events (for simplicity these events are in the desired order) and then group by STUDENT and find what was the last event or last events or each groug of studebts prior to a PIZZA_PARTY (we will asumen the order gos from top to bottom) the dataset I have is oversiplified to ilustrate this case.

That been said for the previous data set we'll have that for James the event prior to a PIZZA_PARTY was a GEO exam, for Kylie a MATH exam, for Laura there was not event prior to a PIZZA_PARTY and finally for Mark it was a MATH examn. This would be the desired outcome.

But If I omit the fact that I indeed want to resolve this excersie grouping by STUDENTS I could easy use the following code to get started

df$EVENT[which(df$EVENT=="PIZZA_PARTY")-1]

but if I want to use tidyverse to solve this issue in the way that I want I would use:

df %>% 
  group_by(STUDENT) %>% 
  mutate(INDEX=(which(EVENT=="PIZZA_PARTY")-1))

but I get an error, I also get an error when I use summarise and I feel like I am hitting a wall so thank you so much to you guys if you could please guide me or let what literature or blogs or books I should review to learn how to propertly use base R functions with tidyverse or please reference some links or blogs that will be useful to solve this issue thank you so much

CodePudding user response:

Issue is when we have a group with missing value i.e. for 'Laura', there is no PIZZA_PARTY

> subset(df, STUDENT == "Laura" & EVENT == "PIZZA_PARTY")
[1] STUDENT EVENT  
<0 rows> (or 0-length row.names)

and thus returns logical(0). In mutate, it requires the same length of output as the number of rows (or if it is grouped, the length should match the same number of elements of group). Note that even duplicate values can result in the same error. So, it is better to get the index of the first available value

library(dplyr)
df %>% 
  group_by(STUDENT) %>% 
  mutate(INDEX=which(EVENT == "PIZZA_PARTY")[1]-1)

-output

# A tibble: 7 × 3
# Groups:   STUDENT [4]
  STUDENT EVENT       INDEX
  <chr>   <chr>       <dbl>
1 James   GEO_EXAM        1
2 James   PIZZA_PARTY     1
3 Kylie   MATH_EXAM       1
4 Kylie   PIZZA_PARTY     1
5 Laura   MATH_EXAM      NA
6 Mark    MATH_EXAM       1
7 Mark    PIZZA_PARTY     1

Or another option is to also use match

df %>%
    group_by(STUDENT) %>%
    mutate(INDEX = match("PIZZA_PARTY", EVENT) - 1)

If we need this to create a RESPONSE variable with the corresponding 'EVENT'

df %>%  
      group_by(STUDENT) %>% 
      mutate(RESPONSE = EVENT[match("PIZZA_PARTY", EVENT) - 1])

-output

# A tibble: 7 × 3
# Groups:   STUDENT [4]
  STUDENT EVENT       EVENTNEW 
  <chr>   <chr>       <chr>    
1 James   GEO_EXAM    GEO_EXAM 
2 James   PIZZA_PARTY GEO_EXAM 
3 Kylie   MATH_EXAM   MATH_EXAM
4 Kylie   PIZZA_PARTY MATH_EXAM
5 Laura   MATH_EXAM   <NA>     
6 Mark    MATH_EXAM   MATH_EXAM
7 Mark    PIZZA_PARTY MATH_EXAM

data

df <- structure(list(STUDENT = c("James", "James", "Kylie", "Kylie", 
"Laura", "Mark", "Mark"), EVENT = c("GEO_EXAM", "PIZZA_PARTY", 
"MATH_EXAM", "PIZZA_PARTY", "MATH_EXAM", "MATH_EXAM", "PIZZA_PARTY"
)), class = "data.frame", row.names = c(NA, -7L))

CodePudding user response:

You may use lead to get the event after which the next event was 'PIZZA_PARTY'.

library(dplyr)

df %>%
  group_by(STUDENT) %>%
  summarise(last_event = EVENT[lead(EVENT == 'PIZZA_PARTY', default = FALSE)])

#  STUDENT last_event
#  <chr>   <chr>     
#1 James   GEO_EXAM  
#2 Kylie   MATH_EXAM 
#3 Mark    MATH_EXAM 
  • Related