Hello Lovely people of SO I hope you are having a blast, I have been having so much fun working with the Base R function "Which" to deal with sequences but on a larger scale I will soon be using tidyverse to do some data wrangling and I cant seem to find a way to incorporate this functions with tidyverse. Let me please explain what I want to do:
I have the following dataset
STUDENT | EVENT |
---|---|
James | GEO_EXAM |
James | PIZZA_PARTY |
Kylie | MATH_EXAM |
Kylie | PIZZA_PARTY |
Laura | MATH_EXAM |
Mark | MATH_EXAM |
Mark | PIZZA_PARTY |
I want to be able to sort the events (for simplicity these events are in the desired order) and then group by STUDENT and find what was the last event or last events or each groug of studebts prior to a PIZZA_PARTY (we will asumen the order gos from top to bottom) the dataset I have is oversiplified to ilustrate this case.
That been said for the previous data set we'll have that for James the event prior to a PIZZA_PARTY was a GEO exam, for Kylie a MATH exam, for Laura there was not event prior to a PIZZA_PARTY and finally for Mark it was a MATH examn. This would be the desired outcome.
But If I omit the fact that I indeed want to resolve this excersie grouping by STUDENTS I could easy use the following code to get started
df$EVENT[which(df$EVENT=="PIZZA_PARTY")-1]
but if I want to use tidyverse to solve this issue in the way that I want I would use:
df %>%
group_by(STUDENT) %>%
mutate(INDEX=(which(EVENT=="PIZZA_PARTY")-1))
but I get an error, I also get an error when I use summarise and I feel like I am hitting a wall so thank you so much to you guys if you could please guide me or let what literature or blogs or books I should review to learn how to propertly use base R functions with tidyverse or please reference some links or blogs that will be useful to solve this issue thank you so much
CodePudding user response:
Issue is when we have a group with missing value i.e. for 'Laura', there is no PIZZA_PARTY
> subset(df, STUDENT == "Laura" & EVENT == "PIZZA_PARTY")
[1] STUDENT EVENT
<0 rows> (or 0-length row.names)
and thus returns logical(0)
. In mutate
, it requires the same length of output as the number of rows (or if it is grouped, the length should match the same number of elements of group). Note that even duplicate values can result in the same error. So, it is better to get the index of the first available value
library(dplyr)
df %>%
group_by(STUDENT) %>%
mutate(INDEX=which(EVENT == "PIZZA_PARTY")[1]-1)
-output
# A tibble: 7 × 3
# Groups: STUDENT [4]
STUDENT EVENT INDEX
<chr> <chr> <dbl>
1 James GEO_EXAM 1
2 James PIZZA_PARTY 1
3 Kylie MATH_EXAM 1
4 Kylie PIZZA_PARTY 1
5 Laura MATH_EXAM NA
6 Mark MATH_EXAM 1
7 Mark PIZZA_PARTY 1
Or another option is to also use match
df %>%
group_by(STUDENT) %>%
mutate(INDEX = match("PIZZA_PARTY", EVENT) - 1)
If we need this to create a RESPONSE variable with the corresponding 'EVENT'
df %>%
group_by(STUDENT) %>%
mutate(RESPONSE = EVENT[match("PIZZA_PARTY", EVENT) - 1])
-output
# A tibble: 7 × 3
# Groups: STUDENT [4]
STUDENT EVENT EVENTNEW
<chr> <chr> <chr>
1 James GEO_EXAM GEO_EXAM
2 James PIZZA_PARTY GEO_EXAM
3 Kylie MATH_EXAM MATH_EXAM
4 Kylie PIZZA_PARTY MATH_EXAM
5 Laura MATH_EXAM <NA>
6 Mark MATH_EXAM MATH_EXAM
7 Mark PIZZA_PARTY MATH_EXAM
data
df <- structure(list(STUDENT = c("James", "James", "Kylie", "Kylie",
"Laura", "Mark", "Mark"), EVENT = c("GEO_EXAM", "PIZZA_PARTY",
"MATH_EXAM", "PIZZA_PARTY", "MATH_EXAM", "MATH_EXAM", "PIZZA_PARTY"
)), class = "data.frame", row.names = c(NA, -7L))
CodePudding user response:
You may use lead
to get the event after which the next event was 'PIZZA_PARTY'
.
library(dplyr)
df %>%
group_by(STUDENT) %>%
summarise(last_event = EVENT[lead(EVENT == 'PIZZA_PARTY', default = FALSE)])
# STUDENT last_event
# <chr> <chr>
#1 James GEO_EXAM
#2 Kylie MATH_EXAM
#3 Mark MATH_EXAM