I have a dataset that has student ids and name of major along with semester. The main complicating factor, is that if a student returns to a previous major, I want it to count as another major
StudID Major Term
12345 History 202101
12345 History 202102
12345 Management 202201
12345 History 202202
Desired result
StudID Major Term MajorNumber
12345 History 202101 1
12345 History 202102 1
12345 Management 202201 2
12345 History 202202 3
The only thing I have tried is pulling in the earliest term for each major then doing a ranking within StudentID based on each majors earliest term, but that means that final "history" becomes a 1 instead of 3
CodePudding user response:
The data.table::rleid
function does this. Specifically, it will give all consecutive values that are the same the same ID, and will increment whenever there is a change.
With data.table,
library(data.table)
your_dt = as.data.table(your_data)
your_dt[, MajorNumber := rleid(Major), by = .(StudID)]
your_dt
# StudID Major Term MajorNumber
# 1: 12345 History 202101 1
# 2: 12345 History 202102 1
# 3: 12345 Management 202201 2
# 4: 12345 History 202202 3
Or with dplyr
:
library(dplyr)
your_data %>%
group_by(StudID) %>%
mutate(MajorNumber = data.table::rleid(Major)) %>%
ungroup()
CodePudding user response:
In base R
, use ave
rle
dat$MajorNumber <- with(dat, as.integer(ave(Major, StudID,
FUN = \(x) with(rle(x), rep(seq_along(values), lengths)))))
dat$MajorNumber
[1] 1 1 2 3
CodePudding user response:
Here is a dplyr
solution:
library(dplyr)
df %>%
mutate(x = cumsum(Major != lag(Major, default = Major[1])) 1)
StudID Major Term x
1 12345 History 202101 1
2 12345 History 202102 1
3 12345 Management 202201 2
4 12345 History 202202 3
CodePudding user response:
It may not be the best of the approach, but i tried the below code
data
data <- tribble(
~StudID, ~Major, ~Term,
12345, 'History', 202101,
12345, 'History' , 202102,
12345, 'Management', 202201,
12345, 'History', 202202
)
code
data2 <- data %>%
mutate(term2=substr(as.character(Term),1,4)) %>%
group_by(StudID,Major,term2) %>% slice(1) %>%
ungroup() %>%
arrange(StudID,Term) %>% mutate(row=row_number())
data3 <- data %>% full_join(data2 %>% select(-term2), by=c('StudID','Term','Major')) %>%
fill(row)