So, I have a dataframe that looks like this, but with 10k rows:
studentID | school_ID_1 | school_ID_2 | school_ID_3 | school_ID_4 |
---|---|---|---|---|
5555 | 3321 | 3311 | 3311 | 3311 |
5544 | 3347 | 3326 | 3350 | 3347 |
5533 | 3364 | 3314 | 3328 | 3302 |
5566 | 3389 | 3351 | 3379 | 3384 |
5577 | 3312 | 3303 | 3320 | 3320 |
If my data had only one student, using the following command:
dep_edu2016_4x = data.frame(table(unlist(dep_edu2016_4x)))
Would tell me how many times this student changed schools in a year through the number of rows the table would return, minus the studentID row. I've already made a dummy to know if they've changed schools, now I need a solution to create a new column informing me how many times each student changed schools
CodePudding user response:
dep_edu2016_4x$unique_schools <- apply(
subset(dep_edu2016_4x, select = school_ID_1:school_ID_4), 1,
function(z) length(unique(z)) - 1L)
dep_edu2016_4x
# studentID school_ID_1 school_ID_2 school_ID_3 school_ID_4 unique_schools
# 1 5555 3321 3311 3311 3311 1
# 2 5544 3347 3326 3350 3347 2
# 3 5533 3364 3314 3328 3302 3
# 4 5566 3389 3351 3379 3384 3
# 5 5577 3312 3303 3320 3320 2
Walk-through:
subset(...)
is one way to extract just the columns we need, including a range of columns; we could easily have started withdep_edu2016_4x[,-1]
to select all except thestudentID
, but (1) I don't assume there are no other columns; and even if this is your whole dataset, then (2) once we add that column, we cannot recalculate it without biasing it with that additional column.apply(X, MARGIN, FUN)
operates on each row (MARGIN=1
) of the dataset (X=x
), and applies the function (FUN
) to that data. In this case, for each row it returns a single integer.length(unique(z)) - 1L
calculates the number of distinct schools minus 1, a good proxy for the number of changes.
The count of school-changes is slightly incorrect, though: if a student starts at school 3321
, changes to 3311
, then changes back to 3321
, then length(unique(.))-1L
will not work. (This appears to be the case for student 5544
.) We'll need something a little more robust if you want to guard against that:
changes <- function(z) if (length(z)) sum(z[-1] != z[-length(z)]) else 0L
dep_edu2016_4x$schools_changes <- apply(subset(dep_edu2016_4x, select = school_ID_1:school_ID_4), 1, changes)
dep_edu2016_4x
# studentID school_ID_1 school_ID_2 school_ID_3 school_ID_4 unique_schools schools_changes
# 1 5555 3321 3311 3311 3311 1 1
# 2 5544 3347 3326 3350 3347 2 3
# 3 5533 3364 3314 3328 3302 3 3
# 4 5566 3389 3351 3379 3384 3 3
# 5 5577 3312 3303 3320 3320 2 2