Home > Software engineering >  Creating new column informing how many unique values there are in multiple columns
Creating new column informing how many unique values there are in multiple columns

Time:12-02

So, I have a dataframe that looks like this, but with 10k rows:

studentID school_ID_1 school_ID_2 school_ID_3 school_ID_4
5555 3321 3311 3311 3311
5544 3347 3326 3350 3347
5533 3364 3314 3328 3302
5566 3389 3351 3379 3384
5577 3312 3303 3320 3320

If my data had only one student, using the following command:

dep_edu2016_4x = data.frame(table(unlist(dep_edu2016_4x)))

Would tell me how many times this student changed schools in a year through the number of rows the table would return, minus the studentID row. I've already made a dummy to know if they've changed schools, now I need a solution to create a new column informing me how many times each student changed schools

CodePudding user response:

dep_edu2016_4x$unique_schools <- apply(
  subset(dep_edu2016_4x, select = school_ID_1:school_ID_4), 1,
  function(z) length(unique(z)) - 1L)
dep_edu2016_4x
#   studentID school_ID_1 school_ID_2 school_ID_3 school_ID_4 unique_schools
# 1      5555        3321        3311        3311        3311              1
# 2      5544        3347        3326        3350        3347              2
# 3      5533        3364        3314        3328        3302              3
# 4      5566        3389        3351        3379        3384              3
# 5      5577        3312        3303        3320        3320              2

Walk-through:

  • subset(...) is one way to extract just the columns we need, including a range of columns; we could easily have started with dep_edu2016_4x[,-1] to select all except the studentID, but (1) I don't assume there are no other columns; and even if this is your whole dataset, then (2) once we add that column, we cannot recalculate it without biasing it with that additional column.

  • apply(X, MARGIN, FUN) operates on each row (MARGIN=1) of the dataset (X=x), and applies the function (FUN) to that data. In this case, for each row it returns a single integer.

  • length(unique(z)) - 1L calculates the number of distinct schools minus 1, a good proxy for the number of changes.

The count of school-changes is slightly incorrect, though: if a student starts at school 3321, changes to 3311, then changes back to 3321, then length(unique(.))-1L will not work. (This appears to be the case for student 5544.) We'll need something a little more robust if you want to guard against that:

changes <- function(z) if (length(z)) sum(z[-1] != z[-length(z)]) else 0L
dep_edu2016_4x$schools_changes <- apply(subset(dep_edu2016_4x, select = school_ID_1:school_ID_4), 1, changes)
dep_edu2016_4x
#   studentID school_ID_1 school_ID_2 school_ID_3 school_ID_4 unique_schools schools_changes
# 1      5555        3321        3311        3311        3311              1               1
# 2      5544        3347        3326        3350        3347              2               3
# 3      5533        3364        3314        3328        3302              3               3
# 4      5566        3389        3351        3379        3384              3               3
# 5      5577        3312        3303        3320        3320              2               2
  •  Tags:  
  • r
  • Related