Home > Back-end >  Create category by group using a loop
Create category by group using a loop

Time:01-21

In R, I want to create a "category" variable that depends on differences between the date of birth and the initial date of individuals born within a group (min) (which is in another dataframe), so that an individual will be in a category 1 if was born between the first 10 days within his category, in category 2 if he was born in the second 10 days, etc. and I want to generate categories until all the possible dates are filled.

I hope my example clarifies it. I made a loop that is running but has some disadvantages: 1) It requires to merge the two dataframes, and 2) It requires to specify one by one the categories.

group = c("A","B","C","D","E","F","G","H","I","J")
min = c(100,125,120,98,89,110,115,130,100,90)
max = c(140,185,220,200,145,150,145,170,170,140)
DATES = data.frame(group,min,max)
ind = c(1:20)
group = c("B","B","B","A","C","D","D","E","J","I","H","A","F","I","E","F","E","C","C","A")
birth_date = c(130,180,150,110,130,200,100,140,130,100,155,130,110,120,135,115,100,150,200,105)
BIRTH = data.frame(ind,group,birth_date)
BIRTH = merge(BIRTH, DATES, by = "group")
BIRTH$category = NA
for (i in 1:nrow(BIRTH)){
  if(isTRUE(BIRTH$birth_date[i] <= BIRTH$min[i]   10)==TRUE){
    BIRTH$category[i] = 1
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i]   10 & BIRTH$birth_date[i] < BIRTH$min[i]   10*2)==TRUE) {
    BIRTH$category[i] = 2
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i]   10*2 & BIRTH$birth_date[i] < BIRTH$min[i]   10*3)==TRUE) {
    BIRTH$category[i] = 3
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i]   10*3 & BIRTH$birth_date[i] < BIRTH$min[i]   10*4)==TRUE) {
    BIRTH$category[i] = 4
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i]   10*4 & BIRTH$birth_date[i] < BIRTH$min[i]   10*5)==TRUE) {
    BIRTH$category[i] = 5
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i]   10*5 & BIRTH$birth_date[i] < BIRTH$min[i]   10*6)==TRUE) {
    BIRTH$category[i] = 6
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i]   10*6 & BIRTH$birth_date[i] < BIRTH$min[i]   10*7)==TRUE) {
    BIRTH$category[i] = 7
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i]   10*7 & BIRTH$birth_date[i] < BIRTH$min[i]   10*8)==TRUE) {
    BIRTH$category[i] = 8
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i]   10*8 & BIRTH$birth_date[i] < BIRTH$min[i]   10*9)==TRUE) {
    BIRTH$category[i] = 9
    } else {
    BIRTH$category[i] = 10
  }
}

Any suggestions to make it more elegant/efficient?

CodePudding user response:

You can just check how often the difference can be divied by 10 using the %/% operator. For example

transform(BIRTH, category=(birth_date-min) %/% 10 1)
#    group ind birth_date min max category
# 1      A   4        110 100 140        2
# 2      A  12        130 100 140        4
# 3      A  20        105 100 140        1
# 4      B   1        130 125 185        1
# 5      B   2        180 125 185        6
# 6      B   3        150 125 185        3
# 7      C   5        130 120 220        2
# ...

Note that a few of these are different than your desired output because you seem to have a different of 10 in both category 1 and 2. It seems all your other categories are open on the right other than the first which causes the confusion. The first 10 days would be 0,1,2,3,4,5,6,7,8,9 so if you have a difference of 10, then that would be in the second set of 10 days.

CodePudding user response:

Without the need to merge the data frames, match could be the way. We can write a function ctgrz_dts that categorizes dates, elaborating on @MrFlick's answer.

ctgrz_dts <- \(x, tbl=DATES, g='group', bdc='birth_date') {
  a <- tbl[match(x[[g]], tbl[['group']]), -1]
  ## optional check if birthdate in date range
  valid <- sapply(seq_len(nrow(a)), \(i) 
                  x[[bdc]][i] >= a[i, ]$min & x[[bdc]][i] <= a[i, ]$max)
  ctg <- (x[[bdc]] - a$min) %/% 10   1
  ctg[!valid] <- NA  ## set NA if date not in range
  return(ctg)
}

Note, that I defined arguments for the group and birthdate columns, to make the function more flexible.

Usage

BIRTH$cat <- ctgrz_dts(BIRTH, DATES)
head(BIRTH)
#   ind group birth_date cat
# 1   1     B        130   1
# 2   2     B        180   6
# 3   3     B        150   3
# 4   4     A        110   2
# 5   5     C        130   2
# 6   6     D        200  11

To check, we can merge and run @MrFlick's code now,

BIRTH <- merge(BIRTH, DATES, by = "group")
head(transform(BIRTH, category=(birth_date - min) %/% 10   1))
#   group ind birth_date cat min max category
# 1     A   4        110   2 100 140        2
# 2     A  12        130   4 100 140        4
# 3     A  20        105   1 100 140        1
# 4     B   1        130   1 125 185        1
# 5     B   2        180   6 125 185        6
# 6     B   3        150   3 125 185        3

to confirm that ctgrz_dts() gives the same categories without merging (note, that the order is different after merging).

  • Related