Is there a limit of factors in `dplyr::group_by`?

I'm struggling with how to calculate the wear of a component using the lag of a variable. I need to calculate the wear within different groups, so I'm using the group_by function, but here is the problem: when I group by the variable I actually need, the result is a column of NAs, yet when I test by grouping by another variable that has fewer factor levels, the calculation works.

The data frame I'm using has 4,093,902 rows and 52 columns. The variable I need to group by to perform my wear calculation has 90,183 factor levels. The other one that I tested, and that worked, has 11,321 levels.

Here's the code I'm using:

library(dplyr)

final_date <- result_data %>%
  arrange(time) %>%
  group_by(id_specific) %>%
  # wear = previous reading minus current reading, within each id_specific
  mutate(wear = dplyr::lag(some_value, n = 1, default = NA) - some_value)
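For reference, here is a minimal sketch of the symptom with made-up data (toy_data, id, and val are placeholder names, not my real columns): the first row of every group comes back as NA because of lag's default.

library(dplyr)

toy_data <- tibble(
  id   = c("a", "a", "b"),
  time = 1:3,
  val  = c(10, 7, 5)
)

toy_data %>%
  arrange(time) %>%
  group_by(id) %>%
  mutate(wear = dplyr::lag(val) - val)
#> wear: NA  3  NA  (the first row of each group is NA)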

Does anyone know if there is a limit on the number of factor levels for grouping? Or any other tips on how I can perform this calculation?

CodePudding user response:

The NAs can come from two places: lag, which by default returns NA for the first value of each group, or the other column (some_value), which may itself contain NAs. When we do the - (or any arithmetic), an NA on either the lhs or the rhs yields NA. One option is to use a function that supports na.rm = TRUE, such as rowSums.
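A quick illustration of the difference (x and y are throwaway vectors for demonstration):

x <- c(NA, 2, 3)
y <- c(1, NA, 3)

x - y
#> [1] NA NA  0

rowSums(cbind(x, -y), na.rm = TRUE)
#> [1] -1  2  0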

library(dplyr)

final_date <- result_data %>%
  arrange(time) %>%
  group_by(id_specific) %>%
  # keep the lagged value in a temporary column
  mutate(some_value_new = dplyr::lag(some_value, n = 1, default = NA)) %>%
  ungroup() %>%
  # rowSums(..., na.rm = TRUE) treats the NA as 0 instead of propagating it;
  # some_value_new = NULL drops the temporary column afterwards
  mutate(wear = rowSums(cbind(some_value_new, -1 * some_value), na.rm = TRUE),
         some_value_new = NULL)

NOTE: It is also better to ungroup() before doing the rowSums to gain some efficiency, as rowSums then runs once over the whole data rather than once per group.
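As an aside, if the NAs come only from lag's default (and not from NAs already present in some_value), a sketch of an alternative is to supply a non-NA default so the subtraction never sees one; this assumes the first observation of each group should get a wear of 0:

library(dplyr)

final_date <- result_data %>%
  arrange(time) %>%
  group_by(id_specific) %>%
  # first(some_value) as the default makes the first wear in each group 0
  mutate(wear = dplyr::lag(some_value, n = 1, default = first(some_value)) - some_value) %>%
  ungroup()

Unlike the rowSums approach above, this does not handle NAs already present in some_value.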
