Home > Mobile >  Sum of selected columns works on subset of data but not full data set
Sum of selected columns works on subset of data but not full data set

Time:07-10

I have a large R data set with over 90K observations and 400 variables representing patient diagnoses. I want to calculate the sum of the values in selected columns (named Code1 through Code200) and store the value in a new column (mytotal). The code below works when I run it with a subset (around 2K) of the observations.

mysubset <- mysubset %>%
  mutate(mytotal = select(., Code1:Code200) %>% 
           rowSums(na.rm = TRUE))

However, when I try to run the same code on the full (90K observations, same dataframe structure) dataframe, I get an error:

Adding missing grouping variables: patient_num Error in mutate(): ! Problem while computing utils = select(., Code1:Code200) %>% rowSums(na.rm = TRUE). ✖ utils must be size 1, not 92574. ℹ The error occurred in group 1: patient_num = 123456789.

I've searched online for hours to try to resolve the problem or to find an alternative solution, with no luck. If anyone has insights, I'd really appreciate them. Thank you.

Update: Just to save anyone else the hours I wasted trying to figure out the problem, it finally occurred to me to compare the subset and the full data set using class(). It turns out that the full data set had been saved as a grouped dataframe. Once I used ungroup(), the original code worked on the full data set. Apologies for the newbie distress call and thanks for the helpful responses!

CodePudding user response:

Here's a tidyverse approach, where we could take just the columns we want and reshape them into longer data, which will be simpler to sum.

set.seed(42)
df <- matrix(rnorm(9E4*400), nrow= 9E4) |> as.data.frame()

library(tidyverse)
df_sums <- df %>%
  mutate(row = row_number()) %>%
  select(row, V1:V200) %>%
  pivot_longer(-row) %>%
  count(row, wt = value, name = "mytotal")


df %>% 
  bind_cols(df_sums)
  •  Tags:  
  • r
  • Related