Passing a user-defined function to `dplyr::summarize()` when 'data' is an argument of the user-defined function

I am attempting to calculate a forestry biometric called top height for a dataset containing several forest stands, each with numerous plots. This biometric requires finding the largest-diameter trees representing 40 trees per acre in a plot or a stand, calculating the cumulative trees per acre they represent and their cumulative height, then dividing the cumulative height by the cumulative trees per acre. This requires a user-defined function, which I have created.

My function accepts five arguments: data, a data.frame of tree biometric data; dbh, the column holding the diameter at breast height of each tree; ht, the column holding the height of each tree; tpa, the trees per acre each tree represents; and n, the number of trees per acre to consider in the calculation, which defaults to 40 (a forest biometrics standard value in imperial units). As part of the function, I need to sort the trees within a plot or stand by descending dbh.

I am attempting to use dplyr::group_by() %>% summarize() to run this function on each plot and stand combination. However, when I pipe the data from group_by() to summarize(), the data do not get passed to my function. R throws the following error:

Error in `summarize()`:
! Problem while computing `TOP_HT = topht(dbh = dbh, ht = ht, tpa =
  tpa, n = 40)`.
ℹ The error occurred in group 1: groups = "A".
Caused by error:
! argument "data" is missing, with no default
Run `rlang::last_error()` to see where the error occurred.

The obvious answer would simply be to take out the data argument and define the function on the tree biometric arguments alone. However, this won't work, as I need to order all of the variables by descending dbh. Is there a way I can pass the grouped data to the data argument within my call to summarize()? Below is my reproducible example with fake data:

##Loading Necessary Package##
library(dplyr)

##Setting Random Number Seed for Reproducibility##
set.seed(55)

##Generating Some Fake Data## 
groups<-c(rep("A", 5), rep("B", 5))
ht<-rnorm(10, 125, 20)
tpa<-rnorm(10, 150, 60)
dbh<-rnorm(10, 20, 2)
DF<-data.frame(groups=groups, dbh=dbh, ht=ht, tpa=tpa)

##Defining the topht function##
topht<-function(data, dbh=NULL, ht=NULL, tpa=NULL, n=40){ #function parameters
  
  ##evaluate function parameters in the data environment
  tmp<-eval(substitute(dbh), envir = data)
  odata<-data[base::order(tmp, decreasing=TRUE),]
  ht<-eval(substitute(ht), envir=odata)
  tpa<-eval(substitute(tpa), envir=odata)
  
  #creating variables for cumulative trees per acre and cumulative height calculations#
  cumtpa<-0
  cumht<-0
  
  #beginning a loop to calculate top height#
  for(i in 1:nrow(odata)){#setting looping range
    if(cumtpa < n){ #only run cumulative adding when cumulative trees per acre is less than n
      cumtpa<-tpa[i] + cumtpa
      cumht<-(ht[i]*tpa[i]) + cumht
    }#Close conditional
    if(cumtpa==n){#End the loop if cumulative tpa = n
      break
    }#End Conditional
    if(cumtpa > n){#Adjust final tree's weight when cumulative tpa exceeds n and end loop
      delta <- cumtpa - n
      cumtpa<-cumtpa-delta
      cumht<-cumht-(delta*ht[i])
      break
    }#End Conditional
    if(cumtpa>0){#Define calculation of top height when trees per acre > 0
      topht<-cumht/cumtpa
    }else{#Define complement of conditional
      topht<-0
    }#Close conditional
  }#Close loop
  return(topht)#Output top height
}#Close function

##Attempting to run top height function independently for groups A and B##
out<-as.data.frame(DF %>% group_by(groups) %>% summarize(TOP_HT=topht(dbh=dbh,ht=ht,tpa=tpa,n=40)))#Throws error

Answer:

I tried to repair your function and apply it to your data:

library(dplyr)

topht <- function(data, dbh = NULL, ht = NULL, tpa = NULL, n = 40){ 
  
  ##evaluate function parameters in the data environment
  tmp <- data %>% pull({{ dbh }})
  odata <- data[base::order(tmp, decreasing=TRUE),]
  ht <- odata %>% pull({{ ht }})
  tpa <- odata %>% pull({{ tpa }})  # take tpa from the ordered data so it lines up with ht
  
  #creating variables for cumulative trees per acre and cumulative height calculations#
  cumtpa <- 0
  cumht <- 0
  outcome <- 0
  
  for(i in 1:nrow(odata)) {
    
    if(cumtpa < n){ 
      
      cumtpa <- tpa[i] + cumtpa
      cumht <- (ht[i] * tpa[i]) + cumht
      
    } else if(cumtpa == n){
      
      break
      
    } else  {
      
      delta <- cumtpa - n
      cumtpa <- cumtpa - delta
      cumht <- cumht - (delta*ht[i])
      break
      
    }
    
    if(cumtpa > 0) {
      
      outcome <- cumht / cumtpa
      
    } else {
      
      outcome <- 0
      
    }
    
  }   
  
  outcome
}

Now we apply this function to each group:

DF %>% 
  group_by(groups) %>% 
  group_modify(~ .x %>% summarize(TOP_HT = topht(., dbh = dbh, ht = ht, tpa = tpa, n = 40))) %>% 
  ungroup() %>% 
  as.data.frame()

We want to apply topht to each group, so we use group_modify (it's like purrr's little sister). This returns

  groups    TOP_HT
1      A  88.75246
2      B 123.01531
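
For intuition, group_modify() hands the function one data frame per group. A rough base R analogue is sketched below. The toy data and the helper tallest_by_dbh() are made up for illustration and are not part of the original post:

```r
# Toy data in the same shape as DF (values are made up for illustration)
toy <- data.frame(
  groups = c("A", "A", "B", "B"),
  dbh    = c(18, 22, 25, 19),
  ht     = c(110, 130, 140, 120)
)

# Hypothetical per-group function that, like topht(), needs the whole
# data frame: it returns the height of the largest-diameter tree.
tallest_by_dbh <- function(data) {
  data$ht[which.max(data$dbh)]
}

# split() yields one data frame per group; vapply() calls the function on each
res <- vapply(split(toy, toy$groups), tallest_by_dbh, numeric(1))
res
```

The result is a named numeric vector with one value per group, which is roughly what the group_modify() call above assembles into a data frame.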

A few words of explanation:

Since your function is named topht, you really should not use topht as a variable name (even inside this function). I changed it to outcome.
outcome should be initialised with some value. I chose 0; NA or another value would also work.
return() at the end of a function is unnecessary. Just use the variable name.
To use column names passed as arguments (like dbh = dbh) inside dplyr verbs, you need the curly-curly operator {{ }}. As a reference: https://www.r-bloggers.com/2019/06/curly-curly-the-successor-of-bang-bang/
Your first group of if statements can be combined into a single if / else if / else construction.
To improve readability, you can use some spacing (see http://adv-r.had.co.nz/Style.html).
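
As a side note, the ordering step does not strictly need the data frame. Below is a minimal vectorized sketch of the calculation described in the question, assuming tpa is never negative. The name topht_vec is hypothetical and this is an alternative formulation, not the code used above: it trims the weight of the tree that crosses n in the same step.

```r
# Hypothetical helper (not from the original post): top height from plain vectors.
# Assumes tpa >= 0 for every tree.
topht_vec <- function(dbh, ht, tpa, n = 40) {
  ord <- order(dbh, decreasing = TRUE)  # largest diameters first
  ht  <- ht[ord]
  tpa <- tpa[ord]

  # trees per acre already accumulated before each tree is added
  prior <- cumsum(tpa) - tpa

  # weight each tree by the part of its tpa that still fits under the cap n
  w <- pmin(tpa, pmax(n - prior, 0))

  # weighted mean height of the selected trees, or 0 if nothing was selected
  if (sum(w) > 0) sum(w * ht) / sum(w) else 0
}

# Two trees of 30 tpa each: the second contributes only 10 tpa to reach n = 40
topht_vec(dbh = c(20, 10), ht = c(100, 80), tpa = c(30, 30))
```

Because it takes vectors, a call such as summarize(TOP_HT = topht_vec(dbh, ht, tpa)) after group_by(groups) should work without passing the data frame at all.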