How to utilize Dplyr PULL inside a custom function with APPLY()-CodePudding

I am using the apply() function to move through each row of a dataframe in order to perform certain operations on various columns.

Background: I have two datasets.

One is of TV and Radio ads that have run in specific locations and at different times. In the example below, I am examining an interval 10 minutes before the spot runs.


    spot_DMA           spot_time         before_spot
    1 LOS ANGELES 2022-01-31 10:02:00 2022-01-31 09:52:00
    2    NEW YORK 2022-02-01 14:22:00 2022-02-01 14:12:00
    3     ATLANTA 2022-02-02 08:20:00 2022-02-02 08:10:00
    4      AUSTIN 2022-02-03 03:16:00 2022-02-03 03:06:00

The second is web traffic. I want to know what kind of web traffic occurs surrounding the ad spots. "web_sessions" is a sum of the number of sessions on the target website.


                  web_time     web_DMA web_sessions
    1  2022-01-31 09:55:00 LOS ANGELES            2
    2  2022-01-31 10:15:00 LOS ANGELES            5
    3  2022-01-31 10:18:00 LOS ANGELES            3
    4  2022-01-31 10:20:00 LOS ANGELES            5
    5  2022-01-31 10:24:00 LOS ANGELES            4
    6  2022-01-31 10:28:00 LOS ANGELES            4
    7  2022-02-02 08:11:00     ATLANTA            6
    8  2022-02-02 08:22:00     ATLANTA            7
    9  2022-02-02 08:23:00     ATLANTA            8
    10 2022-02-02 08:42:00     ATLANTA            4
    11 2022-02-02 08:43:00     ATLANTA            3
    12 2022-02-02 08:45:00     ATLANTA            1

Ad spot data:

    spot.time <- c("2022-01-31 10:02:00", "2022-02-01 14:22:00", 
                   "2022-02-02 08:20:00", "2022-02-03 03:16:00")
    
    spot.DMA <- c("LOS ANGELES", "NEW YORK", "ATLANTA", "AUSTIN")
    
    time_before <- 10
    
    spot_df <- data.frame(spot_DMA = spot.DMA, spot_time = spot.time) %>% 
      mutate(spot_time = as.POSIXct(spot_time)) %>% 
      mutate(before_spot = spot_time - minutes(time_before)) #10 minutes before the spot ran

Web traffic data:


    web.time <- c("2022-01-31 09:55:00", "2022-01-31 10:15:00", 
                  "2022-01-31 10:18:00", "2022-01-31 10:20:00",
                  "2022-01-31 10:24:00", "2022-01-31 10:28:00",
                  "2022-02-02 08:11:00", "2022-02-02 08:22:00", 
                  "2022-02-02 08:23:00", "2022-02-02 08:42:00",
                  "2022-02-02 08:43:00", "2022-02-02 08:45:00")
    
    web.DMA <- c(rep("LOS ANGELES", 6), rep("ATLANTA", 6)) 
    
    web.sessions <- c(2, 5, 3, 5, 4, 4,
                      6, 7, 8, 4, 3, 1)
    
    web_df <- data.frame(web_time = web.time, web_DMA = web.DMA, web_sessions = web.sessions) %>% 
      mutate(web_time = as.POSIXct(web_time))

Using column indices, I can subset the values I want to do calculations with.


       #This works
   
   spot_time_function_subset <- function(df){
     
     spot_dma_extract <- df[[1]]
     
     spot_time_extract <- df[[2]]
     
     spot_before_extract <- df[[3]]
     
     
     web_before <- web_df %>% 
       #Filter to any web sessions in the current DMA . . .
       filter(web_DMA == spot_dma_extract) %>% 
       #. . . between the time the spot ran and 10 minutes before
       filter(between(web_time, as.POSIXct(spot_before_extract), 
                      as.POSIXct(spot_time_extract))) %>% 
       #sum all the sessions in the filtered interval
       summarise(Total = sum(web_sessions)) %>% 
       pull(Total)
     
    
     
     return(web_before/time_before) #average number of sessions per minute
     
   }
   
   web_session_avg <- apply(spot_df, 1, spot_time_function_subset)            
   
   
   spot_df_bind <- bind_cols(spot_df, web_session_avg)

But I want to be able to reference the original column names (not subset based on index) and maintain the class of each variable.

Below is the way I would like it to work. I can pass the column names to the function, but I cannot get it to work with apply()


    #This syntax of passing columns to the function does not work with apply()
    
    spot_time_function_pull <- function(df, spot_DMA, spot_time, before_spot){
      
      spot_dma_extract <- df %>% 
        pull({{spot_DMA}})
      
      spot_time_extract <- df %>% 
        pull({{spot_time}})
      
      spot_before_extract <- df %>% 
        pull({{before_spot}})
      
      #the function will hang up here because the three vectors above
      #   are entire columns
      
      web_before <- web_df %>% 
        #Filter to any web sessions in the current DMA . . .
        filter(web_DMA == spot_dma_extract) %>% 
        #. . . between the time the spot ran and 10 minutes before
        filter(between(web_time, as.POSIXct(spot_before_extract), 
                       as.POSIXct(spot_time_extract))) %>% 
        #sum all the sessions in the filtered interval
        summarise(Total = sum(web_sessions)) %>% 
        pull(Total)
      
      
      
      return(web_before/time_before) #average number of sessions per minute
      
    }
    
    spot_time_function_pull(spot_df, spot_DMA, spot_time, before_spot)            
    
    #Apply does not work
    apply(spot_df, 1, spot_time_function_pull)

Perhaps I don't fully understand what is happening with apply().

Desired outcome: I want to be able to use apply() to step through each row of a dataframe. And I want to be able to reference the original column names with the original class of each variable. It doesn't necessarily have to be Dplyr::pull(), but I certainly prefer Dplyr because the verbs are intuitive.

Is this possible?

CodePudding user response：

If I understood well, you want to use the same function that works but passing column names instead of indices. You can do that by just changing the indices by names in the approach that is already working:

library(tidyverse)
library(lubridate)

#This works

spot_time_function_subset <- function(df){
  
  spot_dma_extract <- df[["spot_DMA"]]
  
  spot_time_extract <- df[["spot_time"]]
  
  spot_before_extract <- df[["before_spot"]]
  
  web_before <- web_df %>% 
    #Filter to any web sessions in the current DMA . . .
    filter(web_DMA == spot_dma_extract) %>% 
    #. . . between the time the spot ran and 10 minutes before
    filter(between(web_time, as.POSIXct(spot_before_extract), 
                   as.POSIXct(spot_time_extract))) %>% 
    #sum all the sessions in the filtered interval
    summarise(Total = sum(web_sessions)) %>% 
    pull(Total)
   
  return(web_before/time_before) #average number of sessions per minute
}

web_session_avg <- apply(spot_df, 1, spot_time_function_subset) 

#> [1] 0.2 0.0 0.6 0.0

Is this what you needed?