Home > Software engineering >  How to return TRUE if there was a record in the past 24 hours in R?
How to return TRUE if there was a record in the past 24 hours in R?

Time:01-20

I'm rather new to R and I'm struggling with the solution to the problem. I have 2 datasets.

First is a transactions records per customer:

transactions <- dplyr::tibble(order_id = c(1,2,3),
                              customer_id = c(1,1,2),
                              order_date =c("01.02.2023 10:00:00","01.04.2023 10:00:00","01.02.2023 10:00:00"))

Second one is when customer received an email, customer can receive several emails between orders:

emails <- dplyr::tibble(email_id = c(1,2,3,4,5,6),
                        customer_id = c(1,1,1,1,2,2),
                        email_date =c("01.01.2023 10:00:00",
                                      "01.01.2023 12:00:00",
                                      "01.02.2023 08:00:00",
                                      "01.02.2023 09:00:00",
                                      "01.02.2023 10:00:00",
                                      "01.02.2023 11:00:00"))

I would like to add a column to the first dataset that will return TRUE if customer has received an email in the last 24 hours prior to the transaction.

customer_id order_id order_date email_last_24_hours
1 1 01.02.2023 10:00:00 TRUE
1 2 01.02.2023 09:00:00 FALSE

CodePudding user response:

You can use the merge() function in R to join the two datasets on the customer_id column. Then, you can use the difftime() function to calculate the time difference between the email_date and order_date columns. Finally, you can use an ifelse() statement to create a new column that returns "TRUE" if the time difference is less than or equal to 24 hours and "FALSE" otherwise.

# Load the two datasets
transactions <- read.csv("transactions.csv")
emails <- read.csv("emails.csv")

# Merge the datasets on customer_id
merged_data <- merge(transactions, emails, by="customer_id")

# Calculate time difference between email_date and order_date
merged_data$time_difference <- difftime(merged_data$order_date, merged_data$email_date, units="hours")

# Create a new column that returns TRUE if time difference is less than or equal to 24 hours
merged_data$email_last_24_hours <- ifelse(merged_data$time_difference <= 24, "TRUE", "FALSE")

or using lapply:

transactions$email_last_24_hours <- lapply(transactions$customer_id, function(x){
  email_date <- emails[emails$customer_id == x, "email_date"]
  if(is.na(email_date) || difftime(transactions[transactions$customer_id == x, "order_date"], email_date, units="hours") > 24) {
    return("FALSE")
  } else {
    return("TRUE")
  }
})

CodePudding user response:

For this solution to truly shine, you should have a transaction ID too as well or at least something to identify an order per customer.

With dplyr and lubridate you can do

library(tidyverse)

df_1 %>% 
  left_join(df_2) %>% 
  mutate(
    emailed_24 = email_date %within% interval(order_date, order_date %m-% days(1))
  ) # Checking if email_date is within an interval, order_date - 24h (days(1))

# A tibble: 2 × 4
  customer_id order_date          email_date          emailed_24
        <dbl> <dttm>              <dttm>              <lgl>     
1           1 2023-02-01 10:00:00 2023-02-01 09:00:00 TRUE      
2           2 2023-02-01 10:00:00 2022-12-31 09:00:00 FALSE     

CodePudding user response:

make a test dataset:

transactions <- dplyr::tibble(customer_id = c(1,2),
                       order_date =c("01.02.2023 10:00:00","01.02.2023 10:00:00"))


emails <- dplyr::tibble(customer_id = c(1,1,2,2),
                 email_date =c("01.02.2023 09:00:00",
                               "01.07.2022 09:00:00",
                               "31.12.2022 09:00:00",
                               "11.10.2022 09:00:00"))

transactions$order_date <- lubridate::dmy_hms(transactions$order_date)
emails$email_date <- lubridate::dmy_hms(emails$email_date)

write the function that does what you want:

has_received_email <- function(customer_id, order_date, emails) {
  purrr::map2_lgl(customer_id, order_date, 
                  function(.customer_id, .order_date) {
                    
                    email_dates <- emails %>% 
                      dplyr::filter(customer_id == .customer_id) %>% 
                      dplyr::pull(email_date)
                    
                    any(difftime(.order_date, email_dates, units="hours")<=24)
                    
                  })
}

Use it to add a column to your transaction table:

transactions %>% 
  dplyr::mutate(received_email = has_received_email(customer_id, order_date, emails))
customer_id order_date received_email
1 2023-02-01 10:00:00 TRUE
2 2023-02-01 10:00:00 FALSE

or otherwise join, group and summarize:

dplyr::left_join(transactions, emails, by="customer_id") %>% 
  dplyr::group_by_at(all_of(names(transactions))) %>% 
  dplyr::summarize(received_email = any(difftime(order_date, email_date, units="hours")<=24))
  • Related