I'm rather new to R and I'm struggling with the solution to the problem. I have 2 datasets.
First is a transactions records per customer:
transactions <- dplyr::tibble(order_id = c(1,2,3),
customer_id = c(1,1,2),
order_date =c("01.02.2023 10:00:00","01.04.2023 10:00:00","01.02.2023 10:00:00"))
Second one is when customer received an email, customer can receive several emails between orders:
emails <- dplyr::tibble(email_id = c(1,2,3,4,5,6),
customer_id = c(1,1,1,1,2,2),
email_date =c("01.01.2023 10:00:00",
"01.01.2023 12:00:00",
"01.02.2023 08:00:00",
"01.02.2023 09:00:00",
"01.02.2023 10:00:00",
"01.02.2023 11:00:00"))
I would like to add a column to the first dataset that will return TRUE if customer has received an email in the last 24 hours prior to the transaction.
customer_id | order_id | order_date | email_last_24_hours |
---|---|---|---|
1 | 1 | 01.02.2023 10:00:00 | TRUE |
1 | 2 | 01.02.2023 09:00:00 | FALSE |
CodePudding user response:
You can use the merge()
function in R to join the two datasets on the customer_id column. Then, you can use the difftime()
function to calculate the time difference between the email_date
and order_date
columns. Finally, you can use an ifelse()
statement to create a new column that returns "TRUE" if the time difference is less than or equal to 24 hours and "FALSE" otherwise.
# Load the two datasets
transactions <- read.csv("transactions.csv")
emails <- read.csv("emails.csv")
# Merge the datasets on customer_id
merged_data <- merge(transactions, emails, by="customer_id")
# Calculate time difference between email_date and order_date
merged_data$time_difference <- difftime(merged_data$order_date, merged_data$email_date, units="hours")
# Create a new column that returns TRUE if time difference is less than or equal to 24 hours
merged_data$email_last_24_hours <- ifelse(merged_data$time_difference <= 24, "TRUE", "FALSE")
or using lapply:
transactions$email_last_24_hours <- lapply(transactions$customer_id, function(x){
email_date <- emails[emails$customer_id == x, "email_date"]
if(is.na(email_date) || difftime(transactions[transactions$customer_id == x, "order_date"], email_date, units="hours") > 24) {
return("FALSE")
} else {
return("TRUE")
}
})
CodePudding user response:
For this solution to truly shine, you should have a transaction ID too as well or at least something to identify an order per customer.
With dplyr
and lubridate
you can do
library(tidyverse)
df_1 %>%
left_join(df_2) %>%
mutate(
emailed_24 = email_date %within% interval(order_date, order_date %m-% days(1))
) # Checking if email_date is within an interval, order_date - 24h (days(1))
# A tibble: 2 × 4
customer_id order_date email_date emailed_24
<dbl> <dttm> <dttm> <lgl>
1 1 2023-02-01 10:00:00 2023-02-01 09:00:00 TRUE
2 2 2023-02-01 10:00:00 2022-12-31 09:00:00 FALSE
CodePudding user response:
make a test dataset:
transactions <- dplyr::tibble(customer_id = c(1,2),
order_date =c("01.02.2023 10:00:00","01.02.2023 10:00:00"))
emails <- dplyr::tibble(customer_id = c(1,1,2,2),
email_date =c("01.02.2023 09:00:00",
"01.07.2022 09:00:00",
"31.12.2022 09:00:00",
"11.10.2022 09:00:00"))
transactions$order_date <- lubridate::dmy_hms(transactions$order_date)
emails$email_date <- lubridate::dmy_hms(emails$email_date)
write the function that does what you want:
has_received_email <- function(customer_id, order_date, emails) {
purrr::map2_lgl(customer_id, order_date,
function(.customer_id, .order_date) {
email_dates <- emails %>%
dplyr::filter(customer_id == .customer_id) %>%
dplyr::pull(email_date)
any(difftime(.order_date, email_dates, units="hours")<=24)
})
}
Use it to add a column to your transaction table:
transactions %>%
dplyr::mutate(received_email = has_received_email(customer_id, order_date, emails))
customer_id | order_date | received_email |
---|---|---|
1 | 2023-02-01 10:00:00 | TRUE |
2 | 2023-02-01 10:00:00 | FALSE |
or otherwise join, group and summarize:
dplyr::left_join(transactions, emails, by="customer_id") %>%
dplyr::group_by_at(all_of(names(transactions))) %>%
dplyr::summarize(received_email = any(difftime(order_date, email_date, units="hours")<=24))