I'm trying to follow a customer's journey on a website and process the data in R.
I know the following:
- If
V3.previous_event_url
isNA
it is the start of the customer journey - If
V3.next_event_url
isNA
it is the end of the customer journey
So I created a new colomn Order
where if is.na(V3.previous_event_url)
the value is 1
.
Now I want to assign a value (2
) for the second page. The conditions are thus:
- The
V3.event_url
of first page (Order == 1) is the same as theV3.previous_event_url
of what has to become Order == 2 - AND The
V3.next_event_url
of first page (Order == 1) has to be the same asV3.event_url
of what has to become Order == 2.
I've tried several things, but I can't get R to look for values in the row where Order == 1, and compare those to those in the dataset as a whole,
Here's one customer journey as an example. A user can reload a page, that why the next page of Page A can be Page A.
structure(list(V3.previous_event_url = c(NA, "Page A", "Page A",
"Page B", "Page C", "Page C"), V3.event_url = c("Page A", "Page A",
"Page B", "Page C", "Page B", "Page B"), V3.next_event_url = c("Page A",
"Page B", "Page C", "Page B", "Page C", NA)), row.names = c(NA,
-6L), spec = structure(list(cols = list(V3.previous_event_url = structure(list(), class = c("collector_character",
"collector")), V3.event_url = structure(list(), class = c("collector_character",
"collector")), V3.next_event_url = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ";"), class = "col_spec"), problems = <pointer: 0x0000022049cccb00>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
CodePudding user response:
My initial attempt shows this is not possible without other information (such as a time stamp or the order of rows in the data). Consider the following table.
prev | event | next
----- ------- ------
NA | A | A
A | A | B
A | B | C
B | C | B
C | B | C
B | C | B
C | B | NA
This table describes the sequence A, A, B, C, B, C, B
. We can tell this because it is the pattern that appears vertically in each column.
If we can count on the order of rows in the dataset, then the problem reduces to producing row numbers that reset to 1 each time prev
is NA
.
However, if we can not count on this structure (the question implies that we need to use only previous, current, and next URL) then numbering urls in sequence is ambiguous.
As an example, here are three possible numbering that respect the (previous, current, next) rules:
prev | event | next | eg1 | eg2 | eg3
----- ------- ------ ----- ----- -----
NA | A | A | 1 | 1 | 1
A | A | B | 2 | 2 | 2
A | B | C | 3 | 3 | 3
B | C | B | 4 | 6 | 4
C | B | C | 5 | 5 | ?
B | C | B | 6 | 4 | ?
C | B | NA | 7 | 7 | 5
eg1
is the 'correct' ordering. eg2
incorrectly swaps rows 4 and 6. eg3
incorrectly steps from row 4 to row 7, and leaves rows 5 and 6 as orphans.
(It may be easy to see these are wrong in this simple example, but in a dataset containing 100 browsing chains, each 20 urls long, there will be more ways things can go wrong, and it will be much harder to spot where they failed.)
CodePudding user response:
Here is an approach. It does not handled ambiguity well, so it can produce odd patterns where there is ambiguity.
First the setup:
library(dplyr)
# set example input
df = data.frame(
prev_url = c(NA, "A", "A", "B", "C", "B", "C"),
this_url = c("A", "A", "B", "C", "B", "C", "B"),
next_url = c("A", "B", "C", "B", "C", "B", NA),
stringsAsFactors = FALSE
)
# create an empty column to hold the order
df = df %>%
mutate(visit_order = ifelse(is.na(prev_url), 1, NA))
The base idea of the approach is to join the table to itself (matching prev to current url), look for the previous number, and label the next url in the sequence. We will have to repeat this process for each step in the sequence.
# a single iteration
df = df %>%
# join the table to itself to match prev url with current url
left_join(
df,
by = c("prev_url" = "this_url", "this_url" = "next_url"),
suffix = c("","_lag")
) %>%
# this mutate is optional - it eliminates warnings
mutate(
visit_order_lag = pmax(visit_order_lag, 0),
visit_order_lag = coalesce(visit_order_lag, 0)
) %>%
# this group by and summarise is how I have chosen to handle repetition
# otherwise the join increases the number of records
group_by(prev_url, this_url, next_url, visit_order) %>%
summarise(lag_order = max(visit_order_lag, na.rm = TRUE), .groups = "drop") %>%
# find the label for the prev url (1) and assign the next label (2)
mutate(visit_order = ifelse(is.na(visit_order) & lag_order == 1, 2, visit_order)) %>%
select(-lag_order)
It gives the following output:
prev_url this_url next_url visit_order
NA A A 1
A A B 2
A B C NA
B C B NA
C B C NA
C B NA NA
Careful inspection will show that this dataset has lost a row. This is because rows 4 and 6 in the original dataset are duplicates, so one of them is eliminated by the group_by
and summarise
. However, if you remove this step, you will instead see the size of the dataset grow. Hence it is an imperfect solution, but the best one I can offer.
We could now repeat another iteration of the code above. Because of its size, I turned it into a function:
increment_label = function(df, current_max_visit){
output = df %>%
left_join(
df,
by = c("prev_url" = "this_url", "this_url" = "next_url"),
suffix = c("","_lag")
) %>%
mutate(
visit_order_lag = pmax(visit_order_lag, 0),
visit_order_lag = coalesce(visit_order_lag, 0)
) %>%
group_by(prev_url, this_url, next_url, visit_order) %>%
summarise(lag_order = max(visit_order_lag, na.rm = TRUE), .groups = "drop") %>%
mutate(visit_order = ifelse(is.na(visit_order) & lag_order == current_max_visit, current_max_visit 1, visit_order)) %>%
select(-lag_order)
return(output)
}
We can then call this multiple times:
df %>%
increment_label(1) %>%
increment_label(2) %>%
increment_label(3)
Or use a for loop:
for(ii in 1:100){
df = df %>% increment_label(ii)
}
Here is the expected output:
prev_url this_url next_url visit_order
NA A A 1
A A B 2
A B C 3
B C B 4
C B C 5
C B NA 5
You will have to decide how to handle duplicate visit_order
values and the loss of duplicate records.
Testing on a non-repeating sequence (e.g. ABCDEFG
) produces the expected sequence without oddities.