We have a big dataset of user actions on our internal apps. I am trying to create an algorithm for Paths & Funnels analytics which will take parameters for Paths (i.e. Start and End point) and a defined step of actions for Funnel. What is the best algorithm to program this with large data? The output should be just counts of users for specific set of actions like this :
Format of the file to scan:
UserID | Action | TS |
---|---|---|
1 | A | 06/04/2022 |
1 | B | 06/04/2022 |
1 | C | 06/04/2022 |
1 | D | 06/04/2022 |
2 | G | 06/04/2022 |
2 | H | 06/04/2022 |
2 | K | 06/04/2022 |
Algorithm input parameters:
- For Path : User statistics on the start point A and end point F
- For Funnel: User statistics on the defined steps A->B->C->D
Path | Count |
---|---|
A->B->C->D | 385 |
G->H->K | 89 |
where A,B,C,D,... are nodes for user actions or pages.
This should be easy using Python for a smaller set, but the issue is, I am worried about performance, as I am dealing with millions of records like this. Please help!
CodePudding user response:
Assuming that
...
1 A ts
1 B ts
...
in the input data means user 1 went A -> B
the algorithm is
CREATE new table paths_users_followed
CREATE new path
LOOP over data input rows, except last
IF user in row equals user in row 1
ADD action in row to path
IF row 1 is last row
ADD action in last to path
ADD user, path to paths_users_followed
ELSE
ADD user, path to paths_users_followed
CREATE new PATH
ENDLOOP
LOOP P over input of "path statistics"
COUNT occurrences of P in paths_users_followed
This can be most easily and efficiently implemented using a high performance database engine - I would use SQLite.