What is the best approach for large scale Paths and Funnels Analysis?-CodePudding

We have a big dataset of user actions on our internal apps. I am trying to create an algorithm for Paths & Funnels analytics which will take parameters for Paths (i.e. Start and End point) and a defined step of actions for Funnel. What is the best algorithm to program this with large data? The output should be just counts of users for specific set of actions like this :

Format of the file to scan:

UserID	Action	TS
1	A	06/04/2022
1	B	06/04/2022
1	C	06/04/2022
1	D	06/04/2022
2	G	06/04/2022
2	H	06/04/2022
2	K	06/04/2022

Algorithm input parameters:

For Path : User statistics on the start point A and end point F
For Funnel: User statistics on the defined steps A->B->C->D

Path	Count
A->B->C->D	385
G->H->K	89

where A,B,C,D,... are nodes for user actions or pages.

This should be easy using Python for a smaller set, but the issue is, I am worried about performance, as I am dealing with millions of records like this. Please help!

CodePudding user response：

Assuming that

...
1 A ts
1 B ts
...

in the input data means user 1 went A -> B

the algorithm is

CREATE new table paths_users_followed
CREATE new path
LOOP over data input rows, except last
   IF user in row equals user in row 1
      ADD action in row to path
      IF row 1 is last row
         ADD action in last to path
         ADD user, path to paths_users_followed
   ELSE
      ADD user, path to paths_users_followed
      CREATE new PATH
ENDLOOP

LOOP P over input of "path statistics"
    COUNT occurrences of P in paths_users_followed

This can be most easily and efficiently implemented using a high performance database engine - I would use SQLite.