I am working on creating a table to help my company with attribution modeling. We have several data sets including invoice, company, people, and event data.
Our data is complex since we deal with B2B (business-to-business) clients. So it's not as simple as viewing a single person's event and attributing the invoice total to the events (or actions) they did.
Rather, our invoices have reference to a company ID and our people have reference to a company ID - then our people have reference to their events. So I am currently joining my tables based on this relationship and have a huge table with all of the information together.
It looks like this:
INVOICE_ID | INVOICE_DATE | INVOICE_TOTAL | PERSON_COMPANY_ID | PERSON_EMAIL | EVENT_NAME | EVENT_DATE | DAYS_BETWEEN_EVENT_AND_INVOICE |
---|---|---|---|---|---|---|---|
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Spoke to Sales Rep | 2/10/2022 | 25 |
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Form Submitted | 6/8/2021 | 272 |
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Spoke to Sales Rep | 2/10/2022 | 25 |
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Clicked Email | 3/21/2022 | -14 |
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Chat on Website | 3/2/2022 | 5 |
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Opened Email | 3/7/2022 | 0 |
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Spoke to Sales Rep | 2/10/2022 | 25 |
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Google Ad | 2/28/2022 | 7 |
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Google Ad | 3/1/2022 | 6 |
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Google Ad | 3/2/2022 | 5 |
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Google Ad | 3/14/2022 | -7 |
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Spoke to Sales Rep | 2/10/2022 | 25 |
111 | 3/7/2022 | $504.80 | ABC | [email protected] | Form Submitted | 12/2/2021 | 95 |
222 | 3/7/2022 | $504.80 | XYZ | [email protected] | Spoke to Sales Rep | 2/10/2022 | 25 |
222 | 3/7/2022 | $120.25 | XYZ | [email protected] | Spoke to Sales Rep | 6/3/2021 | 277 |
222 | 3/7/2022 | $120.25 | XYZ | [email protected] | Spoke to Sales Rep | 4/8/2021 | 333 |
222 | 3/7/2022 | $120.25 | XYZ | [email protected] | Spoke to Sales Rep | 6/4/2021 | 276 |
222 | 3/7/2022 | $120.25 | XYZ | [email protected] | Spoke to Sales Rep | 2/23/2022 | 12 |
222 | 3/7/2022 | $120.25 | XYZ | [email protected] | Spoke to Sales Rep | 2/23/2022 | 12 |
222 | 3/7/2022 | $120.25 | XYZ | [email protected] | Spoke to Sales Rep | 4/8/2021 | 333 |
222 | 3/7/2022 | $120.25 | XYZ | [email protected] | Spoke to Sales Rep | 6/4/2021 | 276 |
222 | 3/7/2022 | $120.25 | XYZ | [email protected] | Spoke to Sales Rep | 2/23/2022 | 12 |
222 | 3/7/2022 | $120.25 | XYZ | [email protected] | Spoke to Sales Rep | 2/23/2022 | 12 |
I would like to create a table that has columns for positions of the events based on the last five events that took place for an invoice. And only within the last 90 days of the invoice date. So I'd like to create a new table that looks something like this perhaps:
INVOICE_ID | INVOICE_DATE | INVOICE_TOTAL | PERSON_COMPANY_ID | EVENT_5 | EVENT_5_EMAIL | EVENT_5_DATE | Event 4 | Event 4 Email | Event 4 Date | Event 3 | Event 3 Email | Event 3 Date | Event 2 | Event 2 Email | Event 2 Date | Event 1 | Event 1 Email | Event 1 Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
111 | 3/7/2022 | $504.80 | ABC | Google Ad | [email protected] | 2/28/2022 | Google Ad | [email protected] | 3/1/2022 | Google Ad | [email protected] | 3/2/2022 | Chat on Website | [email protected] | 3/2/2022 | Opened Email | [email protected] | 3/7/2022 |
222 | 3/7/2022 | $120.25 | XYZ | Spoke to Sales Rep | [email protected] | 2/23/2022 | Spoke to Sales Rep | [email protected] | 2/23/2022 | Spoke to Sales Rep | [email protected] | 2/23/2022 | Spoke to Sales Rep | [email protected] | 2/23/2022 | Spoke to Sales Rep | [email protected] | 2/23/2022 |
To try and create this I've added the DAYS_BETWEEN_EVENT_AND_INVOICE column as you see in the first table. I think using that to filter out negative values can get me closer but I'm not sure if this is the best way to go about attribution. Nor am I sure how to essentially loop through my table and fill out my second table based on these conditions: Last 5 events to an invoice, only last 90 days.
I'm using SQL, a Snowflake data warehouse and ultimately Power BI to visualize this data.
CodePudding user response:
You can do this in Power Query (=>Transform)
In the produced data from this query, there is a probable error in the invoice total for invoice 222. This is due to the probable typo wherein the most recent event row of that invoice has the same value as invoice 111.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("tdRPb4IwFADwr9KQHR1tXxHYTWe2HZfMw/4YD1WaWaEtge7gt1/VgNnEBAUupXmkv/foK10sPEqpN/IYjjAQADe9G5PAj4mbTR9nbtyajZ6sjcnWRuVc73z3dOF5blKBrEFznokSvYncBQFTUjkw9pajNr7QeteQ4NkUCs1/VkpaKxIXCHG8N/YcRNAN7696qRr4WSbXqUjQk IyOwBAK GeBp3oDbfIaPQuVqW04ohXRKeiX3Oh/9Rcrydd2IG3 sWY70ygaXJQIa5WR32hDNetC/sz 2nZvzqD oy1/HrFi3TIll3wz35tCscN2XMPRxsOYKP98fnlRmu6nLZzngLx3cuK5zrZTawobQs/xOx0M0X9 8Hp5mOMDVF/cKo/7N93vWWVQ6Gdn29kNqS/NUXC9ZAduC7DLT24LsMtu6S4tbf6y18=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [INVOICE_ID = _t, INVOICE_DATE = _t, INVOICE_TOTAL = _t, PERSON_COMPANY_ID = _t, PERSON_EMAIL = _t, EVENT_NAME = _t, EVENT_DATE = _t, DAYS_BETWEEN_EVENT_AND_INVOICE = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"INVOICE_ID", Int64.Type}, {"INVOICE_DATE", type date}, {"INVOICE_TOTAL", Currency.Type}, {"PERSON_COMPANY_ID", type text}, {"PERSON_EMAIL", type text}, {"EVENT_NAME", type text}, {"EVENT_DATE", type date}, {"DAYS_BETWEEN_EVENT_AND_INVOICE", Int64.Type}}),
//removed this column since we won't need it
#"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"DAYS_BETWEEN_EVENT_AND_INVOICE"}),
//Group by Invoice
#"Grouped Rows" = Table.Group(#"Removed Columns", {"INVOICE_ID"}, {
{"within90", (t)=> let
//Filter the table by duration between invoice date and event date
//then sort descending by event date and split off the first five rows
// note that split will be populated by fewer rows if there are not five dates in the range
x = Table.Split(
Table.Sort(
Table.SelectRows(t,
each Duration.Days([INVOICE_DATE]-[EVENT_DATE]) < 90 and
Duration.Days([INVOICE_DATE]-[EVENT_DATE]) >= 0),
{"EVENT_DATE", Order.Descending}),
5){0},
//generate a list of records, along with their field names, for those events
events = List.Generate(()=>
[evEM=x{0}[PERSON_EMAIL] , evN=x{0}[EVENT_NAME], evD=x{0}[EVENT_DATE] , idx=0],
each [idx] < Table.RowCount(x),
each [evEM=x{[idx] 1}[PERSON_EMAIL] , evN=x{[idx] 1}[EVENT_NAME], evD=x{[idx] 1}[EVENT_DATE] , idx=[idx] 1],
each Record.FromList(
{[evN],[evEM],[evD]},
{"EVENT_" & Text.From([idx] 1),
"EVENT_" & Text.From([idx] 1) & " EMAIL",
"EVENT_" & Text.From([idx] 1) & " DATE"})),
//combine the generated records with the first row of each subTable to create new table rows
newTable = Record.Combine({t{0}} & List.Reverse(events))
in
newTable}
}),
//expand the records to new columns and set the data types
#"Expanded within90" = Table.ExpandRecordColumn(#"Grouped Rows", "within90", {"INVOICE_DATE", "INVOICE_TOTAL", "PERSON_COMPANY_ID", "PERSON_EMAIL", "EVENT_NAME", "EVENT_DATE", "EVENT_5", "EVENT_5 EMAIL", "EVENT_5 DATE", "EVENT_4", "EVENT_4 EMAIL", "EVENT_4 DATE", "EVENT_3", "EVENT_3 EMAIL", "EVENT_3 DATE", "EVENT_2", "EVENT_2 EMAIL", "EVENT_2 DATE", "EVENT_1", "EVENT_1 EMAIL", "EVENT_1 DATE"}, {"INVOICE_DATE", "INVOICE_TOTAL", "PERSON_COMPANY_ID", "PERSON_EMAIL", "EVENT_NAME", "EVENT_DATE", "EVENT_5", "EVENT_5 EMAIL", "EVENT_5 DATE", "EVENT_4", "EVENT_4 EMAIL", "EVENT_4 DATE", "EVENT_3", "EVENT_3 EMAIL", "EVENT_3 DATE", "EVENT_2", "EVENT_2 EMAIL", "EVENT_2 DATE", "EVENT_1", "EVENT_1 EMAIL", "EVENT_1 DATE"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Expanded within90",{{"INVOICE_DATE", type date}, {"INVOICE_TOTAL", type number}, {"PERSON_COMPANY_ID", type text}, {"PERSON_EMAIL", type text}, {"EVENT_NAME", type text}, {"EVENT_DATE", type date}, {"EVENT_5", type text}, {"EVENT_5 EMAIL", type text}, {"EVENT_5 DATE", type date}, {"EVENT_4", type text}, {"EVENT_4 EMAIL", type text}, {"EVENT_4 DATE", type date}, {"EVENT_3", type text}, {"EVENT_3 EMAIL", type text}, {"EVENT_3 DATE", type date}, {"EVENT_2", type text}, {"EVENT_2 EMAIL", type text}, {"EVENT_2 DATE", type date}, {"EVENT_1", type text}, {"EVENT_1 EMAIL", type text}, {"EVENT_1 DATE", type date}})
in
#"Changed Type1"
CodePudding user response:
Gave a shot at solution using CTE and pivot.
with cte1 as (
select * from
(
select INVOICE_ID,INVOICE_DATE,INVOICE_TOTAL,PERSON_COMPANY_ID,
event_name, 'event_'||rn event1
from (
select INVOICE_ID,INVOICE_DATE,INVOICE_TOTAL,PERSON_COMPANY_ID,
event_name,dd,rn
from (
select INVOICE_ID,INVOICE_DATE,INVOICE_TOTAL,PERSON_COMPANY_ID,
event_name,datediff(day,event_date,invoice_date) dd,
row_number() over (partition by invoice_id order by dd desc) as rn
from invoice1 where dd<=90
)
where rn<=5
) x
)
pivot (max(event_name)
for
event1 in ('event_1','event_2','event_3','event_4','event_5')) as pvt
),
cte2 as (
select * from
(
select INVOICE_ID,INVOICE_DATE,INVOICE_TOTAL,PERSON_COMPANY_ID,PERSON_EMAIL,
'event_'||rn||'_email' event1
from (
select INVOICE_ID,INVOICE_DATE,INVOICE_TOTAL,PERSON_COMPANY_ID,PERSON_EMAIL,
dd,rn
from (
select INVOICE_ID,INVOICE_DATE,INVOICE_TOTAL,PERSON_COMPANY_ID,PERSON_EMAIL,
datediff(day,event_date,invoice_date) dd,
row_number() over (partition by invoice_id order by dd desc) as rn
from invoice1 where dd<=90
)
where rn<=5
) x
)
pivot (max(PERSON_EMAIL)
for
event1 in ('event_1_email','event_2_email','event_3_email','event_4_email','event_5_email')) as pvt
),
cte3 as (
select * from
(
select INVOICE_ID,INVOICE_DATE,INVOICE_TOTAL,PERSON_COMPANY_ID,EVENT_DATE,
'event_'||rn||'_date' event1
from (
select INVOICE_ID,INVOICE_DATE,INVOICE_TOTAL,PERSON_COMPANY_ID,EVENT_DATE,
dd,rn
from (
select INVOICE_ID,INVOICE_DATE,INVOICE_TOTAL,PERSON_COMPANY_ID,EVENT_DATE,
datediff(day,event_date,invoice_date) dd,
row_number() over (partition by invoice_id order by dd desc) as rn
from invoice1 where dd<=90
)
where rn<=5
) x
)
pivot (max(EVENT_DATE)
for
event1 in ('event_1_date','event_2_date','event_3_date','event_4_date','event_5_date')) as pvt
)
select
cte1.invoice_id,cte1.invoice_date,cte1.invoice_total,cte1.person_company_id,
cte1."'event_1'",cte2."'event_1_email'",cte3."'event_1_date'",
cte1."'event_2'",cte2."'event_2_email'",cte3."'event_2_date'",
cte1."'event_3'",cte2."'event_3_email'",cte3."'event_3_date'",
cte1."'event_4'",cte2."'event_4_email'",cte3."'event_4_date'",
cte1."'event_5'",cte2."'event_5_email'",cte3."'event_5_date'"
from cte1,cte2,cte3
where cte1.invoice_id=cte2.invoice_id
and cte2.invoice_id=cte3.invoice_id ;
Main query for this (within CTEs) is -
select INVOICE_ID,INVOICE_DATE,INVOICE_TOTAL,PERSON_COMPANY_ID,EVENT_DATE,datediff(day,event_date,invoice_date) dd,
row_number() over (partition by invoice_id order by dd desc) as rn from invoice1 where dd<=90
Followed table definition as -
table invoice1
(
INVOICE_ID number,
INVOICE_DATE date,
INVOICE_TOTAL varchar2(100),
PERSON_COMPANY_ID varchar2(100),
PERSON_EMAIL varchar2(100),
EVENT_NAME varchar2(100),
EVENT_DATE date
)