I'm trying to create something like a correlation matrix in SQL server, looking at relationships between values in a single field. The goal is something like this (I am restricted from sharing the actual fields and values): If I had a list of purchases and products, can I output a matrix that shows how many individuals who purchased a given item also purchased another distinct item. Example: If this were my data:
[Account Item Amount
ID1 Bands $20.00
ID2 Bands $20.00
ID4 Foam Roller $40.00
ID5 Foam Roller $40.00
ID3 Shirt $30.00
ID1 Weights $100.00
ID4 Weights $100.00
ID1 Yoga Mat $25.00
ID2 Yoga Mat $25.00
ID4 Yoga Mat $25.00
ID5 Yoga Mat $25.00][1]
I'd be looking to know how many accounts who bought a Yoga Mat also bought a Foam Roller, etc. My desired output is something like this:
[Bands Foam Roller Shirt Weights Yoga Mat
Bands 2 1 1
Foam Roller 2 1 1
Shirt 1
Weights 2 2
Yoga Mat 2 2 1 4][1]
The business question is two-fold:
- How many individuals making a purchase in one product category also purchased in another category?
- What was the average spend in each category?
I've tried a pivot query but it isn't working. I'm sure I haven't structured it correctly. Does anyone have a suggestion or another thread I should look at? (I followed this one, but it isn't getting me there with only one field: Creating a correlation matrix in SQL Server)
My result set from this is just a single row with the item names across the top and totals.
edited because my rows and columns were all run together; now pasted as images. edited again due to feedback that the 'convert rows to columns' question is the same. Unfortunately, it isn't. I can get my rows into columns with values; I've done that.
What I'm really trying to do is see the crosstab or overlap of values in columns vs each other. eg How many Yoga Mat purchasers also bought a Foam Roller?
CodePudding user response:
Something like this perhaps:
if object_id('tempdb..#t') is not null
drop table #t
select LTRIM(rtrim(col1)) AS id, ltrim(rtrim(col2)) AS prod, stuff(ltrim(rtrim(col3)), 1, 1, '') AS price
into #t
from
(
VALUES (N'ID1', 'Bands ','$20.00 ')
, (N'ID2', 'Bands ','$20.00 ')
, (N'ID4', 'Foam Roller ','$40.00 ')
, (N'ID5', 'Foam Roller ','$40.00 ')
, (N'ID3', 'Shirt ','$30.00 ')
, (N'ID1', 'Weights ','$100.00 ')
, (N'ID4', 'Weights ','$100.00 ')
, (N'ID1', 'Yoga Mat ','$25.00 ')
, (N'ID2', 'Yoga Mat ','$25.00 ')
, (N'ID4', 'Yoga Mat ','$25.00 ')
, (N'ID5', 'Yoga Mat ','$25.00')
) t (col1, col2, col3)
;with cte as (
select distinct prod from #t
)
, cte2 AS (
select c.prod, t2.prod AS prod2, COUNT(*) AS cnt
from CTE c
inner join #t t
ON t.prod = c.prod
inner join #t t2
ON t2.id = t.id
GROUP BY c.prod, t2.prod
)
select prod, isnull(bands, 0) AS bands, isnull([foam roller],0) AS [foam roller], isnull(shirt, 0) AS shirt, isnull(Weights, 0) AS Weights, ISNULL([Yoga mat], 0) AS [Yoga Mat]
from cte2 c
pivot (MAX(c.cnt) for prod2 IN ([Bands], [Foam Roller], [Shirt],[Weights], [Yoga Mat])) p
CodePudding user response:
You will need to join the table to itself before feeding it to the PIVOT
.
The pivot data value must be separate from the row and column values. In this case Account
is used, but it could just as well have been 1 AS DummyValue
.
SELECT PVT.*
FROM (
SELECT First.Account, First.Item, Second.Item AS OtherItem
FROM @Data First
JOIN @Data Second ON Second.Account = First.Account
) D
PIVOT (
COUNT(Account)
FOR OtherItem IN ([Bands], [Foam Roller], [Shirt], [Weights], [Yoga Mat])
) PVT
Result:
Item | Bands | Foam Roller | Shirt | Weights | Yoga Mat |
---|---|---|---|---|---|
Bands | 2 | 0 | 0 | 1 | 2 |
Foam Roller | 0 | 2 | 0 | 1 | 2 |
Shirt | 0 | 0 | 1 | 0 | 0 |
Weights | 1 | 1 | 0 | 2 | 2 |
Yoga Mat | 2 | 2 | 0 | 2 | 4 |
See this db<>fiddle for a demo.