I have a SQL Server table with an ID and a text column with size of ~ 15,000 words. For each ID, I want to return a list of every unique word and the number of times it occurred.
CREATE TABLE MyComments (
Textid char(20) NOT NULL,
Comments VARCHAR(MAX)
)
Answer would look like
TextID Word WordCount
------- ------ ----------
1 cost 5
1 expense 7
2 cost 12
2 revenue 11
..
CodePudding user response:
This is not really something SQL Server is designed to do; it will be extremely expensive to do this for a table that is large (large number of rows, extremely long text like 15K, or both). If your table is small:
SELECT c.Textid, Word = s.value, WordCount = COUNT(*)
FROM dbo.MyComments AS c
CROSS APPLY STRING_SPLIT(c.Comments, ' ') AS s
GROUP BY c.Textid, s.value
ORDER BY Textid;
- Example db<>fiddle
At scale, this query will be terrible.
As @lptr pointed out, likely better at large scale to rearrange this query in a more complicated way so that the splitting is separated from the rest of the logic:
SELECT c.Textid, Word = s.value, s.WordCount
FROM dbo.MyComments AS c
CROSS APPLY
(
SELECT value = CONVERT(varchar(255), value),
WordCount = COUNT(*)
FROM STRING_SPLIT(c.Comments, ' ')
WHERE LEN(value) > 0
GROUP BY CONVERT(varchar(255), value)
) AS s
ORDER BY c.textid;
But this is still pretty slow (for an 8,000-row table it did almost 600,000 reads and took ~15 seconds). I don't believe SQL Server is the place to do this work.