Home > Net >  SQL Server Word Frequency for Each ID
SQL Server Word Frequency for Each ID

Time:10-04

I have a SQL Server table with an ID and a text column with size of ~ 15,000 words. For each ID, I want to return a list of every unique word and the number of times it occurred.

CREATE TABLE MyComments (
Textid char(20) NOT NULL,
Comments VARCHAR(MAX)
)

Answer would look like
    TextID  Word     WordCount
    ------- ------   ----------
    1        cost     5
    1        expense  7
    2        cost     12
    2        revenue  11
 

..

CodePudding user response:

This is not really something SQL Server is designed to do; it will be extremely expensive to do this for a table that is large (large number of rows, extremely long text like 15K, or both). If your table is small:

SELECT c.Textid, Word = s.value, WordCount = COUNT(*)
  FROM dbo.MyComments AS c
  CROSS APPLY STRING_SPLIT(c.Comments, ' ') AS s
  GROUP BY c.Textid, s.value
  ORDER BY Textid;

At scale, this query will be terrible.

As @lptr pointed out, likely better at large scale to rearrange this query in a more complicated way so that the splitting is separated from the rest of the logic:

SELECT c.Textid, Word = s.value, s.WordCount
  FROM dbo.MyComments AS c
  CROSS APPLY
  (
    SELECT value = CONVERT(varchar(255), value),
           WordCount = COUNT(*)
      FROM STRING_SPLIT(c.Comments, ' ')
      WHERE LEN(value) > 0
      GROUP BY CONVERT(varchar(255), value)
  ) AS s
  ORDER BY c.textid;

But this is still pretty slow (for an 8,000-row table it did almost 600,000 reads and took ~15 seconds). I don't believe SQL Server is the place to do this work.

  • Related