How to find identical string patterns using SQL?-CodePudding

I'm cleaning up some data and have several rows that contain repetitive words within a string. For instance, "concrete-concrete" or "art-art". I could just do use a case when with a like to find these, however there are so many of these repetitions that it will take too long to find all of them. Is there a SQL function for finding duplicative patterns like this?

Here's some sample data:

category

concrete-concrete
art-art
concrete-art
music-classical
music-music-classical

CodePudding user response：

Please check the script below that is developed in SQL Server 2016 that might help you on your task.

CREATE TABLE #temp(Id int,[StringValue] varchar(100))

INSERT INTO #temp VALUES
(1,'word1-art-word2-art'),
(2,'concrete-concrete-word1-word2'),
(3,'word1-word2-art-concrete'),
(4,'art-art-concrete-concrete')

SELECT Id --Gives the ID of string where repetitive word(s) exist
FROM #temp d
WHERE EXISTS(
              SELECT value 
              FROM STRING_SPLIT(d.[StringValue],'-') 
              WHERE value<>'' 
              GROUP BY value 
              HAVING COUNT(*)>1
            )

SELECT Id, --shows the repetitive word(s) for each id.
  (
    SELECT state_code
    FROM
      (
         SELECT STUFF((SELECT CAST(', '   value AS VARCHAR(MAX)) 
         FROM STRING_SPLIT(d.[StringValue],'-')
         WHERE value<>N'' 
         GROUP BY value HAVING COUNT(*)>1
         FOR XML PATH ('')), 1, 2, '') AS state_code
      ) q
  ) repeatingwords
FROM #temp d
WHERE EXISTS(SELECT value FROM STRING_SPLIT(d.[StringValue],'-') WHERE value<>N'' GROUP BY value HAVING COUNT(*)>1)

CodePudding user response：

I would find the using the having function and delete them.

SELECT category, COUNT(*)
FROM users
GROUP BY category
HAVING COUNT(*) > 1