Home > database >  Extract and display the single most frequently used Tag from a column value containing multiple Tags
Extract and display the single most frequently used Tag from a column value containing multiple Tags

Time:03-22

Below is the table I have created and I have also inserted the values:

CREATE TABLE Tag (
    Tag_Name NVARCHAR(50)
)

INSERT INTO Tag (Tag_Name)
VALUES('<wedding-crashers>')
INSERT INTO Tag (Tag_Name)
VALUES('<analysis><star-wars>')
INSERT INTO Tag (Tag_Name)
VALUES('<comedy><the-pink-panther>')
INSERT INTO Tag (Tag_Name)
VALUES('<plot-explanation><analysis><ending><tree-of-life>')
INSERT INTO Tag (Tag_Name)
VALUES('<plot-explanation><the-departed>')
INSERT INTO Tag (Tag_Name)
VALUES('<star-wars><darth-vader><casting>')
INSERT INTO Tag (Tag_Name)
VALUES('<analysis><shutter-island>')
INSERT INTO Tag (Tag_Name)
VALUES('<tree-of-life>')
INSERT INTO Tag (Tag_Name)
VALUES('<analysis><tree-of-life>')
INSERT INTO Tag (Tag_Name)
VALUES('<inception><existenz>')
INSERT INTO Tag (Tag_Name)
VALUES('<effects><melancholia>')
INSERT INTO Tag (Tag_Name)
VALUES('<hollywood><iranian>')
INSERT INTO Tag (Tag_Name)
VALUES('<plot-explanation><pontypool>')
INSERT INTO Tag (Tag_Name)
VALUES('<the-matrix><sequels><plot-explanation>')
INSERT INTO Tag (Tag_Name)
VALUES('<editing><cut><metropolis>')
INSERT INTO Tag (Tag_Name)
VALUES('<remake><akira>')
INSERT INTO Tag (Tag_Name)
VALUES('<plot-explanation><ending><blue-valentine>')
INSERT INTO Tag (Tag_Name)
VALUES('<star-wars><plot-explanation>')
INSERT INTO Tag (Tag_Name)
VALUES('<plot-explanation><saving-private-ryan>')
INSERT INTO Tag (Tag_Name)
VALUES('<film-industry>')
INSERT INTO Tag (Tag_Name)
VALUES('<analysis><pulp-fiction>')
INSERT INTO Tag (Tag_Name)
VALUES('<analysis><back-to-the-future><time-travel>')
INSERT INTO Tag (Tag_Name)
VALUES('<analysis><quentin-tarantino><film-techniques><kill-bill>')
INSERT INTO Tag (Tag_Name)
VALUES('<alfred-hitchcock><psychological-thriller><vertigo>')
INSERT INTO Tag (Tag_Name)
VALUES('<production><bridesmaids>')
INSERT INTO Tag (Tag_Name)
VALUES('<star-wars>')
INSERT INTO Tag (Tag_Name)
VALUES('<book-adaptation><harry-potter>')
INSERT INTO Tag (Tag_Name)
VALUES('<film-techniques><animation>')
INSERT INTO Tag (Tag_Name)
VALUES('<remake><fantomas>')
INSERT INTO Tag (Tag_Name)
VALUES('<muppets>')
INSERT INTO Tag (Tag_Name)
VALUES('<muppets>')

Now the question below is:

Which single tag is most frequently used?

In order to achieve this result, this is what I have tried:

SELECT Tag_Name AS 'Single most frequent Tag'
    , MAX([Tag Count]) AS 'Maximum Tag Count'
FROM (
    SELECT Tag_Name,COUNT(Tag_Name) AS 'Tag Count'
    FROM Tag
    GROUP BY Tag_Name
) A
GROUP BY Tag_Name
ORDER BY [Maximum Tag Count] DESC

And this is the result I have got which is shown below:

enter image description here

The problem with this result is that it is showing muppets is the most frequently used tag. That is because the query recognized muppets as the only column value that has single tag in it.

If you see from the screenshot, probably plot-explanation OR star-wars is the most frequently used tag but they are not by themselves, and that is what my query is not able to recognize and isolate them.

What I am trying to achieve is to extract the most frequently used tag like plot-explanation OR star wars and Display the result with a single row something like which is shown below:

Single most common Tag  Maximum Tag Count
<plot-explanation>      26

I tried to achieve this result with my query but my query is not able to extract or isolate a tag like plot-explanation OR star wars. Rather it only recognizes muppets that is because muppets is by itself and it is being repeated by itself in 2 different rows.

How can this be fixed?

CodePudding user response:

You should normalize your data and represent one tag per record. Use this table definition:

CREATE TABLE Tag (
    ID int Tag_Group NOT NULL,
    Tag_Name NVARCHAR(50)
)

Here is the first part of your insert script, updated so that there is only one tag per record:

INSERT INTO Tag (ID, Tag_Name)
VALUES (1, '<wedding-crashers>');
INSERT INTO Tag (ID, Tag_Name)
VALUES (2, '<analysis>');
INSERT INTO Tag (ID, Tag_Name)
VALUES (2, '<star-wars>');
INSERT INTO Tag (ID, Tag_Name)
VALUES (3, '<comedy>');
INSERT INTO Tag (ID, Tag_Name)
VALUES (3, '<the-pink-panther>');

Now to find the tag(s) which have the most usage, you only need a simple aggregation query:

SELECT TOP 1 WITH TIES Tag_Name
FROM Tag
GROUP BY Tag_Name
ORDER BY COUNT(*) DESC;

CodePudding user response:

If you are using SQL Server 2016 or above you may make use of string_split function as below however inserting your data in a normalized manner (as mentioned by Tim) will be a wise choice.

declare @tbl table(string varchar(100))

insert into @tbl
values('<muppets>'),('<plot-explanation><analysis><ending><tree-of-life>')
,('<plot-explanation><blue-valentine>'),('<muppets>')

;with cte as(
select '<'   t1.value as tag, COUNT(t1.value) as [count]
from 
@tbl t
cross apply
(
select *
from string_split(t.string,'<')
)t1
where t1.value <> ''
group by t1.value
)
select tag [Single most common tag],MAX([count])[Maximum tag count]
from cte
where [count] = (select max([count]) from cte)
group by tag

Note: I have just inserted few sample values for the sake of the output.

  • Related