Home > Enterprise >  Count the most popular occurrences of a hashtag in a string column postgreSQL
Count the most popular occurrences of a hashtag in a string column postgreSQL

Time:01-30

I have a column in my dataset with the following format:

hashtags
1 [#newyears, #christmas, #christmas]
2 [#easter, #newyears, #fourthofjuly]
3 [#valentines, #christmas, #easter]

I have managed to count the hashtags like so:

SELECT hashtags, (LENGTH(hashtags) - LENGTH(REPLACE(hashtags, ',', ''))   1) AS hashtag_count
FROM full_data
ORDER BY hashtag_count DESC NULLS LAST

But I'm not sure if it's possible to count the occurrences of each hashtag. Is it possible to return the count of the most popular hashtags in the following format:

hashtags     count
christmas     3
newyears      2

The datatype is just varchar, but I'm a bit confused on how I should approach this. Any help would be appreciated!

CodePudding user response:

I think you should split all the data in Array to record and then count it with Group by. Something like this query

SELECT hashtag, count(*) as hashtag_count
FROM full_data, unnest(hashtags) s(hashtag)
GROUP BY hashtag
ORDER BY hashtag_count DESC

Hopefully, it will match your request!

CodePudding user response:

That's a bad idea to store this data. It's risky because we don't know whether the text will always be stored in exactly this form. Better save the different strings in separate columns.

Anyway, if you can't improve that and must deal with this structure, we could basically use a combination of UNNEST, STRING_TO_ARRAY and GROUP BY to split the hashtags and count them.

So the general idea is something like this:

WITH unnested AS
(SELECT
UNNEST(STRING_TO_ARRAY(hashtags, ',')) AS hashtag
FROM full_data)
SELECT hashtag, COUNT(hashtag) 
FROM unnested
GROUP BY hashtag
ORDER BY COUNT(hashtag) DESC;

Due to the braces and spaces within your column, this will not produce the correct result.

So we could additionaly use TRIM and TRANSLATE to get rid of all other things except the hashtags.

With your sample data, following construct will produce the intended outcome:

WITH unnested AS
(SELECT
TRIM(TRANSLATE(UNNEST(STRING_TO_ARRAY(hashtags, ',')),'#,[,]','')) AS hashtag
FROM full_data)
SELECT hashtag, COUNT(hashtag) 
FROM unnested
GROUP BY hashtag
ORDER BY COUNT(hashtag) DESC;

See here

But as already said, this is unpleasant and risky.

So if possible, find out which hashtags are possible (it seems these are all special days) and then create columns or a mapping table for them.

This said, store 0 or 1 in the column to indicate whether the hashtag appears or not and then sum the values per column.

  • Related