Home > Blockchain >  How to get rid of duplicate SQL results when aggregrating from 2 linking tables without using distin
How to get rid of duplicate SQL results when aggregrating from 2 linking tables without using distin

Time:11-20

My question is a performance question.

Given this postgresql, I get this result.

select fp.fp_id post_id, 
tag_linking_table.sehset_tid hashtag
FROM soc_ent social_entity
JOIN fp on fp.fp_id = social_entity.se_id
LEFT JOIN sehset tag_linking_table on tag_linking_table.sehset_seid = social_entity.se_id
GROUP BY post_id, hashtag

enter image description here

I can see that the post ending in 89 has 2 hashtags. All good so far.

I now modify the query to get the following expected result.

select fp.fp_id post_id, 
jsonb_agg(tag_linking_table.sehset_tid) hashtag
FROM soc_ent social_entity
JOIN fp on fp.fp_id = social_entity.se_id
LEFT JOIN sehset tag_linking_table on tag_linking_table.sehset_seid = social_entity.se_id
GROUP BY post_id

enter image description here

All good so far. My hashtags are properly grouped. Now if i changed the query to get the likes count using the same method above, this is my expected result

select fp.fp_id post_id, 
count(social_entity_likes.se_likes_ca) likes
FROM soc_ent social_entity
JOIN fp on fp.fp_id = social_entity.se_id
LEFT JOIN se_likes social_entity_likes on social_entity.se_id = social_entity_likes.se_likes_seid 
GROUP BY post_id

enter image description here

So in summary, my post ending in 89 has 2 likes from the Likes linking table, and 2 Hashtags from the hashtag linking table.

Now comes the problem. I want to display the hashtags and the likes in one sql query, so i modified the sql to the following.

select fp.fp_id post_id, 
jsonb_agg(tag_linking_table.sehset_tid) hashtag,
count(social_entity_likes.se_likes_ca) likes
FROM soc_ent social_entity
JOIN fp on fp.fp_id = social_entity.se_id
LEFT JOIN sehset tag_linking_table on tag_linking_table.sehset_seid = social_entity.se_id
LEFT JOIN se_likes social_entity_likes on social_entity.se_id = social_entity_likes.se_likes_seid 
GROUP BY post_id

But oddly, my result was this:

enter image description here

The solution I found was to add the word distinct inside the aggregation function which gave me the correct results.

select fp.fp_id post_id, 
jsonb_agg(distinct tag_linking_table.sehset_tid) hashtag,
count(distinct social_entity_likes.se_likes_ca) likes
FROM soc_ent social_entity
JOIN fp on fp.fp_id = social_entity.se_id
LEFT JOIN sehset tag_linking_table on tag_linking_table.sehset_seid = social_entity.se_id
LEFT JOIN se_likes social_entity_likes on social_entity.se_id = social_entity_likes.se_likes_seid 
GROUP BY post_id

enter image description here

But I am told every where that distinct will degrade the performance (and especially since i will have very large tables).

Question: What caused my duplicate problem? and is distinct the correct way to fix it? And if so how can I get the same result without using the distinct keyword? Thanks.

CodePudding user response:

Your duplicates were caused because you had two matches from fp to sehset and separately two matches from fp to se_likes. You haven't specified any restriction between sehset and se_likes so you get all the possible combinations (just like you would if you didn't specify a join condition for any join).

But I am told every where that distinct will degrade the performance (and especially since i will have very large tables).

You probably need to stop listening to "everywhere" then.

Your only options are to apply some sort of "distinct" operation before the join or after it. There's logically no other alternative, is there?

You can see this sort of thing written with a group-by in a subquery sometimes:

SELECT fp.fp_id, like_counts.count
FROM fp
JOIN (
  SELECT something_matching_post_id AS post_id, count(*)
  FROM se_likes
  GROUP BY something_matching_post_id
) AS like_counts ON fp.fp_id = like_counts.fp_id
GROUP BY fp.fp_id

Whether that is any different from the other form in terms of performance depends on what other conditions you apply and how clever the planner is about each form.

CodePudding user response:

Your join LEFT JOIN sehset tag_linking_table on tag_linking_table.sehset_seid = social_entity.se_id is pulling duplicated results. Its hard to see exactly what is going on, but I think you are joining the 'Politics' tag in the tag linking table to records that point to both the 'Politics' and the 'Technology' tags in the social_entity table. Ditto with the 'Technology' tag in the tag linking table.

Fix it with an AND in your join. LEFT JOIN sehset tag_linking_table on tag_linking_table.sehset_seid = social_entity.se_id AND linking_table.sehset.tid = whatever the corresponding filed is in the social_entity table. That might not be exactly right, but figure out why it is duplicating the join by running a select * query on just the join of those to tables with a where that limits the results to those records, and then correct it with an AND.

  • Related