I have a data set which has product, site, station, date and some numerical fields such as sn_count, blob, etc. Within every combination of product, site and station, if there are multiple entries for the same month from different dates, I want to pick only the one row with the max sn count in that month.
The code I have right now works for the most part: it filters out rows with lower sn counts in that month. But when several rows share the same max sn count, it returns all of them, whereas I want just one row per month.
This is my code:
FROM insight_info_temp a
INNER JOIN
(
SELECT distinct b.product_code,b.site_name,b.station_type,to_char(b.date_b, 'YYYY-MM') as date_new,
MAX(dist_sn_count_at_blob) as max_sn
FROM insight_info_temp b
GROUP BY b.product_code,b.site_name,b.station_type,to_char(b.date_b, 'YYYY-MM')
) b
ON a.product_code = b.product_code and
a.site_name = b.site_name and
a.station_type = b.station_type and
to_char(a.date_b, 'YYYY-MM') = b.date_new
AND a.dist_sn_count_at_blob = b.max_sn
where a.product_code = 'D00'
and a.site_name = 'F00' and a.station_type = 'A00';
The highlighted rows have the same sn count, which is the max sn count for that month. However, I only want one of these rows, not both.
Answer:
My guess is that you have two observations with the same dist_sn_count_at_blob in the same month.
This is a candidate for PostgreSQL's distinct on.
Please try something like this:
select distinct on (product_code, site_name, station_type,
                    to_char(date_b, 'YYYY-MM'))
       dist_sn_count_at_blob, last_updated_at_pkey, <other columns>
from insight_info_temp
where product_code = 'D00'
  and site_name = 'F00'
  and station_type = 'A00'
-- the order by must start with the distinct on expressions;
-- the desc sort then keeps the row with the highest count per month
order by product_code, site_name, station_type,
         to_char(date_b, 'YYYY-MM'), dist_sn_count_at_blob desc;
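If you would rather not depend on distinct on, the same one-row-per-month selection can be sketched with the row_number() window function, which most databases support. This is a different technique from the query above, not a restatement of it. The demo table insight_info_demo and its rows are made up for illustration, and using date_b desc as the tie-breaker is an assumption: when counts are tied it keeps the latest date, so swap in whichever column should decide.

```sql
-- Hypothetical sample data: column names follow the question, values are made up.
CREATE TEMP TABLE insight_info_demo (
    product_code          text,
    site_name             text,
    station_type          text,
    date_b                date,
    dist_sn_count_at_blob integer
);

INSERT INTO insight_info_demo VALUES
    ('D00', 'F00', 'A00', DATE '2021-01-05', 10),
    ('D00', 'F00', 'A00', DATE '2021-01-12', 25),  -- tied max for 2021-01
    ('D00', 'F00', 'A00', DATE '2021-01-20', 25),  -- tied max for 2021-01
    ('D00', 'F00', 'A00', DATE '2021-02-03', 7);

-- Number the rows within each product/site/station/month, highest count first.
-- date_b DESC is an assumed tie-breaker: among equal counts, keep the latest date.
SELECT product_code, site_name, station_type, date_b, dist_sn_count_at_blob
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (
               PARTITION BY product_code, site_name, station_type,
                            to_char(date_b, 'YYYY-MM')
               ORDER BY dist_sn_count_at_blob DESC, date_b DESC
           ) AS rn
    FROM insight_info_demo t
) ranked
WHERE rn = 1          -- exactly one row per partition, even when counts tie
ORDER BY date_b;
```

With the sample rows above, this returns two rows: 2021-01-20 with count 25, and 2021-02-03 with count 7. Without a tie-breaker in the order by, either approach still returns one row per month, but which of the tied rows you get is not guaranteed.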