Home > Software design >  SQL Server: DISTINCT with joins
SQL Server: DISTINCT with joins

Time:06-05

I want to print total number of orders of each state year-wise. But this is printing multiple states and not distinct.

SELECT DISTINCT 
    customer_state, 
    COUNT(*), 
    YEAR(order_purchase_timestamp) AS year 
FROM
    olist_orders_dataset 
JOIN 
    olist_customers_dataset ON olist_orders_dataset.customer_id = olist_customers_dataset.customer_id
GROUP BY 
    YEAR(order_purchase_timestamp), customer_state

I am getting this output:

State Year Num_orders
AC 2020 123
AC 2020 1234
AC 2019 234

Here is the Required Output:

State Year Num_orders
AC 2020 19995
CA 2020 188891
AL 2019 11999

CodePudding user response:

Firstly, I don't think it should be necessary to point out that your "output" doesn't match your query. Different column order and names. And somehow 2 rows for CA (123 and 1234) become 1 row (19995)? Math isn't that difficult. Simple oversights of that nature suggest a lack of effort.

Your output suggests that the two values you see as "AC" are not actually the same. Typically this means there is a trailing character that is not displayable (e.g., tab or linefeed). If that is the case, then you must first "fix" your data and then fix the process that is populating your table.

To verify that this is the problem, you can converting the column to varbinary to see the hex values stored in it. Example:

select customer_state, cast(customer_state as varbinary(20)) as bin_state, count(*) 
from @tbl 
where customer_state like 'AC%'
group by customer_state order by customer_state
;

fiddle to demonstrate. If my guess is correct, then there is an important less to learn. Stop throwing code into a query to fix a problem without understanding the cause.

And one last note. Aggregation tends to produce rows in some order. That is an artifact of the execution plans. If the order of rows in your resultset matters (and it usually does), the query MUST have an ORDER BY clause. That applies to every query - not just those using aggregation.

CodePudding user response:

you need to describe the tables you mention to know what data it contains.

If you group (order_purchase_timestamp), then your information is correct as described.

What happens if you only group by customer_state?

  • Related