Home > Mobile >  SQL query to get binary column based on existing column with group by function
SQL query to get binary column based on existing column with group by function

Time:10-30

I'm trying to use MySQL to Group based on two columns, choose the most recent date, and create two binary columns to records status. Here's an example table

_______________________________________________________________________
Letters | Numbers|         dates        |   score    |  random_status |
_______________________________________________________________________
A       |   2    | 2021-09-29 0:00:00   |    0.3     |   Sent         |
A       |   2    | 2021-10-01 0:00:00   |    1.4     |   Received     |
A       |   5    | 2021-10-04 0:00:00   |    0.8     |   Sent         |
A       |   7    | 2021-10-20 0:00:00   |    0.9     |   Sent         |
A       |   7    | 2021-10-20 0:20:00   |    0.5     |   Sent         |
R       |   7    | 2021-09-09 0:20:54   |    0.2     |   Sent         |
R       |   7    | 2021-10-14 0:00:00   |    2.5     |   Received     |
R       |   2    | 2021-10-07 0:00:00   |    0.7     |   Received     |
R       |   2    | 2021-09-14 0:00:00   |    1.7     |   Sent         |
C       |   5    | 2021-10-07 0:00:00   |    2.1     |   Sent         |
C       |   5    | 2021-10-25 0:00:00   |    3.5     |   Sent         |
C       |   7    | 2021-08-18 0:00:00   |    1.9     |   Sent         |
C       |   7    | 2021-08-29 0:00:00   |    0.6     |   Received     | 
C       |   2    | 2021-02-01 0:00:00   |    1.8     |   Sent         |

I want to group base on Letters and Numbers columns, and I want the latest date, with the latest score. I want to create two new columns based on the status column, that says if a letter and number combination was ever in sent or received status.

Something that looks like this:

Letters|    Numbers |   latest_date     |latest_score|  Has_sent| Has_received|
A      |       2    |2021-10-01 0:00:00 |     1.4    |     1    |      1      |
A      |       5    |2021-10-04 0:00:00 |     0.8    |     1    |      0      |
A      |       7    |2021-10-20 0:20:00 |     0.5    |     1    |      0      |
C      |       2    |2021-02-01 0:00:00 |     1.8    |     1    |      0      |
C      |       5    |2021-10-25 0:00:00 |     3.5    |     1    |      0      |
C      |       7    |2021-08-29 0:00:00 |     0.6    |     1    |      1      |
R      |       2    |2021-10-07 0:00:00 |     0.7    |     1    |      1      |
R      |       7    |2021-10-14 0:00:00 |     2.5    |     1    |      1      |

I used the following query

SELECT t1.Letters, t1.Numbers, MAX(t1.dates) as latest_date, t1.score as latest_score,
case when status = "Sent" then 1 else 0 end AS Has_sent,
case when status = "Received" then 1 else 0 end AS Has_received
FROM dummy_data t1
WHERE
t1.dates IN (SELECT MAX(t2.dates) FROM dummy_data t2
WHERE t1.Letters = t2.Letters AND t1.Numbers = t2.Numbers)
GROUP BY t1.Letters, t1.Numbers;

The last two columns, has_sent and has_reported, are not showing as expected. Instead I get it based on the max date. Is it doable to get it to be binary based on that status ever existed per Letter and Number combination?

CodePudding user response:

I've never used case, but from what I'm reading it appears row-wise? In that case, it won't work on both as the group by aggregates.

SELECT 
    t1.Letters, 
    t1.Numbers, 
    MAX(t1.dates)        AS latest_date, 
    LAST_VALUE(t1.score) AS latest_score, /* Since your rows appear to be chronological I assume this works */
    SELECT EXISTS(
        SELECT 1 
        FROM dummy_data 
        WHERE 
            Letters       = t1.Letters AND 
            random_status = "Sent"
    ) AS has_sent,
    SELECT EXISTS(
        SELECT 1 
        FROM dummy_data 
        WHERE 
            Letters       = t1.Letters AND 
            random_status = "Received"
    ) AS has_received
FROM dummy_data AS t1
GROUP BY t1.Letters, t1.Numbers
;

If your tables are large as many time-series are, those subqueries won't be sustainable. From the business logic I see in your example talbes:

  1. A Letter/Number pair must have a sent status.
  2. A Letter/Number pair can only have a received status if it has a historical sent status.
  3. A Letter/Number pair can have no more than one sent row and one received row. (unique constraint on Letters,Numbers,random_status)

Here is an alternative assuming those points.


SELECT 
    d_sent.Letters AS Letters,
    d_sent.Numbers AS Numbers,
    MAX(d_sent.dates, d_rcvd.dates) AS latest_date,
    (
        CASE
            WHEN isnull(d_rcvd.score) THEN d_sent.score
            ELSE d_rcvd.score
        END
    ) AS latest_score,
    1 AS is_sent,
    cast(isnull(d_rcvd.score) AS SIGNED INTEGER) AS is_received
FROM (
    SELECT * FROM dummy_data WHERE random_status="Sent"
) AS d_sent
LEFT JOIN (
    SELECT * FROM dummy_data WHERE random_status="Received"
) AS d_rcvd
ON
   d_sent.Letters = d_rcvd.Letters AND
   d_sent.Numbers = d_rcvd.Numbers
;

I'm sure there are some syntactical tweaks that must be made. Let me know how it goes.

Checking if left join is null

MySQL case function

Edit: It appears case is indeed row-wise.

CodePudding user response:

Try:

select tbl1.Letters,
       tbl1.Numbers,
       tbl1.latest_date,
       tbl1.latest_score score, 
       tbl2.Has_sent,
       tbl2.Has_received
from    (
         select Letters, 
                Numbers, 
                max(dates) as `latest_date`, 
                score as `latest_score`
         from dummy_data
         where dates in ( select max(dates) 
                          from dummy_data 
                          group by Letters, Numbers )
         group by Letters, Numbers
         ) as tbl1 
     inner join 
     
        (
           select Letters, 
            Numbers ,
           max(case when random_status = "Sent" then 1 else 0 end) AS Has_sent,
           max(case when random_status = "Received" then 1 else 0 end) AS Has_received
           from dummy_data
           group by Letters, Numbers
        
        ) as tbl2 on tbl1.Letters=tbl2.Letters and tbl1.Numbers=tbl2.Numbers;

Demo: https://www.db-fiddle.com/f/usu3XK7Gn8gGqQnusmCiLk/4

  • Related