Home > other >  mysql select count of distinct rows based on newest date and insert into another table
mysql select count of distinct rows based on newest date and insert into another table

Time:02-05

As the subject states, I have a table with multiple rows per user and I need to get a count of all rows for each user with the data from the row with the oldest date AND then insert one new row, with the count value, into another table.

I am importing an attendance list CSV file into a temporary table...which is working fine. But now I need to process all of the individual records into a single summary record to be added to the final table.

tempTable:
id   email      tDate        cValue  col4    col5   col6
==========================================================
1    [email protected]    2021-01-01    1      foo     bar    foobar
2    [email protected]    2021-01-02    1      bar     foo    barfoo      
3    [email protected]    2021-02-01    1      foo     bar    foobar
4    [email protected]    2021-01-15    1      bah     hab    bahhab
5    [email protected]    2021-02-15    1      hab     bah    habbah
5    [email protected]    2021-03-01    1      bar     foo    barfoo
6    [email protected]    2021-04-01    1      foo     bar    foobar
7    [email protected]    2021-03-01    1      hab     bah    habbah

newTable (with newest date)
id   email      tDate        cValue  col4    col5   col6
==========================================================
1    [email protected]    2021-04-01    3      foo     bar    foobar
2    [email protected]    2021-03-01    2      bar     foo    barfoo
3    [email protected]    2021-01-15    1      bah     hab    bahhab
4    [email protected]    2021-03-01    2      hab     bah    habbah

I think the below works (I have tested the select part, but not the full insert yet), but I don't know how to process the GROUP BY email based on tDate being the oldest or newest. I haven't decided yet where the oldest or newest data should be the final record - but I still need to know how to grab by the date.

INSERT INTO newTable (email,tDate,cValue,col4,col5,col6) 
SELECT 
   email,
   tDate,
   COUNT(*) as  tValue,
   col4,
   col5,
   col6 
FROM tempTable 
GROUP BY email ;

When I do an ORDER BY tDate DESC - it is simply ordering the output, not actually ordering the records before the GROUP BY.

CodePudding user response:

As Luuk has already explained, this is a non-deterministic query and you should read about MySQL Handling of GROUP BY. ONLY_FULL_GROUP_BY is a good thing and should be enabled.

The idea is to get the MAX tDate and COUNT per email and then use that to join back to the original table values -

SELECT t1.email, t1.tDate, t2.cnt AS cValue, t1.col4, t1.col5, t1.col6
FROM tempTable t1
JOIN (
    SELECT email, MAX(tDate) AS maxDate, COUNT(*) as cnt
    FROM tempTable
    GROUP BY email
) t2 ON t1.email = t2.email AND t1.tDate = t2.maxDate;

Alternatively, if using MySQL 8 you can use window functions, specifically ROW_NUMBER() to assign row numbers per partition ordered by tDate descending and then select where row_number = 1

SELECT email, tDate, cValue, col4, col5, col6
FROM (
    SELECT
        email,
        tDate,
        COUNT(*) OVER (PARTITION BY email) AS cValue,
        col4,
        col5,
        col6 ,
        ROW_NUMBER() OVER (PARTITION BY email ORDER BY tDate DESC) AS rn
    FROM tempTable
) t
WHERE t.rn = 1
ORDER BY email ASC;

CodePudding user response:

Kudos to both @nnichols and @Luuk for clueing me into the use of MIN/MAX in the query. I should have thought of it myself but...well, didn't. lol.

Just using the MIN or MAX gave me what I was looking for....but with a caveat. See caveat noted below. In my case though, I don't have to worry about the caveat so simply doing the MIN/MAX in my current query is all I needed.

INSERT INTO newTable (email,tDate,cValue,col4,col5,col6) 
SELECT 
   email,
   MAX(tDate),
   COUNT(*) as  tValue,
   col4,
   col5,
   col6 
FROM tempTable 
GROUP BY email ;

Outputted:

newTable (with newest date)
id   email      tDate        cValue  col4    col5   col6
==========================================================
1    [email protected]    2021-04-01    3      foo     bar    foobar
2    [email protected]    2021-03-01    2      bar     foo    barfoo
3    [email protected]    2021-01-15    1      bah     hab    bahhab
4    [email protected]    2021-03-01    2      hab     bah    habbah

Or with MIN:

newTable (with newest date)
id   email      tDate        cValue  col4    col5   col6
==========================================================
1    [email protected]    2021-01-01    3      foo     bar    foobar
2    [email protected]    2021-01-02    2      bar     foo    barfoo
3    [email protected]    2021-01-15    1      bah     hab    bahhab
4    [email protected]    2021-02-15    2      hab     bah    habbah

CAVEAT: the MAX(tDate) being selected does not mean the rest of the columns being selected are also from the same row. I saw instances where if col4 for the same email address had different values, that the MIN/MAX(tDate) returned the same col4 value even though the returned tDate was different each time.

If I truly needed all the data from the same tDate row (MIN or MAX), then I would need to do query more like @nnichols answered with.

  • Related