mysql select count of distinct rows based on newest date and insert into another table-CodePudding

As the subject states, I have a table with multiple rows per user and I need to get a count of all rows for each user with the data from the row with the oldest date AND then insert one new row, with the count value, into another table.

I am importing an attendance list CSV file into a temporary table...which is working fine. But now I need to process all of the individual records into a single summary record to be added to the final table.

tempTable:
id   email      tDate        cValue  col4    col5   col6
==========================================================
1    [email protected]    2021-01-01    1      foo     bar    foobar
2    [email protected]    2021-01-02    1      bar     foo    barfoo      
3    [email protected]    2021-02-01    1      foo     bar    foobar
4    [email protected]    2021-01-15    1      bah     hab    bahhab
5    [email protected]    2021-02-15    1      hab     bah    habbah
5    [email protected]    2021-03-01    1      bar     foo    barfoo
6    [email protected]    2021-04-01    1      foo     bar    foobar
7    [email protected]    2021-03-01    1      hab     bah    habbah

newTable (with newest date)
id   email      tDate        cValue  col4    col5   col6
==========================================================
1    [email protected]    2021-04-01    3      foo     bar    foobar
2    [email protected]    2021-03-01    2      bar     foo    barfoo
3    [email protected]    2021-01-15    1      bah     hab    bahhab
4    [email protected]    2021-03-01    2      hab     bah    habbah

I think the below works (I have tested the select part, but not the full insert yet), but I don't know how to process the GROUP BY email based on tDate being the oldest or newest. I haven't decided yet where the oldest or newest data should be the final record - but I still need to know how to grab by the date.

INSERT INTO newTable (email,tDate,cValue,col4,col5,col6) 
SELECT 
   email,
   tDate,
   COUNT(*) as  tValue,
   col4,
   col5,
   col6 
FROM tempTable 
GROUP BY email ;

When I do an ORDER BY tDate DESC - it is simply ordering the output, not actually ordering the records before the GROUP BY.

CodePudding user response：

As Luuk has already explained, this is a non-deterministic query and you should read about MySQL Handling of GROUP BY. ONLY_FULL_GROUP_BY is a good thing and should be enabled.

The idea is to get the MAX tDate and COUNT per email and then use that to join back to the original table values -

SELECT t1.email, t1.tDate, t2.cnt AS cValue, t1.col4, t1.col5, t1.col6
FROM tempTable t1
JOIN (
    SELECT email, MAX(tDate) AS maxDate, COUNT(*) as cnt
    FROM tempTable
    GROUP BY email
) t2 ON t1.email = t2.email AND t1.tDate = t2.maxDate;

Alternatively, if using MySQL 8 you can use window functions, specifically ROW_NUMBER() to assign row numbers per partition ordered by tDate descending and then select where row_number = 1

SELECT email, tDate, cValue, col4, col5, col6
FROM (
    SELECT
        email,
        tDate,
        COUNT(*) OVER (PARTITION BY email) AS cValue,
        col4,
        col5,
        col6 ,
        ROW_NUMBER() OVER (PARTITION BY email ORDER BY tDate DESC) AS rn
    FROM tempTable
) t
WHERE t.rn = 1
ORDER BY email ASC;

CodePudding user response：

Kudos to both @nnichols and @Luuk for clueing me into the use of MIN/MAX in the query. I should have thought of it myself but...well, didn't. lol.

Just using the MIN or MAX gave me what I was looking for....but with a caveat. See caveat noted below. In my case though, I don't have to worry about the caveat so simply doing the MIN/MAX in my current query is all I needed.

INSERT INTO newTable (email,tDate,cValue,col4,col5,col6) 
SELECT 
   email,
   MAX(tDate),
   COUNT(*) as  tValue,
   col4,
   col5,
   col6 
FROM tempTable 
GROUP BY email ;

Outputted:

newTable (with newest date)
id   email      tDate        cValue  col4    col5   col6
==========================================================
1    [email protected]    2021-04-01    3      foo     bar    foobar
2    [email protected]    2021-03-01    2      bar     foo    barfoo
3    [email protected]    2021-01-15    1      bah     hab    bahhab
4    [email protected]    2021-03-01    2      hab     bah    habbah

Or with MIN:

newTable (with newest date)
id   email      tDate        cValue  col4    col5   col6
==========================================================
1    [email protected]    2021-01-01    3      foo     bar    foobar
2    [email protected]    2021-01-02    2      bar     foo    barfoo
3    [email protected]    2021-01-15    1      bah     hab    bahhab
4    [email protected]    2021-02-15    2      hab     bah    habbah

CAVEAT: the MAX(tDate) being selected does not mean the rest of the columns being selected are also from the same row. I saw instances where if col4 for the same email address had different values, that the MIN/MAX(tDate) returned the same col4 value even though the returned tDate was different each time.

If I truly needed all the data from the same tDate row (MIN or MAX), then I would need to do query more like @nnichols answered with.