As the subject states, I have a table with multiple rows per user and I need to get a count of all rows for each user with the data from the row with the oldest date AND then insert one new row, with the count value, into another table.
I am importing an attendance list CSV file into a temporary table...which is working fine. But now I need to process all of the individual records into a single summary record to be added to the final table.
tempTable:
id email tDate cValue col4 col5 col6
==========================================================
1 [email protected] 2021-01-01 1 foo bar foobar
2 [email protected] 2021-01-02 1 bar foo barfoo
3 [email protected] 2021-02-01 1 foo bar foobar
4 [email protected] 2021-01-15 1 bah hab bahhab
5 [email protected] 2021-02-15 1 hab bah habbah
5 [email protected] 2021-03-01 1 bar foo barfoo
6 [email protected] 2021-04-01 1 foo bar foobar
7 [email protected] 2021-03-01 1 hab bah habbah
newTable (with newest date)
id email tDate cValue col4 col5 col6
==========================================================
1 [email protected] 2021-04-01 3 foo bar foobar
2 [email protected] 2021-03-01 2 bar foo barfoo
3 [email protected] 2021-01-15 1 bah hab bahhab
4 [email protected] 2021-03-01 2 hab bah habbah
I think the below works (I have tested the select
part, but not the full insert
yet), but I don't know how to process the GROUP BY email
based on tDate being the oldest or newest. I haven't decided yet where the oldest or newest data should be the final record - but I still need to know how to grab by the date.
INSERT INTO newTable (email,tDate,cValue,col4,col5,col6)
SELECT
email,
tDate,
COUNT(*) as tValue,
col4,
col5,
col6
FROM tempTable
GROUP BY email ;
When I do an ORDER BY tDate DESC
- it is simply ordering the output, not actually ordering the records before the GROUP BY.
CodePudding user response:
As Luuk has already explained, this is a non-deterministic query and you should read about MySQL Handling of GROUP BY. ONLY_FULL_GROUP_BY is a good thing and should be enabled.
The idea is to get the MAX tDate and COUNT per email and then use that to join back to the original table values -
SELECT t1.email, t1.tDate, t2.cnt AS cValue, t1.col4, t1.col5, t1.col6
FROM tempTable t1
JOIN (
SELECT email, MAX(tDate) AS maxDate, COUNT(*) as cnt
FROM tempTable
GROUP BY email
) t2 ON t1.email = t2.email AND t1.tDate = t2.maxDate;
Alternatively, if using MySQL 8 you can use window functions, specifically ROW_NUMBER() to assign row numbers per partition ordered by tDate descending and then select where row_number = 1
SELECT email, tDate, cValue, col4, col5, col6
FROM (
SELECT
email,
tDate,
COUNT(*) OVER (PARTITION BY email) AS cValue,
col4,
col5,
col6 ,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY tDate DESC) AS rn
FROM tempTable
) t
WHERE t.rn = 1
ORDER BY email ASC;
CodePudding user response:
Kudos to both @nnichols and @Luuk for clueing me into the use of MIN/MAX in the query. I should have thought of it myself but...well, didn't. lol.
Just using the MIN
or MAX
gave me what I was looking for....but with a caveat. See caveat noted below. In my case though, I don't have to worry about the caveat so simply doing the MIN/MAX
in my current query is all I needed.
INSERT INTO newTable (email,tDate,cValue,col4,col5,col6)
SELECT
email,
MAX(tDate),
COUNT(*) as tValue,
col4,
col5,
col6
FROM tempTable
GROUP BY email ;
Outputted:
newTable (with newest date)
id email tDate cValue col4 col5 col6
==========================================================
1 [email protected] 2021-04-01 3 foo bar foobar
2 [email protected] 2021-03-01 2 bar foo barfoo
3 [email protected] 2021-01-15 1 bah hab bahhab
4 [email protected] 2021-03-01 2 hab bah habbah
Or with MIN
:
newTable (with newest date)
id email tDate cValue col4 col5 col6
==========================================================
1 [email protected] 2021-01-01 3 foo bar foobar
2 [email protected] 2021-01-02 2 bar foo barfoo
3 [email protected] 2021-01-15 1 bah hab bahhab
4 [email protected] 2021-02-15 2 hab bah habbah
CAVEAT: the MAX(tDate)
being selected does not mean the rest of the columns being selected are also from the same row. I saw instances where if col4 for the same email address had different values, that the MIN/MAX(tDate)
returned the same col4 value even though the returned tDate
was different each time.
If I truly needed all the data from the same tDate row (MIN or MAX), then I would need to do query more like @nnichols answered with.