So I have the following key/value pair table, where users submit data through a form and each question on the form is added to the table here as an individual row. Submission_id
identifies each form submission.
---- --------------- -------------- --------
| id | submission_id | key | value |
---- --------------- -------------- --------
| 1 | 10 | manufacturer | Apple |
| 2 | 10 | model | 5s |
| 3 | 10 | firstname | Paul |
| 4 | 15 | manufacturer | Apple |
| 5 | 15 | model | 5s |
| 6 | 15 | firstname | Paul |
| 7 | 20 | manufacturer | Apple |
| 8 | 20 | model | 5s |
| 9 | 20 | firstname | Andrew |
---- --------------- -------------- --------
From the data above you can see that the submissions with id of 10 and 15 both have the same values (just different submission id). This is basically because a user has submitted the same form twice and so is a duplicate.
Im trying to find a way to order these table where the any duplicate submissions appear together in order. Given the above table I am trying to build a query that gives me the result as below:
---------------
| submission_id |
---------------
| 10 |
| 15 |
| 20 |
---------------
So I want to check to see if a submission where the manufacturer
, model
and firstname
keys have the same value. If it does then these get the submission id and place them adjacently in the result. In the actual table there are other keys, but I only want to match duplicates based on these 3 keys (manufacturer, model, firstname).
I’ve been going back and forth to the drawing board quite some time now and have tried looking for some possible solutions but cannot get something reliable.
CodePudding user response:
That's not a key value table. It's usually called an Entity-Attribute-Value table/relation/pattern.
Looking at the problem, it would be trivial if the table were laid out in conventional 1st 2nd Normal form - you just do a join on the values, group by those and take a count....
SELECT manufacturer, model, firstname, COUNT(DISTINCT submission_id)
FROM atable
GROUP BY manufacturer, model, firstname
HAVING COUNT(DISTINCT submission_id)>1;
Or a join....
SELECT a.manufacturer, a.model, a.firstname
, a.submission_id, b.submission_id
FROM atable a
JOIN atable b
ON a.manufacturer=b.manufacturer
AND a.model=b.model
AND a.firstname=b.firstname
WHERE a.submission_id<b.submission_id
;
Or using sorting and comparing adjacent rows....
SELECT *
FROM
(
SELECT @prev.submission_id AS prev_submission_id
, @prev.manufacturer AS prev_manufacturer
, @prev.model AS prev_model
, @prev.firstname AS pref_firstname
, a.submission_id
, a.manufacturer
, a.model
, set @prev.submission_id:=a.submission_id as currsid
, set @prev.manufacturer:=a.manufacturer as currman
, set @prev.model:=a.model as currmodel
, set @prev.firstname=a.forstname as currname
FROM atable
ORDER BY manufacturer, model, firstname, submission_id
)
WHERE prev_manufacturer=manufacturer
AND prev_model=model
AND prev_firstname=firstname
AND prev_submission_id<>submission_id;
So the solution is to simply make your data look like a normal relation....
SELECT ilv.values
, COUNT(ilv.submission_id)
, GROUP_CONCAT(ilv.submission_id)
FROM
(SELECT a.submission_id
, GROUP_CONCAT(CONCAT(a.key, '=',a.value)) AS values
FROM atable a
GROUP BY a.submission_id
) ilv
GROUP BY ilv.values
HAVING COUNT(ilv.submission_id)>1;
Hopefully the join and sequence based solutions should now be obvious.