Home > Software design >  Grabbing MySQL counts from loosely related tables with complex structure
Grabbing MySQL counts from loosely related tables with complex structure

Time:02-10

MySQL here. I have the following data model in a Q&A database:

[questions]
===
question_id : PK           # ex: 12345
question_value : VARCHAR   # ex: 'What is your favorite food?'

[answers]
===
answer_id : PK              # ex: 23456
question_id : FK            # points to a questions.question_id
answer_value : VARCHAR      # ex: 'Pizza', 'French Fries', etc.

[user_answers]
===
user_answer_id : PK         # ex: 34567
question_id : FK            # points to a questions.question_id
user_answer_value : VARCHAR # ex: 'Pizza', 'French Fries', etc.

...where [questions] is a table containing all the questions I will ask users, [answers] is a table containing all the possible answers for each question, and [user_answers] are the actual answers users give for a particular question. So the application uses [questions] and [answers] to present a user with a question & possible answers, then the user selects one of those answers, and that result gets written to the [user_answers] table.

One very important thing to note is that the application will guarantee that there are never any "orphaned" [user_answers], meaning, there will always be an [answers] record whose [answer_value] matches a [user_answers.user_answer_value]. So if there is a [user_answers.user_answer_value] in the database under a particular question_id, there will always be a matching [answers.answer_value]. Hence it is safe to match to the two "value fields" against each other in a query and not lose any data.

I realize this is not a great data model. But it is critical to point out that I simply cannot change it; its locked in and beyond my ability to make changes to, I'm simply stuck with it. And so unfortunately as much as I will want to, I cannot accept any answers that require changes to the data model.

Now then, I am trying to write a SQL query that -- for a given question_id -- returns the following columns:

  1. the answers.answer_id as id
  2. the answers.answer_value as answer
  3. a count of how many user_answers existed for that particular answers.answer_id, using the value fields as a way of determining the match (i.e. "'Pizza' == 'Pizza'"), as popularity

So using the "favorite food" example from above, say that was the only question in the database, and the three possible answers were "Pizza", "French Fries" and "Cheeseburgers". Say 50 users had answered "Pizza", 65 users had answered "French Fries", and 12 users had answered "Cheeseburgers". Then the query would return results like so:

id      |     answer      |   popularity
==========================================
123     |   Pizza         |   50
456     |   French Fries  |   65
789     |   Cheeseburgers |   12

Important to note: if an answer has never been selected by any user yet, then its count should be zero (0).

My best attempt so far at this query is:

SELECT
  a.answer_id as id,
  a.answer_value as answer,
  count(ua.answer_id) as popularity
FROM
  answers a
INNER JOIN
  user_answers ua
ON
  a.question_id = ua.question_id
WHERE
  a.question_id = ?   # provided by application
  AND
  a.answer_value = ua.answer_value

However this doesn't look right to me. For one I don't believe an INNER JOIN is appropriate. I also believe I need to do a GROUP BY but I'm unsure of which table/column to group the count by.

Its important to note that in reality there will be hundreds of questions in this DB, each with 4 - 8 answers, and there will be hundreds (possibly thousands) of users providing answers, so its important that this query (which provides popularity counts for each question's answer) only include counts for user answers matching answers that are tied to the same "parent" question ID. Can anyone spot where I'm going awry?

CodePudding user response:

I guess it is as simple as:

SELECT answers.answer_id, answers.answer_value, COUNT(user_answers.user_answer_id)
FROM answers
LEFT JOIN user_answers ON answers.answer_value = user_answers.user_answer_value
WHERE answers.question_id = ?
GROUP BY answers.answer_id, answers.answer_value

The LEFT JOIN ensures that zero counts are included.

PS: The database design has some issues. I would rather change it to user_answers(user_answer_id, answer_id).

CodePudding user response:

This will work for your requirement

select a.answer_id as id,a.answer_value as answer,count(user_answer_value) as popularity from user_answers ua 
join answers a on a.answer_id = ua.user_answer_value where a.question_id= 1 
group by a.answer_id, a.answer_value order by count(user_answer_value) desc
  • Related