Home > Enterprise >  Filtering out early entries meeting a condition on a per user basis
Filtering out early entries meeting a condition on a per user basis

Time:12-24

I am new to SQL and want to filter a database in a way that doesn’t quite map to any of the examples I’ve read. I’m using mySQL with MariaDB.

The table is game result reports that takes the following structure:

  • id (unique integer id for each report, primary key)
  • user_id (integer id allocated to each player)
  • day (integer, the identity of the daily puzzle being reported)
  • result (integer score)
  • submitted (timestamp of the report submission)

The day puzzle ticks over at midnight local time so two reports can be submitted at the same time but be from different days.

I want to be able to find the average score reported for each day BUT I want to exclude any user’s score if they’ve never got a particular score prior to that (to eliminate complete newbies and people who are unusually bad at the game). However I can’t work out how to layer in this exclusion.

Where I’ve got to is that I can get the earliest successful game for each user with this query:

SELECT id, user_id, MIN(submitted), result FROM ‘results_daily’ WHERE result > 5 GROUP BY user_id

(Let’s call that output1)

However I can’t see how to apply this to a set of day results as a filter (so only include a result in my average calculation where a user features in output1 AND their daily report has been submitted on or after the date for that user in output1).

It feels like it might be some kind of JOIN operation but I can’t wrap my head around it. Can anyone help?

** EDITED TO ADD:

Ok I think I've got it, although my solution uses a function and I'm not sure if that's the most efficient way to do this or the most SQL-y way. Instinctively it feels like this should be possible without a function but I'm definitely not practiced enough to work it out if it is! O. Jones's answer set me off down the right path I just needed to refine the excluded set with a function. So now my query looks like this:

SELECT day, 
       AVG(result) average_score,
       COUNT(*) number_of_plays,
       COUNT(DISTINCT user_id) number_of_non_n00b_players
  FROM results_daily
 WHERE user_id NOT IN (
             SELECT user_id 
               FROM results_daily
                WHERE submitted < GetEureka(user_id)
              GROUP BY user_id )           
 GROUP BY day; 

and my function GetEureka() looks like this:

DECLARE eureka TIMESTAMP DEFAULT CURRENT_TIMESTAMP;

SELECT
    MIN(submitted) INTO eureka
FROM
    results_daily
WHERE
    user_id = user
    AND
    result >= 5
GROUP BY
    user_id;
RETURN eureka;

CodePudding user response:

In SQL think about sets. You need the set of user_id values of all n00bz meeting either of these two criteria:

  • No score greater than 5.
  • Only one play of the game.

Then, when you compute your averages you exclude rows with those user ids.

So let's get you the n00bz, with the magic of the GROUP BY and HAVING clauses.

       SELECT user_id 
         FROM results_daily
        GROUP BY user_id 
       HAVING COUNT(*) = 1 OR MAX(result) <= 5

Now we can run your stats.

SELECT day, 
       AVG(score) average_score,
       COUNT(*) number_of_plays,
       COUNT(DISTINCT user_id) number_of_non_n00b_players
  FROM results_daily
 WHERE user_id NOT IN (
             SELECT user_id 
               FROM results_daily
              GROUP BY user_id 
             HAVING COUNT(*) = 1 OR MAX(result) <= 5 )           
 GROUP BY day; 

The structured in stuctured query language comes from this use of nested subqueries to define sets of rows and work with them.

  • Related