I have a dataset of around 50k records. The important fields are "date", "name" and "project ID". My challenge is I need to pull the first 5 projects for each person. There are around 500 people in the dataset. I know how to get the earilest projects, or the earliest projects for a specific individual, but I don't know how to get all the data for the earliest projects for all individuals. Any suggestions?
My desired output would be something like this:
NAME, DATE, PROJECT_ID
sally, 12/2/2020, 12432
sally, 12/6/2020, 13433
sally, 12/16/2020, 14453
sally, 1/8/2021, 14468
sally, 2/6/2021, 15496
jimmy, 11/15/2020, 24531
jimmy, 3/21/2021, 42322
jimmy, 5/13/2021, 44332
...
I'm using SQL in Google Big Query. I haven't gotten far because the logic is baffling me. And as you may be able to tell, my sql skills are elementary at best - but I'm trying! :)
CodePudding user response:
You can use the ROW_NUMBER
function with the PARTITION BY
clause as a subquery, like so, and get only records with a row number <= 5 as it's ordering by date, this will get the first 5 records for each user, if you needed the 5 most recent records for each user, you would do ORDER BY DATE DESC
, yourtable
would be the name of your actual database table:
--Partitioning by Name is not consistent if more than one user
--shares the same name, if you have a user id which you can use to
--partition it would be better, but the following will get you moving
--in the right direction.
SELECT NAME, DATE, PROJECT_ID
FROM
(
SELECT NAME, DATE, PROJECT_ID, ROW_NUMBER ()
OVER (PARTITION BY Name ORDER BY DATE) AS rownum
FROM yourtable
)
where rownum <= 5
More information on the ROW_NUMBER
function: bigquery-row-number-function