How to query first 5 projects for each person-CodePudding

I have a dataset of around 50k records. The important fields are "date", "name" and "project ID". My challenge is I need to pull the first 5 projects for each person. There are around 500 people in the dataset. I know how to get the earilest projects, or the earliest projects for a specific individual, but I don't know how to get all the data for the earliest projects for all individuals. Any suggestions?

My desired output would be something like this:

NAME, DATE, PROJECT_ID
sally, 12/2/2020, 12432
sally, 12/6/2020, 13433
sally, 12/16/2020, 14453
sally, 1/8/2021, 14468
sally, 2/6/2021, 15496
jimmy, 11/15/2020, 24531
jimmy, 3/21/2021, 42322
jimmy, 5/13/2021, 44332

...

I'm using SQL in Google Big Query. I haven't gotten far because the logic is baffling me. And as you may be able to tell, my sql skills are elementary at best - but I'm trying! :)

CodePudding user response：

You can use the ROW_NUMBER function with the PARTITION BY clause as a subquery, like so, and get only records with a row number <= 5 as it's ordering by date, this will get the first 5 records for each user, if you needed the 5 most recent records for each user, you would do ORDER BY DATE DESC, yourtable would be the name of your actual database table:

--Partitioning by Name is not consistent if more than one user
--shares the same name, if you have a user id which you can use to
--partition it would be better, but the following will get you moving
--in the right direction.
SELECT NAME, DATE, PROJECT_ID
FROM
(
    SELECT NAME, DATE, PROJECT_ID, ROW_NUMBER ()
    OVER (PARTITION BY Name ORDER BY DATE) AS rownum
    FROM yourtable
) 
where rownum <= 5

More information on the ROW_NUMBER function: bigquery-row-number-function