Home > Software design >  SQL to select the 'first' date a project was made inactive for all projects
SQL to select the 'first' date a project was made inactive for all projects

Time:08-05

I am trying to work out the SQL I would need to select certain records, here is an example of what I'm trying to do:

Project number Active/Inactive Date
1 A 1/1/20
1 I 3/1/20
1 A 5/1/20
1 I 7/1/20
1 I 9/1/20
2 I 1/1/19
2 A 5/1/19
3 A 1/3/20
3 I 3/3/20
3 I 5/3/20

Note: A=Active project, I=Inactive.

What I would like to do is for each project where the project is currently inactive (i.e. the latest date for the project in the above table is set to I), return the row of the longest time ago it was made inactive, but NOT before it was last active (hope this is understandable!). So for the above table the following would be returned:

Project number Active/Inactive Date
1 I 7/1/20
3 I 3/3/20

So proj number 1 is inactive and the earliest time it was made inactive (after the last time it was active) is 7/1/20. Project 2 is not selected as it is currently active. Project 3 is inactive and the earliest time it was made inactive (after the last time it was active) is 3/3/20.

Thanks.

CodePudding user response:

You could use the 'row_number' function to help you.

create TABLE #PROJECT(ProjectNumber int, [Status] varcha(1), [Date] date)

INSERT INTO #PROJECT VALUES
(1  ,'A'    ,'1/1/20'),
(1  ,'I'    ,'3/1/20'),
(1  ,'A'    ,'5/1/20'),
(1  ,'I'    ,'7/1/20'),
(1  ,'I'    ,'9/1/20'),
(2  ,'I'    ,'1/1/19'),
(2  ,'A'    ,'5/1/19'),
(3  ,'A'    ,'1/3/20'),
(3  ,'I'    ,'3/3/20'),
(3  ,'I'    ,'5/3/20')

select * from 

(SELECT 
    row_number() over (partition by projectNumber order by [date]) as [index]
    ,* 
FROM 
    #PROJECT 

WHERE 
    [STATUS] = 'I'
) as a where [index] = 1

CodePudding user response:

Using some effective date joins, this should work. I am using SQL Server. Create your tables and set up the same data set you provided:

CREATE TABLE dbo.PROJECTS
(
PROJ_NUM int NULL,
STTS char(1) NULL,
STTS_DT date NULL
)  ON [PRIMARY]
GO

INSERT INTO dbo.PROJECTS values (1, 'A', '1/1/20');
INSERT INTO dbo.PROJECTS values (1, 'I', '3/1/20');
INSERT INTO dbo.PROJECTS values (1, 'A', '5/1/20');
INSERT INTO dbo.PROJECTS values (1, 'I', '7/1/20');
INSERT INTO dbo.PROJECTS values (1, 'I', '9/1/20');
INSERT INTO dbo.PROJECTS values (2, 'I', '1/1/19');
INSERT INTO dbo.PROJECTS values (2, 'A', '5/1/19');
INSERT INTO dbo.PROJECTS values (3, 'A', '1/3/20');
INSERT INTO dbo.PROJECTS values (3, 'I', '3/3/20');
INSERT INTO dbo.PROJECTS values (3, 'I', '5/3/20');

Write a sub-query that filters out just to the projects that are INACTIVE:

-- sub-query that gives you projects that are inactive
SELECT PROJ_NUM, STTS, STTS_DT FROM dbo.PROJECTS CURRSTTS
WHERE STTS_DT = (SELECT MAX(STTS_DT) FROM dbo.PROJECTS ALLP WHERE ALLP.PROJ_NUM = CURRSTTS.PROJ_NUM)
    AND CURRSTTS.STTS = 'I'
;

Write another sub-query that provides you the last active status date for each project:

-- sub-query that gives you last active status date for each project
SELECT PROJ_NUM, STTS, STTS_DT FROM dbo.PROJECTS LASTACTV
WHERE STTS_DT = (SELECT MAX(STTS_DT) FROM dbo.PROJECTS ALLP WHERE ALLP.PROJ_NUM = LASTACTV.PROJ_NUM AND ALLP.STTS = 'A')
;

Combine those two sub-queries into a query that gives you the list of inactive projects with their last active status date:

-- sub-query using the 2 above to show only inactive projects with last active stts date
SELECT CURRSTTS.PROJ_NUM, CURRSTTS.STTS, CURRSTTS.STTS_DT, LASTACTV.STTS_DT AS LASTACTV_STTS_DT FROM dbo.PROJECTS CURRSTTS
INNER JOIN 
    (SELECT PROJ_NUM, STTS, STTS_DT FROM dbo.PROJECTS LASTACTV
     WHERE STTS_DT = (SELECT MAX(STTS_DT) FROM dbo.PROJECTS ALLP WHERE ALLP.PROJ_NUM = LASTACTV.PROJ_NUM AND ALLP.STTS = 'A'))
     LASTACTV ON CURRSTTS.PROJ_NUM = LASTACTV.PROJ_NUM
WHERE CURRSTTS.STTS_DT = (SELECT MAX(STTS_DT) FROM dbo.PROJECTS ALLP WHERE ALLP.PROJ_NUM = CURRSTTS.PROJ_NUM)
    AND CURRSTTS.STTS = 'I'

Add one more layer to the query that selects the MIN(STTS_DT) that is greater than the LASTACTV_STTS_DT:

-- final query that uses above sub-query
SELECT P.PROJ_NUM, P.STTS, P.STTS_DT
FROM dbo.PROJECTS P
INNER JOIN (
    SELECT CURRSTTS.PROJ_NUM, CURRSTTS.STTS, CURRSTTS.STTS_DT, LASTACTV.STTS_DT AS LASTACTV_STTS_DT FROM dbo.PROJECTS CURRSTTS
    INNER JOIN 
        (SELECT PROJ_NUM, STTS, STTS_DT FROM dbo.PROJECTS LASTACTV
         WHERE STTS_DT = (SELECT MAX(STTS_DT) FROM dbo.PROJECTS ALLP WHERE ALLP.PROJ_NUM = LASTACTV.PROJ_NUM AND ALLP.STTS = 'A'))
         LASTACTV ON CURRSTTS.PROJ_NUM = LASTACTV.PROJ_NUM
    WHERE CURRSTTS.STTS_DT = (SELECT MAX(STTS_DT) FROM dbo.PROJECTS ALLP WHERE ALLP.PROJ_NUM = CURRSTTS.PROJ_NUM)
        AND CURRSTTS.STTS = 'I'
) SUB ON SUB.PROJ_NUM = P.PROJ_NUM
WHERE P.STTS_DT = (SELECT MIN(STTS_DT) FROM dbo.PROJECTS ALLP WHERE ALLP.PROJ_NUM = P.PROJ_NUM AND ALLP.STTS_DT > SUB.LASTACTV_STTS_DT)

The result I get back matches your desired result:

enter image description here

CodePudding user response:

"Greatest n-per group" is the thing to look up when you run accross a problem like this again. Here is a query that will get what you need in postgresSQL.

I realized I changed your column to a boolean, but you will get the gist.

with most_recent_projects as (
    select project_number, max(date) date from testtable group by project_number
),
currently_inactive_projects as (
    select t.project_number, t.date from testtable t join most_recent_projects mrp on t.project_number = mrp.project_number and t.date = mrp.date where not t.active
),
last_active_date as (
    select project_number, date from (
    select t.project_number, rank() OVER (
          PARTITION BY t.project_number
          ORDER BY t.date DESC), t.date
     from currently_inactive_projects cip join testtable t on t.project_number = cip.project_number  where t.active) t1 where rank = 1
)
-- oldest inactive -- ie, result
select t.project_number, t.active, min(t.date) from last_active_date lad join testtable t on lad.project_number = t.project_number and t.date > lad.date group by t.project_number, t.active;

CodePudding user response:

This is a variation of "gaps and islands" problem.

The query may be like this

SELECT 
    num,
    status,
    MIN(date) AS date
FROM (  
    SELECT
        *,
        MAX(group_id) OVER (PARTITION BY num) AS max_group_id
    FROM (    
        SELECT 
            *,
            SUM(CASE WHEN status = prev_status THEN 0 ELSE 1 END) OVER (PARTITION BY num ORDER BY date) AS group_id
        FROM (   
            SELECT
                *,
                LAG(status) OVER (PARTITION BY num ORDER BY date) AS prev_status
            FROM projects
        ) groups
    ) islands
) q
WHERE status = 'I' AND group_id = max_group_id
GROUP BY num, status
ORDER BY num

Another approach using CTEs

WITH last_status AS (    
    SELECT
        *
    FROM (
        SELECT 
            *,
            ROW_NUMBER() OVER (PARTITION BY num ORDER BY date DESC) AS rn
        FROM projects        
    ) rns
    WHERE rn = 1
),
last_active AS (    
    SELECT
        num,
        MAX(date) AS date
    FROM projects
    WHERE status = 'A'
    GROUP BY num
),
last_inactive AS (
    SELECT 
        p.num,
        MIN(p.date) AS date
    FROM projects p
    WHERE p.status = 'I' 
          AND (
              EXISTS (
                  SELECT 1 FROM last_active la
                  WHERE la.num = p.num AND la.date < p.date
              )
              OR NOT EXISTS (
                  SELECT 1 FROM last_active la
                  WHERE la.num = p.num
              )
          )
    GROUP BY num
)
SELECT
    ls.num,
    ls.status,
    li.date
FROM last_status ls
JOIN last_inactive li ON li.num = ls.num
WHERE ls.status = 'I'

You can check a working demo with both queries here

  •  Tags:  
  • sql
  • Related