How to get the next value in a separate partition in a SQL query?

Time:10-23

I have a single database table affiliations in the following format:

author_id  article_id  institution    publication_date
1          1           institution_1  2010-01-01
1          1           institution_2  2010-01-01
1          2           institution_2  2012-01-01
1          3           institution_2  2014-01-01
2          2           institution_3  2012-01-01
2          4           institution_3  2013-01-01

My goal is to infer periods of affiliation between an author and their institutions by combining publication dates across the author's consecutive articles that list the same institution, where "consecutive" means consecutive by publication date. For an author's most recent article, the author is assumed to still be affiliated with the institution(s) on that article. For the data above, for example, I'm looking to return something like the following:

author_id  institution    start_date  end_date
1          institution_1  2010-01-01  2012-01-01
1          institution_2  2010-01-01  <current_date>
2          institution_3  2012-01-01  <current_date>

Importantly, an author can be affiliated with more than one institution at the same time, i.e. affiliations can overlap for the same author.

I have tried various combinations of LEAD and PARTITION BY in SQL, but one issue is that I can't select the next value from a different partition (which I think I'll need in order to get the next publication date for a given author on a different article, for example).

Does anyone have any suggestions on how I might achieve the above in a fairly efficient manner (bearing in mind this will be part of a CTE in a larger query that involves millions of rows)?

CodePudding user response:

One way to solve it is to add a subgroup number that increments whenever the institution changes (CONDITIONAL_CHANGE_EVENT is a Vertica analytic function):

WITH cte AS (
    SELECT *,
           -- subgrp increases by 1 each time institution differs from the previous row
           CONDITIONAL_CHANGE_EVENT(institution) OVER (PARTITION BY author_id
                                                       ORDER BY publication_date) AS subgrp
    FROM affiliations
)
SELECT author_id, institution,
       MIN(publication_date) AS start_date,
       MAX(publication_date) AS end_date
FROM cte
GROUP BY author_id, institution, subgrp
ORDER BY author_id, start_date;

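For engines without CONDITIONAL_CHANGE_EVENT, the same subgrouping idea can be approximated with LAG plus a running SUM. The following is only a sketch, run through Python's built-in sqlite3 module so it can be tried on the question's sample data; it assumes a SQLite build with window functions (3.25 or later). The names `ROWS`, `affiliation_runs`, `flagged`, and `changed` are made up for illustration, and the extra `institution` tie-breaker in the ORDER BY is an addition to make rows with the same publication date sort deterministically.

```python
import sqlite3

# Sample rows from the question: (author_id, article_id, institution, publication_date)
ROWS = [
    (1, 1, "institution_1", "2010-01-01"),
    (1, 1, "institution_2", "2010-01-01"),
    (1, 2, "institution_2", "2012-01-01"),
    (1, 3, "institution_2", "2014-01-01"),
    (2, 2, "institution_3", "2012-01-01"),
    (2, 4, "institution_3", "2013-01-01"),
]

# LAG flags each row whose institution differs from the previous row of the
# same author; the running SUM of those flags plays the role of subgrp.
QUERY = """
WITH flagged AS (
    SELECT *,
           CASE WHEN institution <> LAG(institution) OVER (
                         PARTITION BY author_id
                         ORDER BY publication_date, institution)
                THEN 1 ELSE 0 END AS changed
    FROM affiliations
),
cte AS (
    SELECT *,
           SUM(changed) OVER (
               PARTITION BY author_id
               ORDER BY publication_date, institution
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS subgrp
    FROM flagged
)
SELECT author_id, institution,
       MIN(publication_date) AS start_date,
       MAX(publication_date) AS end_date
FROM cte
GROUP BY author_id, institution, subgrp
ORDER BY author_id, start_date, institution
"""


def affiliation_runs(rows):
    """Load rows into an in-memory table and return one row per institution run."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE affiliations ("
        "author_id INTEGER, article_id INTEGER, "
        "institution TEXT, publication_date TEXT)"
    )
    conn.executemany("INSERT INTO affiliations VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in affiliation_runs(ROWS):
        print(row)
```

Note that in this approach end_date is the last publication date inside each run, not the date of the author's next article and not the current date, so it would still need a further step to match the output requested in the question.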

CodePudding user response:

Try the following left join query:

SELECT D.author_id, D.institution, D.start_date, 
       COALESCE(MIN(T.publication_date), CURRENT_DATE) AS end_date
FROM
  (
    SELECT author_id, institution, MIN(publication_date) AS start_date
    FROM affiliations
    GROUP BY author_id, institution
  ) D
LEFT JOIN affiliations T
ON D.author_id = T.author_id AND D.institution <> T.institution
   AND D.start_date < T.publication_date
GROUP BY D.author_id, D.institution, D.start_date
ORDER BY D.author_id, D.institution

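To check the left join query against the sample data from the question, it can be run in an in-memory SQLite database. This is a rough sketch rather than a benchmark: it says nothing about performance on millions of rows. The helper name `affiliation_periods` and the `:today` parameter are assumptions introduced here; `:today` stands in for CURRENT_DATE so the result does not depend on the day the script is run.

```python
import sqlite3

# Sample rows from the question: (author_id, article_id, institution, publication_date)
ROWS = [
    (1, 1, "institution_1", "2010-01-01"),
    (1, 1, "institution_2", "2010-01-01"),
    (1, 2, "institution_2", "2012-01-01"),
    (1, 3, "institution_2", "2014-01-01"),
    (2, 2, "institution_3", "2012-01-01"),
    (2, 4, "institution_3", "2013-01-01"),
]

# Same shape as the left join query above, with :today in place of CURRENT_DATE.
QUERY = """
SELECT D.author_id, D.institution, D.start_date,
       COALESCE(MIN(T.publication_date), :today) AS end_date
FROM (
    SELECT author_id, institution, MIN(publication_date) AS start_date
    FROM affiliations
    GROUP BY author_id, institution
) D
LEFT JOIN affiliations T
  ON D.author_id = T.author_id
 AND D.institution <> T.institution
 AND D.start_date < T.publication_date
GROUP BY D.author_id, D.institution, D.start_date
ORDER BY D.author_id, D.institution
"""


def affiliation_periods(rows, today):
    """Return (author_id, institution, start_date, end_date) tuples.

    `today` is an ISO date string used as the end date when no later
    article of the same author lists a different institution.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE affiliations ("
        "author_id INTEGER, article_id INTEGER, "
        "institution TEXT, publication_date TEXT)"
    )
    conn.executemany("INSERT INTO affiliations VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY, {"today": today}).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in affiliation_periods(ROWS, "2024-01-01"):
        print(row)
```

With the sample rows, institution_1 for author 1 ends at 2012-01-01 (the first later article listing a different institution), while the other two affiliations stay open and receive the `today` value.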
