I have a single database table, affiliations, in the following format:
author_id | article_id | institution | publication_date |
---|---|---|---|
1 | 1 | institution_1 | 2010-01-01 |
1 | 1 | institution_2 | 2010-01-01 |
1 | 2 | institution_2 | 2012-01-01 |
1 | 3 | institution_2 | 2014-01-01 |
2 | 2 | institution_3 | 2012-01-01 |
2 | 4 | institution_3 | 2013-01-01 |
My goal is to infer periods of affiliation between an author and their institutions by combining publication dates across the author's consecutive articles that list the same institution, where "consecutive" means consecutive by publication date. As the example below shows, an affiliation ends at the publication date of the author's next article that no longer lists the institution. If an institution appears on an author's most recent article, the author is assumed to still be affiliated with it. For the data above, I'm looking to return something like the following:
author_id | institution | start_date | end_date |
---|---|---|---|
1 | institution_1 | 2010-01-01 | 2012-01-01 |
1 | institution_2 | 2010-01-01 | <current_date> |
2 | institution_3 | 2012-01-01 | <current_date> |
Importantly, an author can be affiliated with more than one institution at the same time, i.e. affiliations can overlap for the same author.
I have tried various combinations of LEAD and PARTITION BY in SQL, but one issue is that LEAD cannot select a value from a different partition, which I think I need in order to get the next publication date for a given author on a different article.
Does anyone have any suggestions on how I might achieve the above in a fairly efficient manner (bearing in mind this will be part of a CTE in a larger query that involves millions of rows)?
CodePudding user response:
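One way to solve this is to add a subgroup number that increments each time the institution changes. CONDITIONAL_CHANGE_EVENT is a Vertica analytic function, so this assumes the query runs on Vertica: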
WITH cte AS (
    -- subgrp increments whenever institution differs from the previous row
    SELECT *,
           CONDITIONAL_CHANGE_EVENT(institution) OVER (PARTITION BY author_id
                                                       ORDER BY publication_date) AS subgrp
    FROM affiliations
)
SELECT author_id, institution,
       MIN(publication_date) AS start_date,
       MAX(publication_date) AS end_date
FROM cte
GROUP BY author_id, institution, subgrp
ORDER BY author_id, start_date;
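If your engine has no CONDITIONAL_CHANGE_EVENT, the same subgroup numbering can probably be emulated with LAG plus a running SUM. The following is only a sketch under a few assumptions of mine: it runs the SQL through Python's built-in sqlite3 module (which needs SQLite 3.25+ for window functions), it loads the sample rows from the question, and it adds institution as a tie-breaker in the ORDER BY so that rows sharing a publication date are numbered deterministically. The names ROWS, SUBGROUP_SQL and subgroup_periods are made up for the example.

```python
import sqlite3

# Sample rows from the question: (author_id, article_id, institution, publication_date)
ROWS = [
    (1, 1, "institution_1", "2010-01-01"),
    (1, 1, "institution_2", "2010-01-01"),
    (1, 2, "institution_2", "2012-01-01"),
    (1, 3, "institution_2", "2014-01-01"),
    (2, 2, "institution_3", "2012-01-01"),
    (2, 4, "institution_3", "2013-01-01"),
]

# LAG flags each row whose institution differs from the previous row of the
# same author; the running SUM of those flags plays the role of subgrp.
# The extra "institution" sort key is an assumption to break date ties.
SUBGROUP_SQL = """
WITH flagged AS (
    SELECT author_id, institution, publication_date,
           CASE WHEN institution <> LAG(institution) OVER (
                         PARTITION BY author_id
                         ORDER BY publication_date, institution)
                THEN 1 ELSE 0 END AS changed
    FROM affiliations
),
cte AS (
    SELECT author_id, institution, publication_date,
           SUM(changed) OVER (PARTITION BY author_id
                              ORDER BY publication_date, institution
                              ROWS UNBOUNDED PRECEDING) AS subgrp
    FROM flagged
)
SELECT author_id, institution, subgrp,
       MIN(publication_date) AS start_date,
       MAX(publication_date) AS end_date
FROM cte
GROUP BY author_id, institution, subgrp
ORDER BY author_id, start_date, institution
"""


def subgroup_periods(rows):
    """Load rows into an in-memory table and return the grouped periods."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE affiliations ("
        "author_id INTEGER, article_id INTEGER, "
        "institution TEXT, publication_date TEXT)"
    )
    conn.executemany("INSERT INTO affiliations VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(SUBGROUP_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in subgroup_periods(ROWS):
        print(row)
```

Note that with this emulation the end_date is the last publication date that carries the institution (for example 2010-01-01 for institution_1), not the date of the author's next article or the current date, so it does not yet produce the end dates asked for in the question.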
CodePudding user response:
Try the following left join query:
SELECT D.author_id, D.institution, D.start_date,
       COALESCE(MIN(T.publication_date), CURRENT_DATE) AS end_date
FROM
(
    -- first publication date per author and institution
    SELECT author_id, institution, MIN(publication_date) AS start_date
    FROM affiliations
    GROUP BY author_id, institution
) D
LEFT JOIN affiliations T
    ON D.author_id = T.author_id
   AND D.institution <> T.institution
   AND D.start_date < T.publication_date
GROUP BY D.author_id, D.institution, D.start_date
ORDER BY D.author_id, D.institution;
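For the sample input, the left join query returns the three rows requested in the question, with CURRENT_DATE as the end date of the affiliations that are still open.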
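One caveat on the left join query: it closes an affiliation at the first later article that lists any different institution, even if the original institution is still on that article, and it produces a single period per institution even when an author leaves and later returns. If either case matters, a gaps-and-islands formulation may be a better fit. The sketch below is one possible way to write it, not a tuned solution. It assumes SQLite 3.25+ via Python's sqlite3 module, treats all of an author's articles published on the same date as one step, and takes the "current date" as a parameter (:today) purely so the result is reproducible; in a real query you would likely use CURRENT_DATE. The names PERIODS_SQL and affiliation_periods are invented for the example.

```python
import sqlite3

# Sample rows from the question: (author_id, article_id, institution, publication_date)
ROWS = [
    (1, 1, "institution_1", "2010-01-01"),
    (1, 1, "institution_2", "2010-01-01"),
    (1, 2, "institution_2", "2012-01-01"),
    (1, 3, "institution_2", "2014-01-01"),
    (2, 2, "institution_3", "2012-01-01"),
    (2, 4, "institution_3", "2013-01-01"),
]

PERIODS_SQL = """
WITH author_dates AS (
    -- one row per author and distinct publication date, numbered in date
    -- order, together with the author's next publication date
    SELECT author_id, publication_date,
           ROW_NUMBER() OVER (PARTITION BY author_id
                              ORDER BY publication_date) AS seq,
           LEAD(publication_date) OVER (PARTITION BY author_id
                                        ORDER BY publication_date) AS next_date
    FROM (SELECT DISTINCT author_id, publication_date FROM affiliations) AS x
),
tagged AS (
    -- seq minus a per-institution row number is constant within a run of
    -- consecutive dates that all list the institution (an "island")
    SELECT a.author_id, a.institution, d.publication_date, d.next_date,
           d.seq - ROW_NUMBER() OVER (PARTITION BY a.author_id, a.institution
                                      ORDER BY d.seq) AS island
    FROM (SELECT DISTINCT author_id, institution, publication_date
          FROM affiliations) AS a
    JOIN author_dates AS d
      ON d.author_id = a.author_id
     AND d.publication_date = a.publication_date
)
SELECT author_id, institution,
       MIN(publication_date) AS start_date,
       -- a NULL next_date means the island reaches the author's latest
       -- article, so the affiliation is treated as still open
       CASE WHEN COUNT(*) = COUNT(next_date)
            THEN MAX(next_date) ELSE :today END AS end_date
FROM tagged
GROUP BY author_id, institution, island
ORDER BY author_id, institution, start_date
"""


def affiliation_periods(rows, today):
    """Return (author_id, institution, start_date, end_date) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE affiliations ("
        "author_id INTEGER, article_id INTEGER, "
        "institution TEXT, publication_date TEXT)"
    )
    conn.executemany("INSERT INTO affiliations VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(PERIODS_SQL, {"today": today}).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in affiliation_periods(ROWS, "2024-06-01"):
        print(row)
```

The join back to the per-author date list is what gets around the "next value in a different partition" problem from the question: the next publication date is computed once per author, before the rows are split by institution.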