How to calculate occurrence depending on months/years-CodePudding

My table looks like that:

ID | Start      | End
1  | 2010-01-02 | 2010-01-04
1  | 2010-01-22 | 2010-01-24
1  | 2011-01-31 | 2011-02-02
2  | 2012-05-02 | 2012-05-08
3  | 2013-01-02 | 2013-01-03
4  | 2010-09-15 | 2010-09-20
4  | 2010-09-30 | 2010-10-05

I'm looking for a way to count the number of occurrences for each ID in a Year per Month. But what is important, If some record has a Start date in the following month compared to the End date (of course from the same year) then occurrence should be counted for both months [e.g. ID 1 in the 3rd row has a situation like that. So in this situation, the occurrence for this ID should be 1 for January and 1 for February].

So I'd like to have it in this way:

Year  | Month | Id | Occurrence
2010  | 01    | 1  | 2
2010  | 09    | 4  | 2
2010  | 10    | 4  | 1
2011  | 01    | 1  | 1
2011  | 02    | 1  | 1
2012  | 05    | 2  | 1
2013  | 01    | 3  | 1

I created only this for now...

    CREATE TABLE IF NOT EXISTS counts AS
    (SELECT 
    id, 
    YEAR (CAST(Start AS DATE)) AS Year_St,
    MONTH (CAST(Start AS DATE)) AS Month_St,
    YEAR (CAST(End AS DATE)) AS Year_End,
    MONTH (CAST(End AS DATE)) AS Month_End
    FROM source)

And I don't know how to move with that further. I'd appreciate your help. I'm using Spark SQL.

CodePudding user response：

Try the following strategy to achieve this:

Note:

I have created few intermediate tables. If you wish you can use sub-query or CTE depending on the permissions
I have taken care of 2 scenarios you mentioned (whether to count it as 1 occurrence or 2 occurrence) as you explained

Query:

Firstly, creating a table with flags to decide whether start and end date are falling on same year and month (1 means YES, 2 means NO):

/* Creating a table with flags whether to count the occurrences once or twice */
CREATE TABLE flagged as 
(
  SELECT *, 
  CASE
      WHEN Year_st = Year_end and Month_st = Month_end then 1
      WHEN Year_st = Year_end and Month_st <> Month_end then 2
      Else 0
  end as flag
  FROM
   (
    SELECT 
     id, 
     YEAR (CAST(Start AS DATE)) AS Year_St,
     MONTH (CAST(Start AS DATE)) AS Month_St,
     YEAR (CAST(End AS DATE)) AS Year_End,
     MONTH (CAST(End AS DATE)) AS Month_End
     FROM source
   ) as calc
)

Now the flag in the above table will have 1 if year and month are same for start and end 2 if month differs. You can have more categories of flag if you have more scenarios.

Secondly, counting the occurrences for flag 1. As we know year and month are same for flag 1, we can take either of it. I have taken start:

/* Counting occurrences only for flag 1 */

CREATE TABLE flg1 as (
SELECT distinct id, year_st, month_st, count(*) as occurrence
FROM flagged
where flag=1
GROUP BY id, year_st, month_st
)

Similarly, counting the occurrences for flag 2. Since month differs for both the dates, we can UNION them before counting to get both the dates in same column:

/* Counting occurrences only for flag 2 */

CREATE TABLE flg2 as 
(
 SELECT distinct id, year_dt, month_dt, count(*) as occurrence
 FROM 
  (
  select ID, year_st as year_dt, month_st as month_dt FROM flagged where flag=2
  UNION
  SELECT ID, year_end as year_dt, month_end as month_dt FROM flagged where flag=2
  ) as unioned
 GROUP BY id, year_dt, month_dt
)

Finally, we just have to SUM the occurrences from both the flags. Note that we use UNION ALL here to combine both the tables. This is very important because we need to count duplicates as well:

/* UNIONING both the final tables and summing the occurrences */

SELECT distinct year, month, id, SUM(occurrence) as occurrence
FROM
 (
  SELECT distinct id, year_st as year, month_st as month, occurrence
  FROM flg1
  
  UNION ALL
  
  SELECT distinct id, year_dt as year, month_dt as month, occurrence
  FROM flg2
 ) as fin_unioned

GROUP BY id, year, month
ORDER BY year, month, id, occurrence desc

Output of above query will be your expected output. I know this is not an optimized one, yet it works perfect. I will update if I come across optimized strategy. Comment if you have question.

db<>fiddle link here

CodePudding user response：

Not sure if this works in Spark SQL.

But if the ranges aren't bigger than 1 month, then just add the extra (different month) to the count.

SELECT YearOcc, MonthOcc, Id
, COUNT(*) as Occurrence
FROM 
(
  SELECT Id
  , YEAR(CAST(Start AS DATE)) as YearOcc
  , MONTH(CAST(Start AS DATE)) as MonthOcc
  FROM source
  
  UNION ALL

  SELECT Id
  , YEAR(CAST(End AS DATE)) as YearOcc
  , MONTH(CAST(End AS DATE)) as MonthOcc
  FROM source
  WHERE MONTH(CAST(Start AS DATE)) < MONTH(CAST(End AS DATE))
) q
GROUP BY YearOcc, MonthOcc, Id
ORDER BY YearOcc, MonthOcc, Id