how to query multiple months in where clauses in hql?-CodePudding

SELECT *
FROM table_name
WHERE
    from_unixtime(end_time,'yyyyMMdd') >= '20211201'
and from_unixtime(start_time,'yyyyMMdd') <= '20211231'

I want to substitute '20211201' and '20211231' for every month between year 2020 to 2021, one way to do it is to substitute strings by hands and split it into multiple queries like this:

SELECT *
FROM table_name
WHERE
    from_unixtime(end_time,'yyyyMMdd') >= '20211201'
and from_unixtime(start_time,'yyyyMMdd') <= '20211231'

SELECT *
FROM table_name
WHERE
    from_unixtime(end_time,'yyyyMMdd') >= '20211101'
and from_unixtime(start_time,'yyyyMMdd') <= '20211130'

Is there any way I can put them in one query? I'm thinking a for loop or full join a column with

but I don‘t know a elegant way to generate a series of months between two dates either..

CodePudding user response：

WITH table_name AS (
 SELECT 0 user_id, unix_timestamp('2022-01-02 12:00:00') start_time, unix_timestamp('2022-01-04 12:00:00') end_time  UNION ALL
 SELECT 1 user_id, unix_timestamp('2022-01-04 12:00:00') start_time, unix_timestamp('2022-01-08 12:00:00') end_time  UNION ALL
 SELECT 2 user_id, unix_timestamp('2022-02-02 12:00:00') start_time, unix_timestamp('2022-04-04 12:00:00') end_time  UNION ALL
 SELECT 3 user_id, unix_timestamp('2022-02-04 12:00:00') start_time, unix_timestamp('2022-02-08 12:00:00') end_time  UNION ALL
 SELECT 3 user_id, unix_timestamp('2022-02-02 12:00:00') start_time, unix_timestamp('2022-02-04 12:00:00') end_time UNION ALL
 SELECT 4 user_id, unix_timestamp('2022-04-02 12:00:00') start_time, unix_timestamp('2022-05-05 12:00:00') end_time
),
explodes AS (
  SELECT *, date_format(add_months(from_unixtime(t.start_time), diff), 'yyyyMM') month
    FROM table_name t
 LATERAL VIEW EXPLODE (
      -- this hack is due to lack of supporting *generate_array* functions and non-equi joins in Hive
      -- you can increase below map based on the period seen on your data
      map(0, array(0), 1, array(0, 1), 2, array(0, 1, 2), 3, array(0, 1, 2, 3), 4, array(0, 1, 2, 3, 4))[
          int(months_between(from_unixtime(t.end_time), from_unixtime(t.start_time)))
      ]) tf AS diff
)
SELECT month, COUNT(DISTINCT user_id) mau FROM explodes GROUP BY month;

previous approach with more limitations

Assuming that

your data is big enough not to miss any months between period you're considering.
you can set hive option below for non-equi join

Cartesian products are disabled for safety reasons. If you know what you are doing, please sethive.strict.checks.cartesian.product to false and that hive.mapred.mode is not set to 'strict' to proceed.

WITH table_name AS (
  -- same as above
),
months AS (
  SELECT from_unixtime(start_time, 'yyyyMM') year_month FROM table_name
  UNION DISTINCT
  SELECT from_unixtime(end_time, 'yyyyMM') FROM table_name
)
SELECT m.year_month, COUNT(DISTINCT user_id) mau
  FROM months m LEFT JOIN table_name t ON m.year_month BETWEEN from_unixtime(t.start_date, 'yyyyMM') AND from_unixtime(t.end_date, 'yyyyMM')
 GROUP BY year_month;