SELECT *
FROM table_name
WHERE
from_unixtime(end_time,'yyyyMMdd') >= '20211201'
and from_unixtime(start_time,'yyyyMMdd') <= '20211231'
I want to substitute '20211201' and '20211231' for every month between year 2020 to 2021, one way to do it is to substitute strings by hands and split it into multiple queries like this:
SELECT *
FROM table_name
WHERE
from_unixtime(end_time,'yyyyMMdd') >= '20211201'
and from_unixtime(start_time,'yyyyMMdd') <= '20211231'
SELECT *
FROM table_name
WHERE
from_unixtime(end_time,'yyyyMMdd') >= '20211101'
and from_unixtime(start_time,'yyyyMMdd') <= '20211130'
Is there any way I can put them in one query? I'm thinking a for loop or full join a column with
month
20211031
20221130
20211231
...
but I don‘t know a elegant way to generate a series of months between two dates either..
CodePudding user response:
WITH table_name AS (
SELECT 0 user_id, unix_timestamp('2022-01-02 12:00:00') start_time, unix_timestamp('2022-01-04 12:00:00') end_time UNION ALL
SELECT 1 user_id, unix_timestamp('2022-01-04 12:00:00') start_time, unix_timestamp('2022-01-08 12:00:00') end_time UNION ALL
SELECT 2 user_id, unix_timestamp('2022-02-02 12:00:00') start_time, unix_timestamp('2022-04-04 12:00:00') end_time UNION ALL
SELECT 3 user_id, unix_timestamp('2022-02-04 12:00:00') start_time, unix_timestamp('2022-02-08 12:00:00') end_time UNION ALL
SELECT 3 user_id, unix_timestamp('2022-02-02 12:00:00') start_time, unix_timestamp('2022-02-04 12:00:00') end_time UNION ALL
SELECT 4 user_id, unix_timestamp('2022-04-02 12:00:00') start_time, unix_timestamp('2022-05-05 12:00:00') end_time
),
explodes AS (
SELECT *, date_format(add_months(from_unixtime(t.start_time), diff), 'yyyyMM') month
FROM table_name t
LATERAL VIEW EXPLODE (
-- this hack is due to lack of supporting *generate_array* functions and non-equi joins in Hive
-- you can increase below map based on the period seen on your data
map(0, array(0), 1, array(0, 1), 2, array(0, 1, 2), 3, array(0, 1, 2, 3), 4, array(0, 1, 2, 3, 4))[
int(months_between(from_unixtime(t.end_time), from_unixtime(t.start_time)))
]) tf AS diff
)
SELECT month, COUNT(DISTINCT user_id) mau FROM explodes GROUP BY month;
previous approach with more limitations
Assuming that
- your data is big enough not to miss any months between period you're considering.
- you can set hive option below for non-equi join
Cartesian products are disabled for safety reasons. If you know what you are doing, please sethive.strict.checks.cartesian.product to false and that hive.mapred.mode is not set to 'strict' to proceed.
WITH table_name AS (
-- same as above
),
months AS (
SELECT from_unixtime(start_time, 'yyyyMM') year_month FROM table_name
UNION DISTINCT
SELECT from_unixtime(end_time, 'yyyyMM') FROM table_name
)
SELECT m.year_month, COUNT(DISTINCT user_id) mau
FROM months m LEFT JOIN table_name t ON m.year_month BETWEEN from_unixtime(t.start_date, 'yyyyMM') AND from_unixtime(t.end_date, 'yyyyMM')
GROUP BY year_month;