I am trying to sum together content_len values that are entered into the database a few seconds apart. The table currently breaks into a new row once the character count hits 999, and inserts each overflow a few seconds apart. Due to errors, the overflow can be timestamped earlier than the previous body.
My current attempt is to round the timestamp to unix seconds, but issues occur when the rounding (unix_ceil) does not produce the same number for entries that are to be grouped together. How can I ensure that entries within ~20 seconds of each other are summed together? Usually there is a separation of at least a few minutes between distinct records that are not to be grouped together (e.g. 999 and 37 at ~09:50am and then 139 at ~09:59am for source = 1).
SELECT source
      ,entry_dt
      ,SUM(content_len) AS full_length
FROM (
    SELECT source
          ,entry_dt
          ,entry_time
          ,(TO_DATE(CONCAT(entry_dt, entry_time), 'YYYYMMDDHH24MISS') - TO_DATE('2020-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS')) * 86400 AS unix_from_2020
          ,CEIL(86400 * (TO_DATE(CONCAT(entry_dt, entry_time), 'YYYYMMDDHH24MISS') - TO_DATE('2020-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS')) / 20) * 20 AS unix_ceil -- round up to 20-second buckets
    FROM schema.text_length_records
) s
GROUP BY source, entry_dt, unix_ceil
CodePudding user response:
Do not store dates and times separately, and do not store them in non-DATE data types. In Oracle, a DATE is a binary data type consisting of 7 bytes containing the components century, year-of-century, month, day, hour, minute and second; it ALWAYS has those components and is NEVER stored in any particular format. You can use a single DATE column to store both date and time, and do it more efficiently and with better error checking than if you store the values separately as strings or numbers.
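As a minimal sketch of such a migration, assuming the text_length_records table shown later in this answer and a hypothetical combined column named entry_ts:
-- Hypothetical combined DATE column (the name entry_ts is an assumption):
ALTER TABLE text_length_records ADD (entry_ts DATE);

-- Back-fill it from the existing numeric columns,
-- zero-padding times before 10:00:00 to six digits:
UPDATE text_length_records
SET    entry_ts = TO_DATE(entry_dt || LPAD(entry_time, 6, '0'), 'YYYYMMDDHH24MISS');

-- The separate columns could then be dropped:
-- ALTER TABLE text_length_records DROP (entry_dt, entry_time);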
From Oracle 12, you can use MATCH_RECOGNIZE to do row-by-row processing.
If you want each row to be within 20 seconds of the previous row then:
SELECT *
FROM (
    SELECT source,
           TO_DATE(entry_dt || LPAD(entry_time, 6, '0'), 'YYYYMMDDHH24MISS') AS entry_dt,
           content_len
    FROM   text_length_records
)
MATCH_RECOGNIZE(
    PARTITION BY source
    ORDER BY entry_dt
    MEASURES
        FIRST(entry_dt)  AS start_entry_dt,
        LAST(entry_dt)   AS end_entry_dt,
        SUM(content_len) AS content_len
    ONE ROW PER MATCH
    PATTERN (within_20* last_time)
    DEFINE
        within_20 AS entry_dt + INTERVAL '20' SECOND >= NEXT(entry_dt)
)
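Here within_20 matches a row whose successor (NEXT(entry_dt)) is at most 20 seconds later; last_time has no entry in the DEFINE clause, so it always evaluates to true and simply anchors the final row of each group.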
If you want each row to be within 20 seconds of the first row of the group then:
SELECT *
FROM (
    SELECT source,
           TO_DATE(entry_dt || LPAD(entry_time, 6, '0'), 'YYYYMMDDHH24MISS') AS entry_dt,
           content_len
    FROM   text_length_records
)
MATCH_RECOGNIZE(
    PARTITION BY source
    ORDER BY entry_dt
    MEASURES
        FIRST(entry_dt)  AS start_entry_dt,
        LAST(entry_dt)   AS end_entry_dt,
        SUM(content_len) AS content_len
    ONE ROW PER MATCH
    PATTERN (within_20*)
    DEFINE
        within_20 AS entry_dt <= FIRST(entry_dt) + INTERVAL '20' SECOND
)
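In this version, FIRST(entry_dt) in the DEFINE clause refers to the first row of the match currently being built, so the condition is trivially true for the opening row of each group and the pattern never produces an empty match despite the * quantifier.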
Which, for the sample data:
CREATE TABLE text_length_records (source, entry_dt, entry_time, content_len) AS
SELECT 1, 20210910, 95059, 37 FROM DUAL UNION ALL
SELECT 1, 20210910, 95102, 999 FROM DUAL UNION ALL
SELECT 1, 20210910, 95959, 139 FROM DUAL UNION ALL
SELECT 2, 20210910, 83320, 999 FROM DUAL UNION ALL
SELECT 2, 20210910, 83322, 999 FROM DUAL UNION ALL
SELECT 2, 20210910, 83324, 456 FROM DUAL;
Both output:
SOURCE  START_ENTRY_DT       END_ENTRY_DT         CONTENT_LEN
------  -------------------  -------------------  -----------
1       2021-09-10 09:50:59  2021-09-10 09:51:02         1036
1       2021-09-10 09:59:59  2021-09-10 09:59:59          139
2       2021-09-10 08:33:20  2021-09-10 08:33:24         2454
Note: Although the queries produce the same output for your sample data, they will produce slightly different outputs for data where the 3rd row of a group is not within 20 seconds of the 1st row of the group but is within 20 seconds of the 2nd row.
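For example, rows at 09:00:00, 09:00:15 and 09:00:30 would form a single group under the first query (each row is within 20 seconds of its predecessor) but two groups under the second (09:00:30 is more than 20 seconds after the group's first row at 09:00:00).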
db<>fiddle here