We are working with database-stored post-Bro-processed network events from malware lab infections in order to populate a MISP instance with the data. The underlying database system is PostgreSQL.
We have the code that updates the MISP side working fine, but because there's tons of repeat events with different timestamps, we can have data results in the thousands for a distinct pair of ip and port.
An example of values would be something like (note: datatypes are included with the header row in the table for clarity on the way this is stored in the SQL DB):
ts (timestamp w/ timezone) | resp_h (inet) | resp_p (integer) |
---|---|---|
2022-07-05 07:37:50.869766 00:00 | 52.254.114.69 | 443 |
2022-07-05 06:29:37.149036 00:00 | 64.62.200.237 | 443 |
2022-07-05 06:29:24.205741 00:00 | 64.62.200.237 | 443 |
2022-06-28 12:44:52.303022 00:00 | 64.62.200.237 | 443 |
... |
This data is as 'distinct' as we can get it, selecting from our table as follows (note that INFECTIONID
is an integer value that represents a specific infection ID in the system):
SELECT DISTINCT ts, resp_h, resp_p
FROM bro_data WHERE infection=INFECTIONID
AND resp_h IS NOT NULL
AND resp_p IS NOT NULL
ORDER BY ts DESC
Now, I am almost certain we can narrow this down further, but I'm not entirely sure how to. What I'd like to do is narrow down the data list so that we get one record for each resp_h
and resp_p
value that has only one timestamp - the latest timestamp for a record.
Using the above three example values out of thousands, the desired results would be the return of exactly two values (again, datavalue types along with header for consistency with above table):
ts (timestamp w/ timezone) | resp_h (inet) | resp_p (integer) |
---|---|---|
2022-07-05 07:37:50.869766 00:00 | 52.254.114.69 | 443 |
2022-07-05 06:29:37.149036 00:00 | 64.62.200.237 | 443 |
I'm not as SQL experienced as I am Python experienced, and I'd like to not have to preprocess the thousands of data values at the Python code that processes the data into the MISP events on our end. Is there anyone who can help guide me with proper SQL search syntax to get only the desired results?
CodePudding user response:
As you need always the latest timestamp then it should be:
SELECT max(ts) AS latest_ts, resp_h, resp_p
FROM bro_data
WHERE infection=INFECTIONID
AND resp_h IS NOT NULL
AND resp_p IS NOT NULL
GROUP BY 2,3;
CodePudding user response:
Untested:
SELECT
distinct on(resp_h, rsp_p) ts, resp_h, resp_p
FROM
(SELECT
ts, resp_h, resp_p
FROM
bro_data
WHERE
infection=INFECTIONID
AND
resp_h IS NOT NULL
AND
resp_p IS NOT NULL
ORDER BY
ts DESC)
AS
ts_desc