Home > Software design >  Getting only most recent timestamp from a set of distinct values, so there are no repeats of 'd
Getting only most recent timestamp from a set of distinct values, so there are no repeats of 'd

Time:07-06

We are working with database-stored post-Bro-processed network events from malware lab infections in order to populate a MISP instance with the data. The underlying database system is PostgreSQL.

We have the code that updates the MISP side working fine, but because there's tons of repeat events with different timestamps, we can have data results in the thousands for a distinct pair of ip and port.

An example of values would be something like (note: datatypes are included with the header row in the table for clarity on the way this is stored in the SQL DB):

ts (timestamp w/ timezone) resp_h (inet) resp_p (integer)
2022-07-05 07:37:50.869766 00:00 52.254.114.69 443
2022-07-05 06:29:37.149036 00:00 64.62.200.237 443
2022-07-05 06:29:24.205741 00:00 64.62.200.237 443
2022-06-28 12:44:52.303022 00:00 64.62.200.237 443
...

This data is as 'distinct' as we can get it, selecting from our table as follows (note that INFECTIONID is an integer value that represents a specific infection ID in the system):

SELECT DISTINCT ts, resp_h, resp_p 
FROM bro_data WHERE infection=INFECTIONID 
  AND resp_h IS NOT NULL 
  AND resp_p IS NOT NULL
ORDER BY ts DESC

Now, I am almost certain we can narrow this down further, but I'm not entirely sure how to. What I'd like to do is narrow down the data list so that we get one record for each resp_h and resp_p value that has only one timestamp - the latest timestamp for a record.

Using the above three example values out of thousands, the desired results would be the return of exactly two values (again, datavalue types along with header for consistency with above table):

ts (timestamp w/ timezone) resp_h (inet) resp_p (integer)
2022-07-05 07:37:50.869766 00:00 52.254.114.69 443
2022-07-05 06:29:37.149036 00:00 64.62.200.237 443

I'm not as SQL experienced as I am Python experienced, and I'd like to not have to preprocess the thousands of data values at the Python code that processes the data into the MISP events on our end. Is there anyone who can help guide me with proper SQL search syntax to get only the desired results?

CodePudding user response:

As you need always the latest timestamp then it should be:

SELECT max(ts) AS latest_ts, resp_h, resp_p
           FROM bro_data
           WHERE infection=INFECTIONID 
  AND resp_h IS NOT NULL 
  AND resp_p IS NOT NULL
          GROUP BY 2,3;

CodePudding user response:

Untested:

SELECT
   distinct on(resp_h, rsp_p) ts, resp_h, resp_p
FROM
   (SELECT 
      ts, resp_h, resp_p
   FROM 
      bro_data 
   WHERE 
      infection=INFECTIONID 
   AND 
      resp_h IS NOT NULL 
   AND 
      resp_p IS NOT NULL
   ORDER BY 
      ts DESC)
   AS  
      ts_desc
  • Related