Home > Software design >  How to filter out certain rows if one column has a certain json value?
How to filter out certain rows if one column has a certain json value?

Time:01-17

I am working on an SQL query that a colleague set up some time ago, he has since left the company and I had to update the software that creates the table. I am not sure if the data in the column has changed format since he wrote it, but one part has stopped working and my SQL knowledge is not good enough to understand how to filter these results out again.

The data is collected using cloudquery from AWS regarding ec2 instances, currently my (modified) query looks like:

SELECT  cluster_name
       ,MAX(flatcar_versions.id-1)                                                                              AS lag_max
       ,MIN(flatcar_versions.id-1)                                                                              AS lag_min
       ,COUNT(aws_ec2_instances.instance_id)                                                                             AS instances
       ,COUNT(aws_ec2_instances.instance_id) filter (where flatcar_versions.id <= 1) * 100 / COUNT(aws_ec2_instances.instance_id) AS unpatched_percentage
FROM
(
    SELECT  account_id
           ,aws_ec2_instances.image_id
           ,aws_ami.name
           ,aws_ec2_instances.instance_id
           ,(regexp_matches(aws_ami.name,'^Flatcar-stable-(.*)-hvm$'))[1] AS flatcar_release
           ,(regexp_matches(key_name,'^(.*)-.*$'))[1]                     AS cluster_name
           ,(regexp_matches(key_name,'^.*-(.*$)'))[1]                     AS resource_name
    FROM aws_ec2_instances
    JOIN aws_ami
    ON aws_ec2_instances.image_id = aws_ami.image_id 
    AND key_name not like 'packer%'
    UNION
    SELECT  aws_ec2_instances.account_id
           ,aws_ec2_instances.image_id
           ,aws_ec2_images.name
           ,aws_ec2_instances.instance_id
           ,aws_ec2_images.tags::json->>'FlatcarVersion' AS flatcar_release
           ,(regexp_matches(key_name,'^(.*)-.*$'))[1]    AS cluster_name
           ,(regexp_matches(key_name,'^.*-(.*$)'))[1]    AS resource_name
    FROM aws_ec2_instances
    JOIN aws_ec2_images
    ON aws_ec2_instances.image_id = aws_ec2_images.image_id
    AND key_name not like 'packer%'
) AS aws_ec2_instances
JOIN flatcar_versions
ON aws_ec2_instances.flatcar_release = flatcar_versions.version
GROUP BY cluster_name;

This works and returns me a table like:

cluster_name lag_max lag_min instances unpatched_percentage
one 1 0 10 90
two 20 0 12 91
three 0 0 15 100
four 11 0 15 80
five 20 0 12 91
six 4 0 11 81
seven 5 5 13 0
eight 2 0 11 90
nine 1 0 30 63
ten 38 38 1 0

The only issue is, that before I updated and made my changes, this query was able to filter out columns where the instance state state is 'terminated'. I am not certain how the field used to look, but currently the only field I can find with that info in it is stored in json. I think if it just had the string it would be easier (I think that's how it used to be stored) but since it's json I can't get my head around what to do to filter these results out.

I am able to run the following simple query, this gives me the text in the field I am looking for (i.e. normally 'running' or 'terminated'):

SELECT 
state -> 'Name' AS state_name
FROM 
aws_ec2_instances 

My returned data is a simple table with the states as the only info in the one column. But I don't understand how I would filter the results in my first query by the value in the json column.

Previously there were two filters configured like so:

AND key_name not like 'packer%'
AND state_name != 'terminated'

These were following each of the ON statements. But when I try that now I get an SQL error (the first filter worked so I kept it, but when I had the second it returned me errors so I removed it while I tried to figure it out):

ERROR: column "state_name" does not exist

I thought this would fix it, by creating the field in the SELECT statement, I thought I would be able to reference it in an AND statement, but I get the same 'column does not exist' error. This is my query:

SELECT  cluster_name
       ,MAX(flatcar_versions.id-1)                                                                              AS lag_max
       ,MIN(flatcar_versions.id-1)                                                                              AS lag_min
       ,COUNT(aws_ec2_instances.instance_id)                                                                             AS instances
       ,COUNT(aws_ec2_instances.instance_id) filter (where flatcar_versions.id <= 1) * 100 / COUNT(aws_ec2_instances.instance_id) AS unpatched_percentage
FROM
(
    SELECT  account_id
           ,aws_ec2_instances.image_id
           ,aws_ami.name
           ,aws_ec2_instances.instance_id
           ,(regexp_matches(aws_ami.name,'^Flatcar-stable-(.*)-hvm$'))[1] AS flatcar_release
           ,(regexp_matches(key_name,'^(.*)-.*$'))[1]                     AS cluster_name
           ,(regexp_matches(key_name,'^.*-(.*$)'))[1]                     AS resource_name
           ,aws_ec2_instances.state -> 'Name'                             AS state_name
    FROM aws_ec2_instances
    JOIN aws_ami
    ON aws_ec2_instances.image_id = aws_ami.image_id 
    AND key_name not like 'packer%'
    AND state_name != 'terminated'
    UNION
    SELECT  aws_ec2_instances.account_id
           ,aws_ec2_instances.image_id
           ,aws_ec2_images.name
           ,aws_ec2_instances.instance_id
           ,aws_ec2_images.tags::json->>'FlatcarVersion' AS flatcar_release
           ,(regexp_matches(key_name,'^(.*)-.*$'))[1]    AS cluster_name
           ,(regexp_matches(key_name,'^.*-(.*$)'))[1]    AS resource_name
           ,aws_ec2_instances.state -> 'Name'            as state_name
    FROM aws_ec2_instances
    JOIN aws_ec2_images
    ON aws_ec2_instances.image_id = aws_ec2_images.image_id
    AND key_name not like 'packer%'
    AND state_name != 'terminated'
) AS aws_ec2_instances
JOIN flatcar_versions
ON aws_ec2_instances.flatcar_release = flatcar_versions.version
GROUP BY cluster_name;

I also tried this query as I thought maybe I needed to create the column in the first SELECT, but again, same error:

SELECT  cluster_name
       ,MAX(flatcar_versions.id-1)                                                                              AS lag_max
       ,MIN(flatcar_versions.id-1)                                                                              AS lag_min
       ,COUNT(aws_ec2_instances.instance_id)                                                                             AS instances
       ,COUNT(aws_ec2_instances.instance_id) filter (where flatcar_versions.id <= 1) * 100 / COUNT(aws_ec2_instances.instance_id) AS unpatched_percentage
       ,state_name
FROM
(
    SELECT  account_id
           ,aws_ec2_instances.image_id
           ,aws_ami.name
           ,aws_ec2_instances.instance_id
           ,(regexp_matches(aws_ami.name,'^Flatcar-stable-(.*)-hvm$'))[1] AS flatcar_release
           ,(regexp_matches(key_name,'^(.*)-.*$'))[1]                     AS cluster_name
           ,(regexp_matches(key_name,'^.*-(.*$)'))[1]                     AS resource_name
           ,aws_ec2_instances.state -> 'Name'                             AS state_name
    FROM aws_ec2_instances
    JOIN aws_ami
    ON aws_ec2_instances.image_id = aws_ami.image_id 
    AND key_name not like 'packer%'
    AND state_name != 'terminated'
    UNION
    SELECT  aws_ec2_instances.account_id
           ,aws_ec2_instances.image_id
           ,aws_ec2_images.name
           ,aws_ec2_instances.instance_id
           ,aws_ec2_images.tags::json->>'FlatcarVersion' AS flatcar_release
           ,(regexp_matches(key_name,'^(.*)-.*$'))[1]    AS cluster_name
           ,(regexp_matches(key_name,'^.*-(.*$)'))[1]    AS resource_name
           ,aws_ec2_instances.state -> 'Name'            as state_name
    FROM aws_ec2_instances
    JOIN aws_ec2_images
    ON aws_ec2_instances.image_id = aws_ec2_images.image_id
    AND key_name not like 'packer%'
    AND state_name != 'terminated'
) AS aws_ec2_instances
JOIN flatcar_versions
ON aws_ec2_instances.flatcar_release = flatcar_versions.version
GROUP BY cluster_name;

I have tried searching around the subject and reading SQL query guides, but I think my case is quite specific and I have these nested queries that I don't fully understand...

Can anyone explain how to filter the results so if the column has a value of terminated in the Name field of the state json then the column is removed?

CodePudding user response:

Here is a guess - instead of this:

AND state_name != 'terminated'

Try this:

AND aws_ec2_instances.state ->> 'Name' != 'terminated'

https://www.postgresql.org/docs/current/functions-json.html

json ->> text → text

jsonb ->> text → text

Extracts JSON object field with the given key, as text.

'{"a":1,"b":2}'::json ->> 'b' → 2

examples

https://hub.steampipe.io/plugins/turbot/aws/tables/aws_ec2_instance

  • Related