I am working on an SQL query that a colleague set up some time ago, he has since left the company and I had to update the software that creates the table. I am not sure if the data in the column has changed format since he wrote it, but one part has stopped working and my SQL knowledge is not good enough to understand how to filter these results out again.
The data is collected using cloudquery from AWS regarding ec2 instances, currently my (modified) query looks like:
SELECT cluster_name
,MAX(flatcar_versions.id-1) AS lag_max
,MIN(flatcar_versions.id-1) AS lag_min
,COUNT(aws_ec2_instances.instance_id) AS instances
,COUNT(aws_ec2_instances.instance_id) filter (where flatcar_versions.id <= 1) * 100 / COUNT(aws_ec2_instances.instance_id) AS unpatched_percentage
FROM
(
SELECT account_id
,aws_ec2_instances.image_id
,aws_ami.name
,aws_ec2_instances.instance_id
,(regexp_matches(aws_ami.name,'^Flatcar-stable-(.*)-hvm$'))[1] AS flatcar_release
,(regexp_matches(key_name,'^(.*)-.*$'))[1] AS cluster_name
,(regexp_matches(key_name,'^.*-(.*$)'))[1] AS resource_name
FROM aws_ec2_instances
JOIN aws_ami
ON aws_ec2_instances.image_id = aws_ami.image_id
AND key_name not like 'packer%'
UNION
SELECT aws_ec2_instances.account_id
,aws_ec2_instances.image_id
,aws_ec2_images.name
,aws_ec2_instances.instance_id
,aws_ec2_images.tags::json->>'FlatcarVersion' AS flatcar_release
,(regexp_matches(key_name,'^(.*)-.*$'))[1] AS cluster_name
,(regexp_matches(key_name,'^.*-(.*$)'))[1] AS resource_name
FROM aws_ec2_instances
JOIN aws_ec2_images
ON aws_ec2_instances.image_id = aws_ec2_images.image_id
AND key_name not like 'packer%'
) AS aws_ec2_instances
JOIN flatcar_versions
ON aws_ec2_instances.flatcar_release = flatcar_versions.version
GROUP BY cluster_name;
This works and returns me a table like:
cluster_name | lag_max | lag_min | instances | unpatched_percentage |
---|---|---|---|---|
one | 1 | 0 | 10 | 90 |
two | 20 | 0 | 12 | 91 |
three | 0 | 0 | 15 | 100 |
four | 11 | 0 | 15 | 80 |
five | 20 | 0 | 12 | 91 |
six | 4 | 0 | 11 | 81 |
seven | 5 | 5 | 13 | 0 |
eight | 2 | 0 | 11 | 90 |
nine | 1 | 0 | 30 | 63 |
ten | 38 | 38 | 1 | 0 |
The only issue is, that before I updated and made my changes, this query was able to filter out columns where the instance state state is 'terminated'. I am not certain how the field used to look, but currently the only field I can find with that info in it is stored in json. I think if it just had the string it would be easier (I think that's how it used to be stored) but since it's json I can't get my head around what to do to filter these results out.
I am able to run the following simple query, this gives me the text in the field I am looking for (i.e. normally 'running' or 'terminated'):
SELECT
state -> 'Name' AS state_name
FROM
aws_ec2_instances
My returned data is a simple table with the states as the only info in the one column. But I don't understand how I would filter the results in my first query by the value in the json column.
Previously there were two filters configured like so:
AND key_name not like 'packer%'
AND state_name != 'terminated'
These were following each of the ON statements. But when I try that now I get an SQL error (the first filter worked so I kept it, but when I had the second it returned me errors so I removed it while I tried to figure it out):
ERROR: column "state_name" does not exist
I thought this would fix it, by creating the field in the SELECT statement, I thought I would be able to reference it in an AND statement, but I get the same 'column does not exist' error. This is my query:
SELECT cluster_name
,MAX(flatcar_versions.id-1) AS lag_max
,MIN(flatcar_versions.id-1) AS lag_min
,COUNT(aws_ec2_instances.instance_id) AS instances
,COUNT(aws_ec2_instances.instance_id) filter (where flatcar_versions.id <= 1) * 100 / COUNT(aws_ec2_instances.instance_id) AS unpatched_percentage
FROM
(
SELECT account_id
,aws_ec2_instances.image_id
,aws_ami.name
,aws_ec2_instances.instance_id
,(regexp_matches(aws_ami.name,'^Flatcar-stable-(.*)-hvm$'))[1] AS flatcar_release
,(regexp_matches(key_name,'^(.*)-.*$'))[1] AS cluster_name
,(regexp_matches(key_name,'^.*-(.*$)'))[1] AS resource_name
,aws_ec2_instances.state -> 'Name' AS state_name
FROM aws_ec2_instances
JOIN aws_ami
ON aws_ec2_instances.image_id = aws_ami.image_id
AND key_name not like 'packer%'
AND state_name != 'terminated'
UNION
SELECT aws_ec2_instances.account_id
,aws_ec2_instances.image_id
,aws_ec2_images.name
,aws_ec2_instances.instance_id
,aws_ec2_images.tags::json->>'FlatcarVersion' AS flatcar_release
,(regexp_matches(key_name,'^(.*)-.*$'))[1] AS cluster_name
,(regexp_matches(key_name,'^.*-(.*$)'))[1] AS resource_name
,aws_ec2_instances.state -> 'Name' as state_name
FROM aws_ec2_instances
JOIN aws_ec2_images
ON aws_ec2_instances.image_id = aws_ec2_images.image_id
AND key_name not like 'packer%'
AND state_name != 'terminated'
) AS aws_ec2_instances
JOIN flatcar_versions
ON aws_ec2_instances.flatcar_release = flatcar_versions.version
GROUP BY cluster_name;
I also tried this query as I thought maybe I needed to create the column in the first SELECT, but again, same error:
SELECT cluster_name
,MAX(flatcar_versions.id-1) AS lag_max
,MIN(flatcar_versions.id-1) AS lag_min
,COUNT(aws_ec2_instances.instance_id) AS instances
,COUNT(aws_ec2_instances.instance_id) filter (where flatcar_versions.id <= 1) * 100 / COUNT(aws_ec2_instances.instance_id) AS unpatched_percentage
,state_name
FROM
(
SELECT account_id
,aws_ec2_instances.image_id
,aws_ami.name
,aws_ec2_instances.instance_id
,(regexp_matches(aws_ami.name,'^Flatcar-stable-(.*)-hvm$'))[1] AS flatcar_release
,(regexp_matches(key_name,'^(.*)-.*$'))[1] AS cluster_name
,(regexp_matches(key_name,'^.*-(.*$)'))[1] AS resource_name
,aws_ec2_instances.state -> 'Name' AS state_name
FROM aws_ec2_instances
JOIN aws_ami
ON aws_ec2_instances.image_id = aws_ami.image_id
AND key_name not like 'packer%'
AND state_name != 'terminated'
UNION
SELECT aws_ec2_instances.account_id
,aws_ec2_instances.image_id
,aws_ec2_images.name
,aws_ec2_instances.instance_id
,aws_ec2_images.tags::json->>'FlatcarVersion' AS flatcar_release
,(regexp_matches(key_name,'^(.*)-.*$'))[1] AS cluster_name
,(regexp_matches(key_name,'^.*-(.*$)'))[1] AS resource_name
,aws_ec2_instances.state -> 'Name' as state_name
FROM aws_ec2_instances
JOIN aws_ec2_images
ON aws_ec2_instances.image_id = aws_ec2_images.image_id
AND key_name not like 'packer%'
AND state_name != 'terminated'
) AS aws_ec2_instances
JOIN flatcar_versions
ON aws_ec2_instances.flatcar_release = flatcar_versions.version
GROUP BY cluster_name;
I have tried searching around the subject and reading SQL query guides, but I think my case is quite specific and I have these nested queries that I don't fully understand...
Can anyone explain how to filter the results so if the column has a value of terminated in the Name field of the state
json then the column is removed?
CodePudding user response:
Here is a guess - instead of this:
AND state_name != 'terminated'
Try this:
AND aws_ec2_instances.state ->> 'Name' != 'terminated'
https://www.postgresql.org/docs/current/functions-json.html
json ->> text → text
jsonb ->> text → text
Extracts JSON object field with the given key, as text.
'{"a":1,"b":2}'::json ->> 'b' → 2
examples
https://hub.steampipe.io/plugins/turbot/aws/tables/aws_ec2_instance