I am querying Athena thru lambda. Results are getting stored in csv format in S3 bucket.
The csv files has two columns - EventTime and instance id.
I am reading csv file via one of function in my lambda handler:
def read_instanceids(path):
s3 = boto3.resource('s3')
bucket = s3.Bucket('aws-athena-query-results-mybucket-us-east-1')
obj = bucket.Object(key= path)
response = obj.get()
lines = response['Body'].read().decode('utf-8').split()
return lines**
Output:
[
"\"eventTime\",\"instanceId\"",
"\"2021-09-27T19:46:08Z\",\"\"\"i-0aa1f4dd\"\"\"",
"\"2021-09-27T21:04:13Z\",\"\"\"i-0465c287\"\"\"",
"\"2021-09-27T21:10:48Z\",\"\"\"i-08b75f79\"\"\"",
"\"2021-09-27T19:40:43Z\",\"\"\"i-0456700b\"\"\"",
"\"2021-03-29T21:58:40Z\",\"\"\"i-0724f99f\"\"\"",
"\"2021-03-29T23:27:44Z\",\"\"\"i-0fafbe64\"\"\"",
"\"2021-03-29T21:41:12Z\",\"\"\"i-0064a8552\"\"\"",
"\"2021-03-29T23:19:09Z\",\"\"\"i-07f5f08e5\"\"\""
]
I want to store only my instance ids in one array.
How I can achieve that. I cant use Pandas/Numpy.
If I am using get_query_results - and returning the response - its in the below format:
[
{
"Data": [
{
"VarCharValue": "eventTime"
},
{
"VarCharValue": "instanceId"
}
]
},
{
"Data": [
{
"VarCharValue": "2021-09-23T22:36:15Z"
},
{
"VarCharValue": "\"i-053090803\""
}
]
},
{
"Data": [
{
"VarCharValue": "2021-03-29T21:58:40Z"
},
{
"VarCharValue": "\"i-0724f62a\""
}
]
},
{
"Data": [
{
"VarCharValue": "2021-03-29T21:41:12Z"
},
{
"VarCharValue": "\"i-552\""
}
]
},
{
"Data": [
{
"VarCharValue": "2021-03-29T23:19:09Z"
},
{
"VarCharValue": "\"i-07f4e5\""
}
]
},
{
"Data": [
{
"VarCharValue": "2021-03-29T23:03:09Z"
},
{
"VarCharValue": "\"i-0eb453\""
}
]
},
{
"Data": [
{
"VarCharValue": "2021-03-30T19:18:11Z"
},
{
"VarCharValue": "\"i-062120\""
}
]
},
{
"Data": [
{
"VarCharValue": "2021-03-30T18:15:26Z"
},
{
"VarCharValue": "\"i-0121a04\""
}
]
},
{
"Data": [
{
"VarCharValue": "2021-03-29T23:27:44Z"
},
{
"VarCharValue": "\"i-0f213\""
}
]
},
{
"Data": [
{
"VarCharValue": "2021-03-30T18:07:05Z"
},
{
"VarCharValue": "\"i-0ee19d8\""
}
]
},
{
"Data": [
{
"VarCharValue": "2021-04-28T14:49:22Z"
},
{
"VarCharValue": "\"i-04ad3c29\""
}
]
},
{
"Data": [
{
"VarCharValue": "2021-04-28T14:38:43Z"
},
{
"VarCharValue": "\"i-7c6166\""
}
]
},
{
"Data": [
{
"VarCharValue": "2021-03-30T19:13:42Z"
},
{
"VarCharValue": "\"i-07bc579d\""
}
]
},
{
"Data": [
{
"VarCharValue": "2021-04-29T19:47:34Z"
},
{
"VarCharValue": "\"i-0b8bc7df5\""
}
]
}
]
CodePudding user response:
IF your list was valid, you can do:
l = [ "eventTime",
"instanceId",
"2021-09-27T19:46:08Z",
"i-0aa1f4dd",
"2021-09-27T21:04:13Z",
"""i-0465c287""",
"2021-09-27T21:10:48Z",
"""i-08b75f79""",
"2021-09-27T19:40:43Z",
"""i-0456700b""",
"2021-03-29T21:58:40Z",
"""i-0724f99f""",
"2021-03-29T23:27:44Z",
"""i-0fafbe64""",
"2021-03-29T21:41:12Z",
"""i-0064a8552""",
"2021-03-29T23:19:09Z",
"""i-07f5f08e5""" ]
print(l[2:][1::2])
['i-0aa1f4dd', 'i-0465c287', 'i-08b75f79', 'i-0456700b', 'i-0724f99f', 'i-0fafbe64', 'i-0064a8552', 'i-07f5f08e5']
CodePudding user response:
Python has csv module in standard library. https://docs.python.org/3/library/csv.html
But in this use case, if instanceIds doesn't contain comma you can split lines by comma, take second field and strip double quotes.
def read_instanceids(path):
s3 = boto3.resource('s3')
bucket = s3.Bucket('aws-athena-query-results-mybucket-us-east-1')
obj = bucket.Object(key= path)
response = obj.get()
lines = response['Body'].read().decode('utf-8').split()
return [line.split(',')[1].strip('"') for line in lines[1:]]
CodePudding user response:
You can use the result returned from Amazon Athena via get_query_results()
.
If the data
variable contains the JSON shown in your question, you can extract a list of the instances with:
rows = [row['Data'][1]['VarCharValue'].replace('"', '') for row in data]
print(rows)
The output is:
['instanceId', 'i-053090803', 'i-0724f62a', 'i-552', 'i-07f4e5', 'i-0eb453', 'i-062120', 'i-0121a04', 'i-0f213', 'i-0ee19d8', 'i-04ad3c29', 'i-7c6166', 'i-07bc579d', 'i-0b8bc7df5']
You can skip the column header by referencing: rows[1:]