Home > database >  Extract specific column from csv stored in S3
Extract specific column from csv stored in S3

Time:07-20

I am querying Athena thru lambda. Results are getting stored in csv format in S3 bucket.

The csv files has two columns - EventTime and instance id.

I am reading csv file via one of function in my lambda handler:

def read_instanceids(path):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('aws-athena-query-results-mybucket-us-east-1')
    obj = bucket.Object(key= path)
    response = obj.get()
    lines = response['Body'].read().decode('utf-8').split()
    return lines**

Output:

[
  "\"eventTime\",\"instanceId\"",
  "\"2021-09-27T19:46:08Z\",\"\"\"i-0aa1f4dd\"\"\"",
  "\"2021-09-27T21:04:13Z\",\"\"\"i-0465c287\"\"\"",
  "\"2021-09-27T21:10:48Z\",\"\"\"i-08b75f79\"\"\"",
  "\"2021-09-27T19:40:43Z\",\"\"\"i-0456700b\"\"\"",
  "\"2021-03-29T21:58:40Z\",\"\"\"i-0724f99f\"\"\"",
  "\"2021-03-29T23:27:44Z\",\"\"\"i-0fafbe64\"\"\"",
  "\"2021-03-29T21:41:12Z\",\"\"\"i-0064a8552\"\"\"",
  "\"2021-03-29T23:19:09Z\",\"\"\"i-07f5f08e5\"\"\""
]

I want to store only my instance ids in one array.

How I can achieve that. I cant use Pandas/Numpy.

If I am using get_query_results - and returning the response - its in the below format:

[
  {
    "Data": [
      {
        "VarCharValue": "eventTime"
      },
      {
        "VarCharValue": "instanceId"
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-09-23T22:36:15Z"
      },
      {
        "VarCharValue": "\"i-053090803\""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T21:58:40Z"
      },
      {
        "VarCharValue": "\"i-0724f62a\""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T21:41:12Z"
      },
      {
        "VarCharValue": "\"i-552\""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T23:19:09Z"
      },
      {
        "VarCharValue": "\"i-07f4e5\""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T23:03:09Z"
      },
      {
        "VarCharValue": "\"i-0eb453\""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-30T19:18:11Z"
      },
      {
        "VarCharValue": "\"i-062120\""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-30T18:15:26Z"
      },
      {
        "VarCharValue": "\"i-0121a04\""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T23:27:44Z"
      },
      {
        "VarCharValue": "\"i-0f213\""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-30T18:07:05Z"
      },
      {
        "VarCharValue": "\"i-0ee19d8\""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-04-28T14:49:22Z"
      },
      {
        "VarCharValue": "\"i-04ad3c29\""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-04-28T14:38:43Z"
      },
      {
        "VarCharValue": "\"i-7c6166\""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-30T19:13:42Z"
      },
      {
        "VarCharValue": "\"i-07bc579d\""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-04-29T19:47:34Z"
      },
      {
        "VarCharValue": "\"i-0b8bc7df5\""
      }
    ]
  }
  ]

CodePudding user response:

IF your list was valid, you can do:

l = [ "eventTime",
      "instanceId",
      "2021-09-27T19:46:08Z",
      "i-0aa1f4dd",
      "2021-09-27T21:04:13Z",
      """i-0465c287""",
      "2021-09-27T21:10:48Z",
      """i-08b75f79""",
      "2021-09-27T19:40:43Z",
      """i-0456700b""",
      "2021-03-29T21:58:40Z",
      """i-0724f99f""",
      "2021-03-29T23:27:44Z",
      """i-0fafbe64""",
      "2021-03-29T21:41:12Z",
      """i-0064a8552""",
      "2021-03-29T23:19:09Z",
      """i-07f5f08e5""" ]
print(l[2:][1::2])
['i-0aa1f4dd', 'i-0465c287', 'i-08b75f79', 'i-0456700b', 'i-0724f99f', 'i-0fafbe64', 'i-0064a8552', 'i-07f5f08e5']

CodePudding user response:

Python has csv module in standard library. https://docs.python.org/3/library/csv.html

But in this use case, if instanceIds doesn't contain comma you can split lines by comma, take second field and strip double quotes.

def read_instanceids(path):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('aws-athena-query-results-mybucket-us-east-1')
    obj = bucket.Object(key= path)
    response = obj.get()
    lines = response['Body'].read().decode('utf-8').split()
    return [line.split(',')[1].strip('"') for line in lines[1:]]

CodePudding user response:

You can use the result returned from Amazon Athena via get_query_results().

If the data variable contains the JSON shown in your question, you can extract a list of the instances with:

rows = [row['Data'][1]['VarCharValue'].replace('"', '') for row in data]
print(rows)

The output is:

['instanceId', 'i-053090803', 'i-0724f62a', 'i-552', 'i-07f4e5', 'i-0eb453', 'i-062120', 'i-0121a04', 'i-0f213', 'i-0ee19d8', 'i-04ad3c29', 'i-7c6166', 'i-07bc579d', 'i-0b8bc7df5']

You can skip the column header by referencing: rows[1:]

  • Related