Which library is better to use between "boto3" and "psycopg2" for Redshift operations in Python Lambda functions:
- Look up a table in a Redshift cluster
- Create a table in a Redshift cluster
- Insert data into a Redshift cluster
I would appreciate an answer with the following:
- Python code for either of the libraries that addresses all three of the above needs.
Thanks in advance!
CodePudding user response:
Connecting directly to Redshift from Lambda with psycopg2 is the simpler, more straightforward way to go, but it comes with a significant limitation. Lambda functions have run-time limits, and even if your SQL commands don't exceed the maximum run-time, you will be paying for the Lambda function to wait while Redshift completes the SQL. For fast-running SQL commands this isn't a problem, but inserting data can take some time depending on the volume.
If all your Redshift actions take less than a few seconds (and won't grow longer over time), then psycopg2 connecting directly to Redshift is likely the way to go. If the data insert takes a minute or two BUT the process doesn't run very often (say, daily), then psycopg2 may still be the way to go, as Lambda isn't very expensive when run infrequently. It is a process-simplicity vs. cost calculation.
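A minimal sketch of the direct-connection approach with psycopg2 (the endpoint, database, user, and password below are placeholders, and psycopg2 would need to be packaged with the Lambda deployment, for example as a layer):
import psycopg2

# Placeholder connection details -- replace with your cluster endpoint and
# credentials (ideally pulled from Secrets Manager rather than hard-coded).
conn = psycopg2.connect(
    host='redshift-cluster-1.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com',
    port=5439,
    dbname='dev',
    user='awsuser',
    password='your-password'
)

def lookup_table(table_name):
    # True if the table exists in the current database
    with conn.cursor() as cur:
        cur.execute("SELECT 1 FROM information_schema.tables WHERE table_name = %s", (table_name,))
        return cur.fetchone() is not None

def create_table(table_name):
    with conn.cursor() as cur:
        cur.execute('CREATE TABLE ' + table_name + ' (col1 integer);')
    conn.commit()

def insert_data(table_name, dval):
    with conn.cursor() as cur:
        cur.execute('INSERT INTO ' + table_name + ' (col1) VALUES (%s);', (dval,))
    conn.commit()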
Using the Redshift Data API is more complicated. This approach lets you fire the SQL at Redshift and terminate the Lambda. A later-running Lambda checks whether the SQL has completed and inspects the results. If the SQL has not completed, the Lambda needs to be invoked again later to see if things are done. This polling process is often handled by a Step Function and a set of different Lambda functions. Not super difficult, but a level of complexity above a single Lambda. Since this is a polling process, there is a wait time between checks for results; if it is too long you add latency, and if it is too short you over-poll and add cost.
If you need the Data API for time-out reasons, then you may want to use both: psycopg2 for short-running queries to the database, like 'does this table exist?', and the Data API for long-running steps like 'insert this 1TB set of data into Redshift'.
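A minimal sketch of the fire-and-check pattern with the Data API via boto3 (the ClusterIdentifier, Database, and DbUser values are placeholders; the status check would typically run in a later Lambda invocation or a Step Functions poll loop):
import boto3

client = boto3.client('redshift-data')

def submit_sql(sql):
    # Fire the statement and return its Id without waiting for completion
    response = client.execute_statement(
        ClusterIdentifier='redshift-cluster-1',
        Database='dev',
        DbUser='awsuser',
        Sql=sql
    )
    return response['Id']

def check_status(statement_id):
    # Status is one of SUBMITTED, PICKED, STARTED, FINISHED, FAILED, ABORTED
    return client.describe_statement(Id=statement_id)['Status']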
CodePudding user response:
Sample basic Python code for all three operations using boto3 (the Redshift Data API client).
import boto3

clientdata = boto3.client('redshift-data')

# Looks up the table and returns True if found
def lookup_table(table_name):
    response = clientdata.list_tables(
        ClusterIdentifier='redshift-cluster-1',
        Database='dev',
        DbUser='awsuser',
        TablePattern=table_name
    )
    print(response)
    if len(response['Tables']) == 0:
        return False
    else:
        return True

# Creates the table with one integer column
def create_table(table_name):
    sqlstmt = 'CREATE TABLE ' + table_name + ' (col1 integer);'
    print(sqlstmt)
    response = clientdata.execute_statement(
        ClusterIdentifier='redshift-cluster-1',
        Database='dev',
        DbUser='awsuser',
        Sql=sqlstmt,
        StatementName='CreateTable'
    )
    print(response)

# Inserts one row with an integer value for col1
def insert_data(table_name, dval):
    print(dval)
    sqlstmt = 'INSERT INTO ' + table_name + ' (col1) VALUES (' + str(dval) + ');'
    response = clientdata.execute_statement(
        ClusterIdentifier='redshift-cluster-1',
        Database='dev',
        DbUser='awsuser',
        Sql=sqlstmt,
        StatementName='InsertData'
    )
    print(response)

result = lookup_table('date')
if result:
    print("Table exists.")
else:
    print("Table does not exist!")

create_table("testtab")
insert_data("testtab", 11)
I am not using Lambda here; I executed this from my shell. This assumes credentials and the default region are already set up for the client. Hope this helps.
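Since the question is about Lambda: a minimal, hypothetical handler wrapping the functions above could look like this (the event keys 'table' and 'value' are assumptions for illustration):
def lambda_handler(event, context):
    # 'table' and 'value' are assumed event keys for this illustration
    table = event['table']
    if not lookup_table(table):
        create_table(table)
    insert_data(table, event['value'])
    return {'status': 'ok'}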