Import CSV parts from S3 into RDS Aurora PostgreSQL


I spent some time fiddling with the details of the AWS S3 extension for Postgres described here: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/postgresql-s3-export.html#postgresql-s3-export-access-bucket (Postgres extension configuration, roles, policies, finicky function input details).

I want to easily export and then re-import huge tables for testing purposes (indexes, generated columns, partitions, etc.) in order to optimize database performance.

I am using this extension because I want to avoid using my laptop to store the file with commands like the following, which involve a lot of network I/O and suffer from slow internet connections, broken pipes when the connection gets dropped by the operating system after a while, and other problems that come with huge tables:

# store CSV from S3 to local
aws s3 cp s3://my_bucket/my_sub_path/my_file.csv /my_local_directory/my_file.csv
# import from local CSV to AWS RDS Aurora PostgreSQL
psql -h my_rds.amazonaws.com -U my_username -d my_dbname -c "\COPY table FROM 'my_file.csv' CSV HEADER"

I managed to export a very big table (160GB) into CSV files on S3 with:

SELECT * from aws_s3.query_export_to_s3(
    'SELECT * FROM my_schema.my_large_table', 
    aws_commons.create_s3_uri(
        'my_bucket/my_subpath', 
        'my_file.csv', 
        'eu-central-1'
    ),
    options:='format csv'
);

However, this ends up as lots of "part files" in S3:

  1. the first one with that same CSV filename my_file.csv
  2. all the others like my_file.csv_part2 ... my_file.csv_part20 and so on

Now, I don't think this is a problem as long as I am able to import the CSV data back into another AWS RDS Aurora (PostgreSQL) instance. However, I am not sure which strategy is best here: whether it's better to keep all these CSV part files, or whether I can configure the export to produce only one huge CSV file (160GB).
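To double-check exactly which part files the export produced (and their sizes), a quick listing from the AWS CLI is enough, e.g.:

# list the part files produced by the export
aws s3 ls s3://my_bucket/my_subpath/ --human-readable | grep my_file.csv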

Now the import stuff (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PostgreSQL.S3Import.html):

It turns out I have to import all these "part files" with PL/pgSQL, but then I get lost in the details of how to format those strings for the S3 paths, and in general I see all sorts of errors (on both export and import). A single file import takes around 20 minutes, so it's quite frustrating to figure out what is going wrong.

  1. What's wrong with the source code / error below?
  2. Is there a better way to handle all this export/import at scale (160GB tables)?
DO $$
DECLARE
my_csv_s3_sub_path text;
BEGIN
FOR cnt IN 2..26 LOOP
    my_csv_s3_sub_path := 'my_subpath/my_file.csv_part' || cnt;
    RAISE NOTICE '% START loading CSV file % from S3', now(), cnt;
    SELECT aws_s3.table_import_from_s3(
        'my_schema.my_large_table_new',
        '',
        '(format csv)',
        aws_commons.create_s3_uri(
            'my_bucket', 
            my_csv_s3_sub_path, 
            'eu-central-1'
        )
    );
    RAISE NOTICE '% STOP loading CSV file % from S3', now(), cnt;
END LOOP;
END; $$

The code above gives:

SQL Error [42601]: ERROR: query has no destination for result data
  Hint: If you want to discard the results of a SELECT, use PERFORM instead.
  Where: PL/pgSQL function inline_code_block line 8 at SQL statement

I think it's related to variables and string interpolation, because I need to dynamically generate the CSV file name in S3 that gets passed to the Postgres AWS extension.

But I had all sorts of other errors before, e.g. an export/import inconsistency in how I specified the S3 bucket sub-path, which led the Postgres AWS S3 extension to throw an HTTP 400 error:

SQL Error [XX000]: ERROR: HTTP 400. Check your arguments and try again. Where: SQL function "table_import_from_s3" statement 1
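For reference, aws_commons.create_s3_uri takes the bucket name, the full object key (sub-path plus file name) and the region as three separate arguments, so a consistent call for one of the part files would look like this (placeholder names from above):

SELECT aws_commons.create_s3_uri(
    'my_bucket',                     -- bucket name only, no sub-path here
    'my_subpath/my_file.csv_part2',  -- full object key inside the bucket
    'eu-central-1'                   -- region
);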

Is there a better alternative to export/import huge tables from/to AWS RDS Aurora PostgreSQL?

CodePudding user response:

The solution was to:

  1. use PERFORM instead of SELECT when running aws_s3.table_import_from_s3 inside a PL/pgSQL block,
  2. loop over all the S3 paths to the CSV file parts, e.g. my_subpath/my_file.csv_part1 to my_subpath/my_file.csv_part26, bearing in mind there's also the "part 0" my_subpath/my_file.csv (see the full loop sketch below),
  3. create the table index AFTER the data I/O above
-- this goes into the loop for all the CSV parts
PERFORM aws_s3.table_import_from_s3(
    'my_schema.my_large_table_new',
    '',
    '(format csv)',
    aws_commons.create_s3_uri(
        'my_bucket', 
        'my_subpath/my_file.csv_part26', 
        'eu-central-1'
    )
);

-- then AFTER the CSV ingestion create the index on the table
CREATE INDEX my_dx ON my_schema.my_large_table_new USING btree (my_column);
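Putting points 1 and 2 together, the whole import might look like the following sketch (assuming the parts run from the unsuffixed my_file.csv up to my_file.csv_part26, as in the question; table, bucket and region are the placeholders used above):

DO $$
DECLARE
    my_csv_s3_sub_path text;
BEGIN
    FOR cnt IN 1..26 LOOP
        -- the first exported file has no _part suffix, the rest do
        IF cnt = 1 THEN
            my_csv_s3_sub_path := 'my_subpath/my_file.csv';
        ELSE
            my_csv_s3_sub_path := 'my_subpath/my_file.csv_part' || cnt;
        END IF;
        RAISE NOTICE '% START loading % from S3', now(), my_csv_s3_sub_path;
        -- PERFORM, not SELECT: the function's return value is discarded
        PERFORM aws_s3.table_import_from_s3(
            'my_schema.my_large_table_new',
            '',              -- empty column list: CSV columns match the table
            '(format csv)',
            aws_commons.create_s3_uri(
                'my_bucket',
                my_csv_s3_sub_path,
                'eu-central-1'
            )
        );
        RAISE NOTICE '% STOP loading % from S3', now(), my_csv_s3_sub_path;
    END LOOP;
END;
$$;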

This still took 1 day of processing for all the CSV files (6GB each), which is not very practical for most scenarios.

For the sake of SQL completeness, make sure the Postgres extension is installed and configured like this:

DROP EXTENSION IF EXISTS aws_s3;
DROP EXTENSION IF EXISTS aws_commons;
CREATE EXTENSION aws_s3 CASCADE;
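To verify that both extensions ended up installed (CASCADE pulls in aws_commons as a dependency of aws_s3), you can query the catalog:

-- check the installed extensions and their versions
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('aws_s3', 'aws_commons');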

You will still have to configure policies, roles and all of that on AWS.
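As a rough sketch of that AWS-side setup (the role and policy names, account ID and cluster identifier below are placeholders; the exact policy actions are listed in the AWS docs linked above), the IAM role has to be created, granted S3 access, and then associated with the Aurora cluster for the s3Import feature:

# Placeholders: adjust role/policy names, the account ID and the cluster identifier.
# The attached policy must allow at least s3:GetObject and s3:ListBucket on the
# bucket holding the CSV parts (plus s3:PutObject for exports).
aws iam create-role \
    --role-name rds-s3-import-role \
    --assume-role-policy-document file://rds-trust-policy.json
aws iam attach-role-policy \
    --role-name rds-s3-import-role \
    --policy-arn arn:aws:iam::123456789012:policy/rds-s3-import-policy
# associate the role with the Aurora cluster for the S3 import feature
aws rds add-role-to-db-cluster \
    --db-cluster-identifier my-aurora-cluster \
    --feature-name s3Import \
    --role-arn arn:aws:iam::123456789012:role/rds-s3-import-role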
