I am trying to import data into a table in RDS. The data is in Parquet format and sits in S3. My plan was to import the data from S3 into HDFS using Sqoop, and then export it into the RDS table, also with Sqoop. I was able to find the command to export data from HDFS to RDS, but I couldn't find one for importing Parquet data from S3. Could you please help with how to structure the sqoop import command in this case?
CodePudding user response:
You can use Spark to copy the data from S3 to HDFS.
Read this blog for more details.
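For reference, a minimal PySpark sketch of that copy could look like the one below. The bucket, file and target paths are placeholders, and it assumes the cluster already has the hadoop-aws/S3A connector and S3 credentials configured:

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("s3-parquet-to-hdfs").getOrCreate()

# Read the Parquet data straight from S3 ...
df = spark.read.parquet("s3a://<bucket_name>/<parquet_file>")

# ... and write it back out as Parquet on HDFS
df.write.mode("overwrite").parquet("hdfs:///user/hive/warehouse/parquetdata")

spark.stop()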
CodePudding user response:
The approach that seemed simplest and best to me is as follows:
- Create a Parquet table in Hive and load it with the Parquet data from S3
CREATE EXTERNAL TABLE IF NOT EXISTS parquet_table (<column name> <column's datatype>) STORED AS PARQUET;
LOAD DATA INPATH 's3a://<bucket_name>/<parquet_file>' INTO TABLE parquet_table;
- Create a CSV table in Hive and load it with the data from the Parquet table
CREATE EXTERNAL TABLE IF NOT EXISTS csv_table (<column name> <column's datatype>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///user/hive/warehouse/csvdata';
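Then populate it from the Parquet table with something along these lines (assuming the column lists of the two tables match):

INSERT INTO TABLE csv_table SELECT * FROM parquet_table;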
- Now that we have a CSV/textfile table in Hive, Sqoop can easily export its HDFS directory to the MySQL table in RDS.
sqoop export \
  --connect jdbc:mysql://<host>:3306/<db_name> \
  --username <username> \
  --password-file hdfs:///user/test/mysql.password \
  --table <mysql_table_name> \
  --export-dir hdfs:///user/hive/warehouse/csvdata \
  --columns <column names to be exported, without whitespace in between the column names> \
  --input-null-string "\\N" \
  --input-null-non-string "\\N" \
  --batch -m 1
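One note on --password-file: Sqoop expects it to point to a file on HDFS that contains nothing but the password (no trailing newline). If you don't already have one, it can be created with something like the following, using the same path as in the command above:

echo -n '<mysql_password>' | hdfs dfs -put - /user/test/mysql.password
hdfs dfs -chmod 400 /user/test/mysql.password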