I am trying to import data into a table in RDS. The data is in Parquet format and sits in S3. My plan was to import the data from S3 into HDFS using Sqoop, and then export it into the RDS table, also with Sqoop. I was able to find the command to export data from HDFS to RDS, but I couldn't find one for importing Parquet data from S3. Could you please help with how to structure the sqoop import command in this case?
CodePudding user response:
You can use Spark to copy the data from S3 to HDFS.
Read this blog for more details.
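For reference, a minimal PySpark sketch of that copy could look like the one below. The bucket, file and target paths are placeholders, and it assumes the cluster already has the hadoop-aws/S3A connector and S3 credentials configured:

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("s3-parquet-to-hdfs").getOrCreate()

# Read the Parquet data straight from S3 ...
df = spark.read.parquet("s3a://<bucket_name>/<parquet_file>")

# ... and write it back out as Parquet on HDFS
df.write.mode("overwrite").parquet("hdfs:///user/hive/warehouse/parquetdata")

spark.stop()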
CodePudding user response:
The approach that seemed simplest and best to me is as follows:
- Create a Parquet table in Hive and load it with the Parquet data from S3
CREATE EXTERNAL TABLE IF NOT EXISTS parquet_table (<column name> <column's datatype>) STORED AS PARQUET;
LOAD DATA INPATH 's3a://<bucket_name>/<parquet_file>' INTO TABLE parquet_table;
- Create a CSV table in Hive and load it with the data from the Parquet table
CREATE EXTERNAL TABLE IF NOT EXISTS csv_table (<column name> <column's datatype>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///user/hive/warehouse/csvdata';
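Then populate it from the Parquet table with something along these lines (assuming the column lists of the two tables match):

INSERT INTO TABLE csv_table SELECT * FROM parquet_table;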
- Now that we have a CSV/textfile table in Hive, Sqoop can easily export its HDFS directory to the MySQL table in RDS.
sqoop export \
  --connect jdbc:mysql://<host>:3306/<db_name> \
  --username <username> \
  --password-file hdfs:///user/test/mysql.password \
  --table <mysql_table_name> \
  --export-dir hdfs:///user/hive/warehouse/csvdata \
  --columns <column names to be exported, without whitespace in between the column names> \
  --input-null-string "\\N" \
  --input-null-non-string "\\N" \
  --batch -m 1
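One note on --password-file: Sqoop expects it to point to a file on HDFS that contains nothing but the password (no trailing newline). If you don't already have one, it can be created with something like the following, using the same path as in the command above:

echo -n '<mysql_password>' | hdfs dfs -put - /user/test/mysql.password
hdfs dfs -chmod 400 /user/test/mysql.password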