How to import parquet data from S3 into HDFS using Sqoop?


I am trying to import data into a table in RDS. The data is in Parquet format and sits in S3. My plan was to import the data from S3 into HDFS using Sqoop and then export it into the RDS table, also with Sqoop. I was able to find the command for exporting data from HDFS to RDS, but I couldn't find one for importing Parquet data from S3. Could you please help with how to structure the Sqoop import command in this case?

CodePudding user response:

You can use Spark to copy the data from S3 to HDFS.

Read this blog for more details.
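
If you take the Spark route, here is a minimal sketch in Scala of what the copy job could look like (run via spark-submit; it assumes the cluster already has the S3A connector and AWS credentials configured, and the bucket and path names below are placeholders):

import org.apache.spark.sql.SparkSession

object CopyParquetS3ToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CopyParquetS3ToHdfs")
      .getOrCreate()

    // Read the Parquet data directly from S3 through the S3A connector
    val df = spark.read.parquet("s3a://<bucket_name>/<parquet_path>/")

    // Write it to HDFS, keeping the Parquet format
    df.write.mode("overwrite").parquet("hdfs:///user/test/parquet_data/")

    spark.stop()
  }
}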

CodePudding user response:

The approach that seemed simplest and best to me is as follows:

  • Create a Parquet table in Hive and load it with the Parquet data from S3
CREATE EXTERNAL TABLE IF NOT EXISTS parquet_table (<column name> <column's datatype>) STORED AS PARQUET;

LOAD DATA INPATH 's3a://<bucket_name>/<parquet_file>' INTO TABLE parquet_table;
  • Create a CSV table in Hive and load it with the data from the Parquet table
CREATE EXTERNAL TABLE IF NOT EXISTS csv_table (<column name> <column's datatype>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///user/hive/warehouse/csvdata';

-- copy the Parquet data into the CSV-backed table
INSERT INTO TABLE csv_table SELECT * FROM parquet_table;
  • Now that we have a CSV/textfile table in Hive, Sqoop can easily export it from HDFS to the MySQL table in RDS.
sqoop export \
  --table <mysql_table_name> \
  --export-dir hdfs:///user/hive/warehouse/csvdata \
  --connect jdbc:mysql://<host>:3306/<db_name> \
  --username <username> \
  --password-file hdfs:///user/test/mysql.password \
  --batch -m 1 \
  --input-null-string "\\N" \
  --input-null-non-string "\\N" \
  --columns <comma-separated column names to export, with no whitespace between them>