I am having trouble using AvroParquetReader inside a Flink application (Flink >= 1.15).
Motivation (AKA why I want to use it)
According to the official documentation, one can read Parquet files in Flink into a FileSource. However, I only want to write a function that loads a Parquet file into Avro records, without creating a DataStreamSource. In particular, I want to load Parquet files into a FileInputFormat, which is a completely separate API (for some reason), and digging one level deeper, I could not see an easy way to cast a BulkFormat or StreamFormat into it. It would therefore be much simpler to use org.apache.parquet.avro.AvroParquetReader to read the files directly.
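For context, the helper I have in mind looks roughly like this (a minimal sketch; ParquetLoader and loadParquet are placeholder names I made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetLoader {

    // Load all records of a Parquet file into a list of Avro GenericRecords,
    // without going through Flink's FileSource / DataStreamSource machinery.
    public static List<GenericRecord> loadParquet(String path) throws IOException {
        List<GenericRecord> records = new ArrayList<>();
        try (ParquetReader<GenericRecord> reader =
                AvroParquetReader.<GenericRecord>builder(new Path(path))
                        .withConf(new Configuration())
                        .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                records.add(record);
            }
        }
        return records;
    }
}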
Error description
However, I hit this error when running the Flink application locally: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
This is quite unexpected, since the flink-s3-fs-hadoop jar has already been loaded by the plugin system (and its path has been added to HADOOP_CLASSPATH as well). So not only does Flink know where it is; the local Hadoop installation should as well.
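For what it's worth, a quick way to check what the user-code classloader can actually see is a probe like this inside the application's main():

// Probe whether the S3A filesystem class is on the application classpath.
// This is exactly the class the AvroParquetReader call fails to find.
try {
    Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem");
    System.out.println("S3AFileSystem is visible to this classloader");
} catch (ClassNotFoundException e) {
    System.out.println("S3AFileSystem is NOT visible here: " + e);
}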
Comments:
- Without AvroParquetReader, the Flink app can write to S3 without problems.
- The Hadoop installation is not a Flink-shaded one; it is installed separately (version 2.10).
I would love to hear if you have any insights about this. AvroParquetReader should be able to read the Parquet files without problems.
CodePudding user response:
There is an official Hadoop guide with some potential fixes for this issue, which can be found here. If I recall correctly, the issue was caused by missing Hadoop AWS dependencies.
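To expand on that: Flink's plugin jars (such as flink-s3-fs-hadoop) live in an isolated plugin classloader and are only reachable through Flink's own FileSystem abstraction, so a direct AvroParquetReader call in user code goes through the plain Hadoop FileSystem API and needs hadoop-aws (plus its AWS SDK dependency) on the application classpath itself. Assuming those jars are present, a sketch of wiring up s3a explicitly could look like this (bucket name and file path are placeholders):

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class S3aParquetProbe {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Bind the s3a:// scheme to S3AFileSystem explicitly. This only
        // resolves if hadoop-aws is on the application classpath; the jar
        // inside Flink's plugins/ directory is not visible to user code.
        conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        try (ParquetReader<GenericRecord> reader =
                AvroParquetReader.<GenericRecord>builder(new Path("s3a://my-bucket/data.parquet"))
                        .withConf(conf)
                        .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}

If the class still cannot be found, the next thing to check is that both hadoop-aws and its matching AWS SDK jar are present, with versions that match the separately installed Hadoop 2.10.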