Can we read a Parquet file in Hazelcast Jet?


I am trying to read a Parquet file via Hazelcast Jet. I have written the code below, which works fine, but does Hazelcast provide any built-in source for reading Parquet files?

import com.hazelcast.jet.pipeline.BatchSource;
import com.hazelcast.jet.pipeline.SourceBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

import java.util.HashMap;
import java.util.Map;

BatchSource<Object> parquetData = SourceBuilder
        .batch("parquet-source", ctx ->
                // createFn: open the Avro-backed Parquet reader once per job;
                // FunctionEx permits checked exceptions, so no try/catch is needed
                AvroParquetReader.<GenericData.Record>builder(
                        new Path("D:/test/1651070287920.parquet")).build())
        .<Object>fillBufferFn((reader, buf) -> {
            GenericRecord record = reader.read();
            if (record != null) {
                // headers and rowcount are fields defined elsewhere in the class
                Map<String, String> map = new HashMap<>();
                for (int i = 0; i < headers[0].length; i++) {
                    String value = record.get(i) == null ? "" : record.get(i).toString();
                    map.put(headers[0][i], value);
                }
                rowcount = rowcount + 1;
                buf.add(map);
            } else {
                // no more records: signal that the batch source is done
                buf.close();
            }
        })
        .destroyFn(ParquetReader::close)
        .build();

Please let me know if Hazelcast Jet already provides such a source.

CodePudding user response:

Parquet files that use Avro for serialization can be read using the Unified File Connector. See also the code sample.

CodePudding user response:

Parquet is supported by the Unified File Connector:

BatchSource<SpecificUser> source = FileSources.files("/data")
  .glob("users.parquet")
  .format(FileFormat.<SpecificUser>parquet())
  .useHadoopForLocalFiles(true)
  .build();
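
For completeness, a minimal sketch of submitting a job that uses this source. This assumes the Jet 4.x API (Jet.bootstrappedInstance()) and that SpecificUser is a class generated from your Avro schema; neither is shown in the answer:

Pipeline p = Pipeline.create();
p.readFrom(source)
 .writeTo(Sinks.logger());     // log each SpecificUser record, for illustration

JetInstance jet = Jet.bootstrappedInstance();  // obtain a Jet instance (Jet 4.x API)
jet.newJob(p).join();                          // submit the pipeline and wait for completion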

If you don't have a class corresponding to your schema, or your schema is more dynamic, and you want the source to return org.apache.avro.generic.GenericRecord (which you can then map to Map<String, String>), you can use the following:

BatchSource<GenericRecord> source = FileSources.files(currentDir + "/target/parquet")
  .glob("file.parquet")
  .option("avro.serialization.data.model", GenericData.class.getName())
  .useHadoopForLocalFiles(true)
  .format(FileFormat.parquet())
  .build();
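
And a hedged sketch of the GenericRecord-to-Map mapping mentioned above, taking the field names from the record's own schema rather than a separate headers array (the logger sink is only for illustration):

// requires org.apache.avro.Schema and the com.hazelcast.jet.pipeline imports
Pipeline p = Pipeline.create();
p.readFrom(source)
 .map(record -> {
     // convert each Avro GenericRecord into a Map<String, String>
     Map<String, String> map = new HashMap<>();
     for (Schema.Field field : record.getSchema().getFields()) {
         Object value = record.get(field.name());
         map.put(field.name(), value == null ? "" : value.toString());
     }
     return map;
 })
 .writeTo(Sinks.logger());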