Can we read a Parquet file in Hazelcast Jet?


I am trying to read a Parquet file via Hazelcast Jet. I have written the code below, which works fine, but does Hazelcast provide any built-in source for reading Parquet files?

import com.hazelcast.jet.pipeline.BatchSource;
import com.hazelcast.jet.pipeline.SourceBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

import java.util.HashMap;
import java.util.Map;

BatchSource<Object> parquetData = SourceBuilder
        .batch("parquet-source", ctx ->
                // createFn: open the Avro-backed Parquet reader once per job;
                // FunctionEx permits checked exceptions, so no try/catch is needed
                AvroParquetReader.<GenericData.Record>builder(
                        new Path("D:/test/1651070287920.parquet")).build())
        .<Object>fillBufferFn((reader, buf) -> {
            GenericRecord record = reader.read();
            if (record != null) {
                // headers and rowcount are fields defined elsewhere in the class
                Map<String, String> map = new HashMap<>();
                for (int i = 0; i < headers[0].length; i++) {
                    String value = record.get(i) == null ? "" : record.get(i).toString();
                    map.put(headers[0][i], value);
                }
                rowcount = rowcount + 1;
                buf.add(map);
            } else {
                // no more records: signal that the batch source is done
                buf.close();
            }
        })
        .destroyFn(ParquetReader::close)
        .build();

Please let me know if Hazelcast Jet already provides such a source.

CodePudding user response:

Parquet files that use Avro for serialization can be read using the Unified File Connector. See also the code sample.

CodePudding user response:

Parquet is supported by the Unified File Connector:

BatchSource<SpecificUser> source = FileSources.files("/data")
  .glob("users.parquet")
  .format(FileFormat.<SpecificUser>parquet())
  .useHadoopForLocalFiles(true)
  .build();
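
For completeness, a minimal sketch of submitting a job that uses this source. This assumes the Jet 4.x API (Jet.bootstrappedInstance()) and that SpecificUser is a class generated from your Avro schema; neither is shown in the answer:

Pipeline p = Pipeline.create();
p.readFrom(source)
 .writeTo(Sinks.logger());     // log each SpecificUser record, for illustration

JetInstance jet = Jet.bootstrappedInstance();  // obtain a Jet instance (Jet 4.x API)
jet.newJob(p).join();                          // submit the pipeline and wait for completion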

If you don't have a class corresponding to your schema, or your schema is more dynamic, and you want the source to return org.apache.avro.generic.GenericRecord (which you can then map to Map<String, String>), you can use the following:

BatchSource<GenericRecord> source = FileSources.files(currentDir + "/target/parquet")
  .glob("file.parquet")
  .option("avro.serialization.data.model", GenericData.class.getName())
  .useHadoopForLocalFiles(true)
  .format(FileFormat.parquet())
  .build();
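
And a hedged sketch of the GenericRecord-to-Map mapping mentioned above, taking the field names from the record's own schema rather than a separate headers array (the logger sink is only for illustration):

// requires org.apache.avro.Schema and the com.hazelcast.jet.pipeline imports
Pipeline p = Pipeline.create();
p.readFrom(source)
 .map(record -> {
     // convert each Avro GenericRecord into a Map<String, String>
     Map<String, String> map = new HashMap<>();
     for (Schema.Field field : record.getSchema().getFields()) {
         Object value = record.get(field.name());
         map.put(field.name(), value == null ? "" : value.toString());
     }
     return map;
 })
 .writeTo(Sinks.logger());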