I am trying to read a Parquet file via Hazelcast. For that I have written the code below, which works fine, but does Hazelcast provide any built-in source to read Parquet files?
// headers and rowcount are fields defined elsewhere in my class
BatchSource<Object> parquetSource = SourceBuilder
        .batch("parquet-source", ctx -> AvroParquetReader
                .<GenericData.Record>builder(new Path("D:/test/1651070287920.parquet"))
                .build())
        .<Object>fillBufferFn((reader, buf) -> {
            GenericRecord record = reader.read();
            if (record != null) {
                Map<String, String> map = new HashMap<>();
                for (int i = 0; i < headers[0].length; i++) {
                    String value = record.get(i) == null ? "" : record.get(i).toString();
                    map.put(headers[0][i], value);
                }
                rowcount = rowcount + 1;
                buf.add(map);
            } else {
                // no more records: close the reader and signal end of batch
                reader.close();
                buf.close();
            }
        })
        .build();
Please let me know if Hazelcast Jet already provides such a source.
CodePudding user response:
Parquet files using Avro for serialization can be read using the Unified File Connector. See also the code sample.
CodePudding user response:
Parquet is supported by the Unified File Connector:
BatchSource<SpecificUser> source = FileSources.files("/data")
.glob("users.parquet")
.format(FileFormat.<SpecificUser>parquet())
.useHadoopForLocalFiles(true)
.build();
If you don't have a class corresponding to your schema, or your schema is more dynamic and you want the source to return org.apache.avro.generic.GenericRecord, which you can then map to Map<String, String>, you can use the following:
BatchSource<GenericRecord> source = FileSources.files(currentDir + "/target/parquet")
.glob("file.parquet")
.option("avro.serialization.data.model", GenericData.class.getName())
.useHadoopForLocalFiles(true)
.format(FileFormat.parquet())
.build();
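To go from GenericRecord to Map<String, String>, a mapping stage in the Jet pipeline could look roughly like this. This is a sketch, not part of the connector API: the pipeline wiring and the use of Sinks.logger() are illustrative assumptions, and it iterates over the record's Avro schema fields rather than relying on hard-coded headers:

```java
// Sketch: map each GenericRecord from the Parquet source to a
// Map<String, String> keyed by the Avro field names.
Pipeline p = Pipeline.create();
p.readFrom(source)
 .map(record -> {
     Map<String, String> map = new HashMap<>();
     for (Schema.Field field : record.getSchema().getFields()) {
         Object value = record.get(field.name());
         map.put(field.name(), value == null ? "" : value.toString());
     }
     return map;
 })
 .writeTo(Sinks.logger()); // illustrative sink; replace as needed
```

Deriving the keys from record.getSchema().getFields() avoids keeping a separate headers array in sync with the file's schema.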