I am reading a parquet file as shown below. The code below reads the parquet file and converts its rows to ParquetProduct
structs, which I use later on to get data out of them.
    func (r *clientRepository) read(logg log.Prot, file string, bucket string) error {
        var err error
        fr, err := pars3.NewS3FileReader(context.Background(), bucket, file, r.s3Client.GetSession().Config)
        if err != nil {
            return errs.Wrap(err)
        }
        defer xio.CloseIgnoringErrors(fr)
        pr, err := reader.NewParquetReader(fr, nil, int64(r.cfg.Workers))
        if err != nil {
            return errs.Wrap(err)
        }
        if pr.GetNumRows() == 0 {
            logg.Infof("Skipping %s due to 0 rows", file)
            return nil
        }
        for {
            rows, err := pr.ReadByNumber(r.cfg.RowsToRead)
            if err != nil {
                return errs.Wrap(err)
            }
            if len(rows) <= 0 {
                break
            }
            // doing Marshal here first
            byteSlice, err := json.Marshal(rows)
            if err != nil {
                return errs.Wrap(err)
            }
            var productRows []ParquetProduct
            // and then Unmarshal here
            err = json.Unmarshal(byteSlice, &productRows)
            if err != nil {
                return errs.Wrap(err)
            }
            //.....
            // use productRows here
            //.....
        }
        return nil
    }
Problem Statement
I am marshalling first and then unmarshalling to get the required object. Is there any way to avoid all this? The ReadByNumber function (of the parquet-go library) returns []interface{}, so is there any way to get my []ParquetProduct slice back directly from the []interface{}?
I am using Go 1.19. This is the library I am using to read the parquet file: https://github.com/xitongsys/parquet-go
Is there a better, more efficient way to do this overall?
CodePudding user response:
Instead of using ReadByNumber, make a []ParquetProduct slice with the desired length and pass a pointer to it to Read.
    products := make([]ParquetProduct, r.cfg.RowsToRead)
    // ^ slice with length and capacity equal to r.cfg.RowsToRead
    err = pr.Read(&products)
    if err != nil {
        // ...
    }