I am using apache-arrow/go to read parquet data.
I can parse the data into a table using apache-arrow:
reader, err := ipc.NewReader(buf, ipc.WithAllocator(alloc))
if err != nil {
log.Println(err.Error())
return nil
}
defer reader.Release()
records := make([]array.Record, 0)
for reader.Next() {
rec := reader.Record()
rec.Retain()
defer rec.Release()
records = append(records, rec)
}
table := array.NewTableFromRecords(reader.Schema(), records)
Here, I can get the column info from table.Column(index), such as:
for i := range table.Schema().Fields() {
a := table.Column(i)
log.Println(a)
}
But the Column struct is defined as
type Column struct {
field arrow.Field
data *Chunked
}
and the println result is like
["WARN" "WARN" "WARN" "WARN" "WARN" "WARN" "WARN" "WARN" "WARN" "WARN"]
However, this is not a string or a slice. Is there any way I can get the data of each column as a string or as []interface{}?
Update:
I found that I can use a type assertion to get an element from col:
log.Println(col.(*array.Int64).Value(0))
But I am not sure if this is the recommended way to do it.
CodePudding user response:
When working with Arrow data, there are a few concepts to understand:
Array: Metadata + contiguous buffers of data.
Record Batch: A schema + a collection of Arrays that are all the same length.
Chunked Array: A group of Arrays of varying lengths but all the same data type. This allows you to treat multiple Arrays as one single column of data without having to copy them all into a contiguous buffer.
Column: Just a Field + a Chunked Array.
Table: A collection of Columns allowing you to treat multiple non-contiguous arrays as a single large table without having to copy them all into contiguous buffers.
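To make the Chunked Array idea concrete, here is a minimal plain-Go sketch. The chunkedInt64 type and its Len and At methods are hypothetical stand-ins written for illustration and are not part of the Arrow API; the only point is that several separately allocated slices can be addressed as one logical column without copying them into a single buffer.

```go
package main

import "fmt"

// chunkedInt64 is a hypothetical stand-in for a chunked array:
// several independent slices that together form one logical column.
type chunkedInt64 struct {
	chunks [][]int64
}

// Len returns the total number of values across all chunks.
func (c chunkedInt64) Len() int {
	n := 0
	for _, ch := range c.chunks {
		n += len(ch)
	}
	return n
}

// At returns the value at logical index i by walking the chunks.
// The second result is false when i is out of range.
func (c chunkedInt64) At(i int) (int64, bool) {
	if i < 0 {
		return 0, false
	}
	for _, ch := range c.chunks {
		if i < len(ch) {
			return ch[i], true
		}
		i -= len(ch)
	}
	return 0, false
}

func main() {
	// Two chunks of different lengths, as if read from two record batches.
	col := chunkedInt64{chunks: [][]int64{{1, 2, 3}, {4, 5}}}
	v, ok := col.At(3)
	fmt.Println(col.Len(), v, ok) // prints "5 4 true"
}
```

In this model, a Column would pair such a chunked value with a field name and type, and a Table would be a set of those columns.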
In your case, you're reading multiple record batches (groups of contiguous Arrays) and treating them as a single large table. There are a few different ways you can work with the data:
One way is to use a TableReader:
tr := array.NewTableReader(tbl, 5)
defer tr.Release()
for tr.Next() {
rec := tr.Record()
for i, col := range rec.Columns() {
// do something with the Array
}
}
Another way would be to interact with the columns directly as you were in your example:
for i := 0; i < int(table.NumCols()); i++ {
col := table.Column(i)
for _, chunk := range col.Data().Chunks() {
// do something with chunk (an arrow.Array)
}
}
Either way, you eventually have an arrow.Array to deal with, which is an interface containing one of the typed Array types. At this point you are going to have to switch on something. You could type switch on the type of the Array itself:
switch arr := col.(type) {
case *array.Int64:
// do stuff with arr
case *array.Int32:
// do stuff with arr
case *array.String:
// do stuff with arr
...
}
Alternatively, you could switch on the ID of the data type:
switch col.DataType().ID() {
case arrow.INT64:
// type assertion needed col.(*array.Int64)
case arrow.INT32:
// type assertion needed col.(*array.Int32)
...
}
For getting the data out of the array, primitive types which are stored contiguously tend to have a *Values method which will return a slice of the type. For example, array.Int64 has Int64Values(), which returns []int64. Otherwise, all of the types have .Value(int) methods which return the value at a particular index, as you showed in your example.
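Putting the chunk loop, the type switch, and the per-index Value call together, a conversion to []interface{} could be sketched as below. To keep the sketch runnable without the Arrow module, int64Array, stringArray, and the valueArray interface are hypothetical stand-ins that only mimic the Len, IsNull, and Value shape of typed arrays; flatten is likewise an illustrative helper, not a library function. With the real library you would switch on types such as *array.Int64 and *array.String instead.

```go
package main

import "fmt"

// valueArray is a hypothetical stand-in for the common array interface.
type valueArray interface {
	Len() int
	IsNull(i int) bool
}

// int64Array and stringArray are hypothetical stand-ins for typed arrays.
// null[i] reports whether slot i is null; a nil slice means "no nulls".
type int64Array struct {
	vals []int64
	null []bool
}

func (a *int64Array) Len() int          { return len(a.vals) }
func (a *int64Array) IsNull(i int) bool { return a.null != nil && a.null[i] }
func (a *int64Array) Value(i int) int64 { return a.vals[i] }

type stringArray struct {
	vals []string
	null []bool
}

func (a *stringArray) Len() int           { return len(a.vals) }
func (a *stringArray) IsNull(i int) bool  { return a.null != nil && a.null[i] }
func (a *stringArray) Value(i int) string { return a.vals[i] }

// flatten copies every value of every chunk into one []interface{},
// storing nil for null slots. Unknown array types produce an error.
func flatten(chunks []valueArray) ([]interface{}, error) {
	var out []interface{}
	for _, chunk := range chunks {
		switch arr := chunk.(type) {
		case *int64Array:
			for i := 0; i < arr.Len(); i++ {
				if arr.IsNull(i) {
					out = append(out, nil)
					continue
				}
				out = append(out, arr.Value(i))
			}
		case *stringArray:
			for i := 0; i < arr.Len(); i++ {
				if arr.IsNull(i) {
					out = append(out, nil)
					continue
				}
				out = append(out, arr.Value(i))
			}
		default:
			return nil, fmt.Errorf("unsupported array type %T", chunk)
		}
	}
	return out, nil
}

func main() {
	chunks := []valueArray{
		&stringArray{vals: []string{"WARN", "INFO"}},
		&stringArray{vals: []string{"", "WARN"}, null: []bool{true, false}},
	}
	vals, err := flatten(chunks)
	if err != nil {
		panic(err)
	}
	fmt.Println(vals) // prints "[WARN INFO <nil> WARN]"
}
```

The explicit null check matters because a typed Value call on a null slot returns a placeholder rather than signalling that the value is missing.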
Hope this helps!
CodePudding user response:
- Make sure you use v9 (import "github.com/apache/arrow/go/v9/arrow"), because it implements json.Marshaler (from go-json).
- Use "github.com/goccy/go-json" as the marshaler (because of this).
Then you can use a TableReader to marshal each column and unmarshal it into a []any.
For your example, it might look like this:
import (
"fmt"
"github.com/apache/arrow/go/v9/arrow"
"github.com/apache/arrow/go/v9/arrow/array"
"github.com/apache/arrow/go/v9/arrow/memory"
"github.com/goccy/go-json"
)
...
tr := array.NewTableReader(tbl, 6)
defer tr.Release()
// fmt.Printf("tbl.NumRows() = %+v\n", tbl.NumRows())
// fmt.Printf("tbl.NumCols() = %+v\n", tbl.NumCols())
// keySlice keeps the column names in the same order as the data source
keySlice := make([]string, 0, tbl.NumCols())
res := make(map[string][]any, 0)
var key string
for tr.Next() {
rec := tr.Record()
for i, col := range rec.Columns() {
key = rec.ColumnName(i)
if res[key] == nil {
res[key] = make([]any, 0)
keySlice = append(keySlice, key)
}
var tmp []any
b2, err := json.Marshal(col)
if err != nil {
panic(err)
}
err = json.Unmarshal(b2, &tmp)
if err != nil {
panic(err)
}
// fmt.Printf("key = %s\n", key)
// fmt.Printf("tmp = %+v\n", tmp)
res[key] = append(res[key], tmp...)
}
}
fmt.Println("res", res)
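One thing to keep in mind with the marshal-then-unmarshal approach is that the element types change along the way. The sketch below shows this with the standard library's encoding/json instead of go-json, which is an assumption made so the example runs without extra modules; fakeColumn and toAny are hypothetical names used only for illustration and are not Arrow types. When the target is []interface{}, encoding/json decodes every JSON number as float64, so integer columns do not come back as int64.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fakeColumn is a hypothetical stand-in for a column that can encode
// itself as a JSON array. It is not an Arrow type.
type fakeColumn struct {
	vals []int64
}

// MarshalJSON encodes the column as a plain JSON array of numbers.
func (c fakeColumn) MarshalJSON() ([]byte, error) {
	return json.Marshal(c.vals)
}

// toAny round-trips a value through JSON into []interface{}.
func toAny(v json.Marshaler) ([]interface{}, error) {
	b, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var out []interface{}
	if err := json.Unmarshal(b, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	vals, err := toAny(fakeColumn{vals: []int64{1, 2, 3}})
	if err != nil {
		panic(err)
	}
	// The values were int64 going in, but are float64 coming out.
	fmt.Printf("%v %T\n", vals, vals[0]) // prints "[1 2 3] float64"
}
```

If exact integer values matter, the type switch approach avoids this conversion, at the cost of handling each array type explicitly.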