How to get columns data from golang apache-arrow?

Time:08-26

I am using apache-arrow/go to read parquet data.

I can parse the data into a table using apache-arrow:

    reader, err := ipc.NewReader(buf, ipc.WithAllocator(alloc))
    if err != nil {
        log.Println(err.Error())
        return nil
    }
    defer reader.Release()
    records := make([]array.Record, 0)
    for reader.Next() {
        rec := reader.Record()
        rec.Retain()
        defer rec.Release()
        records = append(records, rec)
    }
    table := array.NewTableFromRecords(reader.Schema(), records)

Here, I can get the column info from table.Column(index), such as:

for i := range table.Schema().Fields() {
    a := table.Column(i)
    log.Println(a)
}

But the Column struct is defined as

type Column struct {
    field arrow.Field
    data  *Chunked
}

and the println result is like

["WARN" "WARN" "WARN" "WARN" "WARN" "WARN" "WARN" "WARN" "WARN" "WARN"]

However, this is not a string or a slice. Is there any way I can get the data of each column as strings or as a []interface{}?

Update:

I found that I can use a type assertion to get an element from col:

log.Println(col.(*array.Int64).Value(0))

But I am not sure if this is the recommended way to use it.

CodePudding user response:

When working with Arrow data, there are a couple of concepts to understand:

Array: Metadata + contiguous buffers of data

Record Batch: A schema + a collection of Arrays that are all the same length.

Chunked Array: A group of Arrays of varying lengths but all the same data type. This allows you to treat multiple Arrays as one single column of data without having to copy them all into a contiguous buffer.

Column: Just a Field + a Chunked Array

Table: A collection of Columns allowing you to treat multiple non-contiguous arrays as a single large table without having to copy them all into contiguous buffers.

In your case, you're reading multiple record batches (groups of contiguous Arrays) and treating them as a single large table. There are a few different ways you can work with the data:

One way is to use a TableReader:

tr := array.NewTableReader(tbl, 5)
defer tr.Release()

for tr.Next() {
    rec := tr.Record()
    for i, col := range rec.Columns() {
        // do something with the Array
    }
}

Another way would be to interact with the columns directly as you were in your example:

for i := 0; i < int(table.NumCols()); i++ {
    col := table.Column(i)
    for _, chunk := range col.Data().Chunks() {
        // do something with chunk (an arrow.Array)
    }
}

Either way, you eventually have an arrow.Array to deal with, which is an interface containing one of the typed Array types. At this point you are going to have to switch on something. You could type switch on the type of the Array itself:

switch arr := col.(type) {
case *array.Int64:
    // do stuff with arr
case *array.Int32:
    // do stuff with arr
case *array.String:
    // do stuff with arr
...
}

Alternatively, you could switch on the ID of the data type:

switch col.DataType().ID() {
case arrow.INT64:
    // type assertion needed col.(*array.Int64)
case arrow.INT32:
    // type assertion needed col.(*array.Int32)
...
}

For getting the data out of the array, primitive types which are stored contiguously tend to have a *Values method which will return a slice of the type. For example array.Int64 has Int64Values() which returns []int64. Otherwise, all of the types have .Value(int) methods which return the value at a particular index as you showed in your example.

Hope this helps!

CodePudding user response:

  1. Make sure you use v9 (import "github.com/apache/arrow/go/v9/arrow"), because it implements json.Marshaler (from go-json)
  2. Use "github.com/goccy/go-json" for Marshaler (because of this)

Then you can use a TableReader to marshal each column to JSON and unmarshal it into a []any.

In your example, it might look like this:

import (
    "fmt"

    "github.com/apache/arrow/go/v9/arrow"
    "github.com/apache/arrow/go/v9/arrow/array"
    "github.com/apache/arrow/go/v9/arrow/memory"
    "github.com/goccy/go-json"
)

    ...
    tr := array.NewTableReader(table, 6)
    defer tr.Release()
    // fmt.Printf("table.NumRows() = %+v\n", table.NumRows())
    // fmt.Printf("table.NumCols() = %+v\n", table.NumCols())

    // keySlice keeps the column names in the same order as the data source
    keySlice := make([]string, 0, table.NumCols())

    res := make(map[string][]any)
    var key string
    for tr.Next() {
        rec := tr.Record()

        for i, col := range rec.Columns() {
            key = rec.ColumnName(i)
            if res[key] == nil {
                res[key] = make([]any, 0)
                keySlice = append(keySlice, key)
            }
            var tmp []any
            b2, err := json.Marshal(col)
            if err != nil {
                panic(err)
            }
            err = json.Unmarshal(b2, &tmp)
            if err != nil {
                panic(err)
            }
            // fmt.Printf("key = %s\n", key)
            // fmt.Printf("tmp = %+v\n", tmp)
            res[key] = append(res[key], tmp...)
        }
    }

    fmt.Println("res", res)
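
One caveat with the JSON round trip is worth illustrating: when JSON numbers are decoded into []any, they arrive as float64, so int64 values above 2^53 can lose precision. The sketch below uses the standard library's encoding/json as a stand-in, on the assumption that a marshalled column is a plain JSON array. goccy/go-json is intended as a drop-in replacement, so similar behaviour is expected but was not verified here. The decodeColumn helper is hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// decodeColumn (hypothetical helper) decodes a JSON array into []any.
// With exact set, numbers are kept as json.Number instead of float64.
func decodeColumn(data []byte, exact bool) ([]any, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	if exact {
		dec.UseNumber()
	}
	var out []any
	if err := dec.Decode(&out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// 9007199254740993 is 2^53 + 1, which float64 cannot represent exactly.
	raw := []byte(`[9007199254740993, null, 42]`)
	for _, exact := range []bool{false, true} {
		vals, err := decodeColumn(raw, exact)
		if err != nil {
			panic(err)
		}
		fmt.Printf("exact=%v first=%v (%T)\n", exact, vals[0], vals[0])
	}
}
```

If exact integers matter, decoding with UseNumber or reading the typed Arrow arrays directly avoids the float64 conversion.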
