string vs integer as map key for memory utilization in golang?-CodePudding

I have a below read function which is called by multiple go routines to read s3 files and it populates two concurrent map as shown below.

During server startup, it calls read function below to populate two concurrent map.
And also periodically every 30 seconds, it calls read function again to read new s3 files and populate two concurrent map again with some new data.

So basically at a given state of time during the whole lifecycle of this app, both my concurrent map have some data and also periodically being updated too.

func (r *clientRepository) read(file string, bucket string) error {
    var err error
    //... read s3 file

    for {
        rows, err := pr.ReadByNumber(r.cfg.RowsToRead)
        if err != nil {
            return errs.Wrap(err)
        }
        if len(rows) <= 0 {
            break
        }

        byteSlice, err := json.Marshal(rows)
        if err != nil {
            return errs.Wrap(err)
        }
        var productRows []ParquetData
        err = json.Unmarshal(byteSlice, &productRows)
        if err != nil {
            return errs.Wrap(err)
        }

        for i := range productRows {
            var flatProduct definitions.CustomerInfo
            err = r.ConvertData(spn, &productRows[i], &flatProduct)
            if err != nil {
                return errs.Wrap(err)
            }

            // populate first concurrent map here
            r.products.Set(strconv.FormatInt(flatProduct.ProductId, 10), &flatProduct)
            for _, catalogId := range flatProduct.Catalogs {
                strCatalogId := strconv.FormatInt(int64(catalogId), 10)
                // upsert second concurrent map here
                r.productCatalog.Upsert(strCatalogId, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
                    productID := newValue.(int64)
                    if valueInMap == nil {
                        return map[int64]struct{}{productID: {}}
                    }
                    oldIDs := valueInMap.(map[int64]struct{})
                    // value is irrelevant, no need to check if key exists
                    oldIDs[productID] = struct{}{}
                    return oldIDs
                })
            }
        }
    }
    return nil
}

In above code flatProduct.ProductId and strCatalogId are integer but I am converting them into string bcoz concurrent map works with string only. And then I have below three functions which is used by my main application threads to get data from the concurrent map populated above.

func (r *clientRepository) GetProductMap() *cmap.ConcurrentMap {
    return r.products
}

func (r *clientRepository) GetProductCatalogMap() *cmap.ConcurrentMap {
    return r.productCatalog
}

func (r *clientRepository) GetProductData(pid string) *definitions.CustomerInfo {
    pd, ok := r.products.Get(pid)
    if ok {
        return pd.(*definitions.CustomerInfo)
    }
    return nil
}

I have a use case where I need to populate map from multiple go routines and then read data from those maps from bunch of main application threads so it needs to be thread safe and it should be fast enough as well without much locking.

Problem Statement

I am dealing with lots of data like 30-40 GB worth of data from all these files which I am reading into memory. I am using concurrent map here which solves most of my concurrency issues but the key for the concurrent map is string and it doesn't have any implementation where key can be integer. In my case key is just a product id which can be int32 so is it worth it storing all those product id's as string in this concurrent map? I think string allocation takes more memory compare to storing all those keys as integer? At least it does in c/c so I am assuming it should be same case here in golang too.

Is there anything I can to improve here w.r.t map usage so that I can reduce memory utilization plus I don't lose performance as well while reading data from these maps from main threads?

I am using concurrent map from this repo which doesn't have implementation for key as integer.

CodePudding user response：

I don't know the implementation details of concurrent map in Go, but if it's using a string as a key I'm guessing that behind the scenes it's storing both the string and a hash of the string (which will be used for actual indexing operations).

That is going to be something of a memory hog, and there'll be nothing that can be done about that as concurrent map uses only strings for key.

If there were some sort of map that did use integers, it'd likely be using hashes of those integers anyway. A smooth hash distribution is a necessary feature for good and uniform lookup performance, in the event that key data itself is not uniformly distributed. It's almost like you need a very simple map implementation!

I'm wondering if a simple array would do, if your product ID's fit within 32bits (or can be munged to do so, or down to some other acceptable integer length). Yes, that way you'd have a large amount of memory allocated, possibly with large tracts unused. However, indexing is super-rapid, and the OS's virtual memory subsystem would ensure that areas of the array that you don't index aren't swapped in. Caveat - I'm thinking very much in terms of C and fixed-size objects here - less so Go - so this may be a bogus suggestion.

To persevere, so long as there's nothing about the array that implies initialisation-on-allocation (e.g. in C the array wouldn't get initialised by the compiler), allocation doesn't automatically mean it's all in memory, all at once, and only the most commonly used areas of the array will be in RAM courtesy of the OS's virtual memory subsystem.

EDIT

You could have a map of arrays, where each array covered a range of product Ids. This would be close to the same effect, trading off storage of hashes and strings against storage of null references. If product ids are clumped in some sort of structured way, this could work well.

Also, just a thought, and I'm showing a total lack of knowledge of Go here. Does Go store objects by reference? In which case wouldn't an array of objects actually be an array of references (so, fixed in size) and the actual objects allocated only as needed (ie a lot of the array is null references)? That doesn't sound good for my one big array suggestion...

CodePudding user response：

The library you use is relatively simple and you may just replace all string into int32 (and modify the hashing function) and it will still work fine.

I ran a tiny (and not that rigorous) benchmark against the replaced version:

$ go test -bench=. -benchtime=10x -benchmem
goos: linux
goarch: amd64
pkg: maps
BenchmarkCMapAlloc-4             10  174272711 ns/op  49009948 B/op    33873 allocs/op
BenchmarkCMapAllocSS-4           10  369259624 ns/op 102535456 B/op  1082125 allocs/op
BenchmarkCMapUpdateAlloc-4       10  114794162 ns/op         0 B/op        0 allocs/op
BenchmarkCMapUpdateAllocSS-4     10  192165246 ns/op  16777216 B/op  1048576 allocs/op
BenchmarkCMap-4                  10 1193068438 ns/op      5065 B/op       41 allocs/op
BenchmarkCMapSS-4                10 2195078437 ns/op 536874022 B/op 33554471 allocs/op

Benchmarks with a SS suffix is the original string version. So using integers as keys takes less memory and runs faster, as anyone would expect. The string version allocates about 50 bytes more each insertion. (This is not the actual memory usage though.)

Basically, a string in go is just a struct:

type stringStruct struct {
    str unsafe.Pointer
    len int
}

So on a 64-bit machine, it takes at least 8 bytes (pointer) 8 bytes (length) len(underlying bytes) bytes to store a string. Turning it into a int32 or int64 will definitely save memory. However, I assume that CustomerInfo and the catalog sets takes the most memory and I don't think there will be a great improvement.

(By the way, tuning the SHARD_COUNT in the library might also help a bit.)