How to decode (from base64) a python np-array and reload it in c as a vector of floats?

Time:12-20

In my project I work with word vectors as numpy arrays with a dimension of 300. I want to store the processed arrays in a mongo database, base64 encoded, because this saves a lot of storage space.

Python code

import base64
import numpy as np

vector = np.zeros(300, dtype=np.float32) # represents some word-vector
vector = base64.b64encode(vector) # base64 encoding
# Saving vector to MongoDB...

In MongoDB it is saved as binary data. In C++ I would like to load this binary data as a std::vector<float>. To do that I have to decode the data first and then load it correctly. I was able to get the binary data into the C++ program with mongocxx and have it as a uint8_t* with a size of 1600, but now I don't know what to do and would be happy if someone could help me. Thank you (:

C++ code

const bsoncxx::document::element elem_vectors = doc["vectors"];
const bsoncxx::types::b_binary vectors = elem_vectors.get_binary();

const uint32_t b_size = vectors.size; // == 1600
const uint8_t* first = vectors.bytes;

// How to parse this as a std::vector<float> with a size of 300?

Solution

I added these lines to my C++ code and was able to load a vector with 300 elements and all correct values.

    const std::string encoded(reinterpret_cast<const char*>(first), b_size);
    std::string decoded = decodeBase64(encoded);
    std::vector<float> vec(300);
    for (size_t i = 0; i < decoded.size() / sizeof(float); ++i) {
        vec[i] = *(reinterpret_cast<const float*>(decoded.c_str() + i * sizeof(float)));
    }

Note: thanks to @Holt's info, I learned that it is not wise to base64-encode a NumPy array and then store it as binary. It is much better to call ".tobytes()" on the NumPy array and store the result in MongoDB: this reduces the document size from 1.7 kB (base64) to 1.2 kB (tobytes()) and also saves computation time, because the encoding (and decoding!) doesn't have to be computed.
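If the array is stored as raw bytes without base64, the C++ side no longer needs a decoder. A minimal sketch of that path, assuming the stored bytes are float32 values in the same byte order as the reading machine; floats_from_bytes is a hypothetical helper name, not part of mongocxx:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of mongocxx): copies raw float32 bytes
// into a std::vector<float>.
// Assumes writer and reader use the same byte order and IEEE 754 floats.
std::vector<float> floats_from_bytes(const std::uint8_t* bytes, std::size_t size) {
    if (size % sizeof(float) != 0) {
        throw std::invalid_argument("byte count is not a multiple of sizeof(float)");
    }
    std::vector<float> out(size / sizeof(float));
    if (size != 0) {
        // memcpy avoids the alignment and aliasing problems of casting the pointer.
        std::memcpy(out.data(), bytes, size);
    }
    return out;
}
```

With the variables from the question this would be called as floats_from_bytes(vectors.bytes, vectors.size), which should yield 300 elements for a 1200-byte field.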

Answer:

First, you can't save storage space by using base64 encoding. On the contrary, it wastes storage. An array of 300 floats takes only 300 * 4 = 1200 bytes, but after you encode it, it takes 1600 bytes, because base64 represents every 3 bytes of input as 4 characters of output.
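The size arithmetic can be checked with a few lines. This is only an illustrative sketch; base64_encoded_size is a hypothetical helper name, and it assumes the usual padded base64 output:

```cpp
#include <cstddef>

// Length of padded base64 text for n input bytes:
// every group of up to 3 bytes becomes 4 characters.
constexpr std::size_t base64_encoded_size(std::size_t n) {
    return 4 * ((n + 2) / 3);
}

// 300 float32 values occupy 1200 bytes, which encode to 1600 characters.
static_assert(base64_encoded_size(300 * 4) == 1600, "1200 bytes -> 1600 chars");
```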

Second, you want to parse the bytes into a vector<float>. If you keep the base64 encoding, you need to decode the bytes first. I suggest using a third-party library or an implementation from an existing question on base64 decoding. Suppose you already have the decode function:

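std::string base64_decode(std::string const& encoded_string); // or something like that

// `first` is a uint8_t*, so cast it before constructing the std::string.
const std::string encoded(reinterpret_cast<const char*>(first), b_size);
std::string decoded = base64_decode(encoded);
std::vector<float> vec(300);
for (size_t i = 0; i < decoded.size() / sizeof(float); ++i) {
    // Copy each group of 4 bytes into one float (needs <cstring>).
    std::memcpy(&vec[i], decoded.data() + i * sizeof(float), sizeof(float));
}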
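For completeness, here is one way the assumed base64_decode function could look. It is a minimal sketch, not a tested library replacement: it handles only the standard alphabet with '=' padding, treats any other character (including whitespace) as an error, and stops at the first padding character.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the base64_decode function assumed above.
// Handles the standard alphabet (A-Z, a-z, 0-9, '+', '/') with '=' padding.
std::string base64_decode(std::string const& encoded) {
    auto value_of = [](unsigned char c) -> int {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1;  // not a base64 character
    };

    std::string out;
    out.reserve(encoded.size() / 4 * 3);
    std::uint32_t buffer = 0;  // accumulates 6 bits per input character
    int bits = 0;              // number of valid bits currently in buffer
    for (unsigned char c : encoded) {
        if (c == '=') break;   // padding marks the end of the data
        const int v = value_of(c);
        if (v < 0) throw std::invalid_argument("invalid base64 character");
        buffer = (buffer << 6) | static_cast<std::uint32_t>(v);
        bits += 6;
        if (bits >= 8) {
            // A full byte is available: emit the top 8 valid bits.
            bits -= 8;
            out.push_back(static_cast<char>((buffer >> bits) & 0xFF));
        }
    }
    return out;
}
```

A maintained library is still the safer choice for production code, since this sketch does not validate padding length or reject trailing data after '='.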