In my project I work with word vectors as numpy arrays with a dimension of 300. I want to store the processed arrays in a mongo database, base64 encoded, because this saves a lot of storage space.
Python code
import base64
import numpy as np
vector = np.zeros(300, dtype=np.float32) # represents some word-vector
vector = base64.b64encode(vector) # base64 encoding
# Saving vector to MongoDB...
In MongoDB it is saved in as binary like this. In C I would like to load this binary data as a std::vector. Therefore I have to decode the data first and then load it correctly. I was able to get the binary data into the c program with mongocxx and had it as a uint8_t* with a size of 1600 - but now I don't know what to do and would be happy if someone could help me. Thank you (:
C Code
const bsoncxx::document::element elem_vectors = doc["vectors"];
const bsoncxx::types::b_binary vectors = elemVectors.get_binary();
const uint32_t b_size = vectors.size; // == 1600
const uint8_t* first = vectors.bytes;
// How To parse this as a std::vector<float> with a size of 300?
Solution
I added these lines to my C code and was able to load a vector with 300 elements and all correct values.
const std::string encoded(reinterpret_cast<const char*>(first), b_size);
std::string decoded = decodeBase64(encoded);
std::vector<float> vec(300);
for (size_t i = 0; i < decoded.size() / sizeof(float); i) {
vec[i] = *(reinterpret_cast<const float*>(decoded.c_str() i * sizeof(float)));
}
To mention: Thanks to @Holt's info, it is not wise to encode a Numpy array base64 and then store it as binary. Much better to call ".to_bytes()" on the numpy array and then store that in MongoDB, because it reduces the document size from 1.7kb (base64) to 1.2kb (to_bytes()) and then saves computation time because the encoding (and decoding!) doesn't have to be computed!
CodePudding user response:
First, you can't save the storage space by using base64 encoding. On the contrary, it will waste your storage. For an array with 300 floats, the storage is only 300 * 4 = 1200bytes. While after you encode it, the storage will be 1600 bytes! See more about base64 here.
Second, you want to parse the bytes into a vector<float>
. You need to decode the bytes if you still use the base64 encoding. I suggest you use some third-party library or try this question. Suppose you already have the decode function.
std::string base64_decode(std::string const& encoded_string); // or something like that.
const std::string encoded(first, b_size);
std::string decoded = base64_decode(encoded);
std::vector<float> vec(300);
for (size_t i = 0; i < decode.size() / sizeof(float); i) {
vec[i] = *(decoded.c_str() i * sizeof(float));
}