Loading a FastText Model from s3 without Saving Locally

Time:10-08

I am looking to use a FastText model in an ML pipeline. I trained the model myself and saved it as a .bin file on S3. My hope is to keep everything in a cloud-based pipeline, so I want to avoid local files, but I can't figure out how to make a temporary .bin file. I am also not sure whether I am saving and reading the FastText model in the most efficient way. The code below works, but it saves the file locally, which is what I want to avoid.

import fasttext
import smart_open

# Stream the raw model bytes from S3
with smart_open.smart_open("s3://<bucket>/<key of .bin model>") as fin:
    model_bytes = fin.read()

# Write them to a local file so fasttext can load it -- this is the step I want to avoid
with open("ml_model.bin", "wb") as binary_file:
    binary_file.write(model_bytes)

model = fasttext.load_model("ml_model.bin")

CodePudding user response:

If you want to use the fasttext wrapper for the official Facebook FastText code, you may need to create a local temporary copy; your troubles suggest that code relies on opening a local file path.

You could also try the Gensim package's separate FastText support, which should accept an S3 path via its load_facebook_model() function:

https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model

(Note, though, that Gensim doesn't support all FastText functionality, like the supervised mode.)

CodePudding user response:

As the above response partially answered, a temporary file was needed. On top of that, the temporary file's path had to be passed to load_model() as a plain string, which is sort of strange -- fasttext rejects a pathlib.Path object. Working code below:

import tempfile
import fasttext
import smart_open
from pathlib import Path

# Stream the raw model bytes from S3
with smart_open.smart_open(f's3://{bucket_name}/{key}') as fin:
    model_bytes = fin.read()

# Write the bytes to a temporary file, since fasttext can only load from a path;
# the directory (and the file in it) is deleted when the with-block exits
with tempfile.TemporaryDirectory() as tdir:
    tfile = Path(tdir).joinpath('tempfile.bin')
    tfile.write_bytes(model_bytes)
    # load_model() requires a str, not a Path, hence str(tfile)
    model = fasttext.load_model(str(tfile))