Querying a large dataset in-browser using WebAssembly

Time:09-30

For argument's sake, let's say that a browser allows 4GB of memory in WebAssembly applications. Ignoring compression and other data-storage considerations, if a user had a 3GB local CSV file, we could query that data entirely in memory using WebAssembly (or JavaScript, of course). For example, suppose the user's data were in the following format:

ID Country Amount
1 US 12
2 GB 11
3 DE 7

Then in a few lines of code we could write a basic algorithm that filters to ID=2, i.e., the equivalent of the SQL query SELECT * FROM table WHERE id=2.
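
As a minimal sketch of that in-memory filter, assuming the whole file has already been read into a string and that fields are separated by whitespace as in the sample above (the function name filterById is made up for illustration):

```javascript
// Parse whitespace-separated rows (header on the first line) and keep the
// rows whose ID column equals the requested id.
function filterById(text, id) {
  const [header, ...rows] = text.trim().split("\n");
  const columns = header.trim().split(/\s+/);
  const idIndex = columns.indexOf("ID");
  return rows
    .map((row) => row.trim().split(/\s+/))
    .filter((fields) => Number(fields[idIndex]) === id);
}

// The sample table from the question.
const sample = "ID Country Amount\n1 US 12\n2 GB 11\n3 DE 7";
console.log(filterById(sample, 2));
```

This only works while the text fits in memory, which is exactly the limitation the rest of the question is about.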

Now, my question is whether it's possible in any browser (possibly with experimental flags and/or certain user preferences enabled) to run a query on a file that would not fit in memory, even if properly compressed. For example, in this blog post, a ~500GB file is loaded and then queried. I know that the 500GB of data is not loaded entirely into memory, and there's probably a column-oriented data structure so that only certain columns need to be read, but either way the OS has access to the file system, so files much larger than available memory can be used.

Is this possible in any way within a WebAssembly browser application? If so, what would be an outline of how it could be done? I know this question might require some research, so once it's eligible for a bounty I can add a 500-point bounty to encourage answers. (Note that the underlying language being used is C compiled to wasm, but I don't think that should matter for this question.)

I suppose one possibility might be something along the lines of: https://rreverser.com/webassembly-shell-with-a-real-filesystem-access-in-a-browser/.

CodePudding user response:

JavaScript File API

Studying the File API shows that when you read a file, the browser always hands you a Blob. This gives the impression that the whole file is fetched into RAM. The Blob also has a .stream() function that returns a ReadableStream over the very same Blob.

It turns out (at least in Chrome) that the Blob you are handed is virtual, and the underlying file is not loaded until it is requested. Neither slicing the file object nor instantiating a reader loads the entire file:

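const tail = file.slice(file.size - 100);               // Blob covering the last 100 bytes
const reader = file.stream().getReader();               // nothing is read yet
const head = (await reader.read()).value.slice(0, 100); // first 100 bytes of the first chunk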

Here is a test Sandbox and the source code

The example lets you select a file and displays its last 100 characters (using .slice()) and its first 100 characters (using the ReadableStream). Note that the stream does not have seek functionality.

I've tested this with files up to 10GB (the largest .csv I have lying around), and the browser consumed no noticeable additional RAM.

This answers the first part of the question. With the ability to stream a file (or access it in chunks) without consuming RAM, you can process an arbitrarily large file and search it for your content (with a binary search if the data is sorted by the key, otherwise with a table scan).
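
A table scan over chunks could look roughly like the sketch below. It assumes UTF-8 text with one row per line; the function name findRows and the default chunk size are illustrative choices, not part of any API. A File is a Blob, so a file picked from an input element can be passed in directly.

```javascript
// Scan a Blob line by line, reading one slice at a time so that only a
// single chunk (plus one partial line) is held in memory.
async function findRows(blob, predicate, chunkSize = 1 << 20) {
  const decoder = new TextDecoder("utf-8");
  const matches = [];
  let carry = ""; // partial line left over from the previous chunk

  for (let offset = 0; offset < blob.size; offset += chunkSize) {
    const bytes = await blob.slice(offset, offset + chunkSize).arrayBuffer();
    // stream: true keeps multi-byte characters that straddle two chunks intact
    const text = carry + decoder.decode(bytes, { stream: true });
    const lines = text.split("\n");
    carry = lines.pop(); // the last piece may be an incomplete line
    for (const line of lines) {
      if (predicate(line)) matches.push(line);
    }
  }

  carry += decoder.decode(); // flush any bytes still buffered in the decoder
  if (carry !== "" && predicate(carry)) matches.push(carry);
  return matches;
}
```

For the sample table, `await findRows(file, (line) => line.startsWith("2 "))` would collect the row with ID 2. Matching rows are accumulated in an array, so this suits selective queries; a query that matches most of the file would need to process rows as they are found instead.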

WebAssembly

In Rust with stdweb, File has no .read() function (so the content cannot be streamed). It does, however, have a .slice() function that slices the underlying blob (the same as in JavaScript). This is a minimal working example:

#[macro_use]
extern crate stdweb;

use stdweb::js_export;

use std::convert::From;
use stdweb::web::IBlob;
use stdweb::web::File;
use stdweb::web::FileReader;
use stdweb::web::FileReaderResult;

#[js_export]
fn read_file(file: File) {
    let blob = file.slice(..2048);
    let len = stdweb::Number::from(blob.len() as f64);

    js! {
        var _len = @{len};
        console.log("length=" + _len);
        var _blob = @{blob};
        console.log(_blob);
    }
}

fn main() {
}

The HTML page that loads the compiled binding:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>WASM</title>
</head>
<body>
    <input type="file" id="field" />

    <script src="the_compiled_wasm_binding.js"></script>
    <script>
        async function onChange(e) {
            const files = e.target.files;
            if (files.length === 0) return;
            const file = files[0];

            // Slice
            Rust.the_compiled_wasm_binding.then(module => {
                module.read_file(file);
            })
        }

        document.getElementById("field").onchange = onChange;
    </script>
</body>
</html>

The .slice() function behaves the same as in JavaScript (the entire file is NOT loaded into RAM), so you can load chunks of the file in WASM and perform a search.

Note that stdweb's implementation of slice() uses slice_blob(), which internally performs:

js! (
    return @{reference}.slice(@{start}, @{end}, @{content_type});
).try_into().unwrap()

As you can see, it calls JavaScript under the hood, so there is no optimization here.

Conclusions

IMHO the file reading is more effectively implemented in JavaScript because:

  • the stdweb::File API uses raw JavaScript under the hood (hence it is not faster)
  • stdweb::File has fewer features than its JavaScript counterpart (no streaming, and a few other functions are missing).

The search algorithm itself, however, could (and probably should) be implemented in WASM. It can be handed a chunk (Blob) directly for processing.
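
One possible shape for that hand-off is sketched below: JavaScript reads each chunk, copies its bytes into WebAssembly linear memory, and passes a pointer and a length to the search routine. No compiled module is used here, so countNewlines is a plain JavaScript stand-in for a hypothetical wasm export with the signature (ptr, len) -> i32, and writing every chunk at offset 0 is a simplifying assumption (a real module would normally expose its own memory and an allocator).

```javascript
// One 64 KiB page of linear memory; a real module would export its own.
const memory = new WebAssembly.Memory({ initial: 1 });

// Stand-in for a hypothetical wasm export: counts newline bytes, reading
// only from linear memory, as compiled code would.
function countNewlines(ptr, len) {
  const view = new Uint8Array(memory.buffer, ptr, len);
  let count = 0;
  for (const byte of view) {
    if (byte === 10) count++;
  }
  return count;
}

// Copy each chunk into linear memory and hand over pointer + length.
// chunkSize must not exceed the memory size (65536 bytes here).
async function countLines(blob, chunkSize = 32 * 1024) {
  let total = 0;
  for (let offset = 0; offset < blob.size; offset += chunkSize) {
    const buffer = await blob.slice(offset, offset + chunkSize).arrayBuffer();
    const bytes = new Uint8Array(buffer);
    new Uint8Array(memory.buffer).set(bytes, 0);
    total += countNewlines(0, bytes.length);
  }
  return total;
}
```

With a real module, countNewlines would be replaced by the exported function and memory by the module's exported memory; the chunking loop stays the same.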
