I have larger-than-memory uniform (regularly gridded) 2d binary data which I am trying to interactively plot using any combination of Dask, Datashader and Holoviews. I am open to using other python-based tools, but the internet has led me to these ones for now.
The data files are ~11 GB and consist of a (600000, 4800) array of float32s.
I want to plot them on a different aspect ratio (1000x1000 px), and have a callback handle the dataloading/shading on zoom/pan. I am serving to a browser instead of using notebooks.
Within a 1000x1000px datashader canvas I have plotted:
- 4800x4800 points (which filled the canvas)
- 600000x4800 points (which filled only the bottom few pixels of the canvas, since the colored pixels had an aspect ratio of 600000/4800)
Neither was interactive.
What I have so far, using Python 3.10, is:
import numpy as np
import datashader as ds
from datashader import transfer_functions as tf
import xarray as xr
import holoviews as hv
import panel as pn
hv.extension('bokeh', logo=False)
hv.output(backend="bokeh")
filename = 'path/to/binary/datafile'
arr = np.memmap(filename, shape=(4800,600000), offset=0, dtype=np.dtype("f4"), mode='r')
arr = xr.DataArray(arr, dims=("x", "y"), coords={'x': np.arange(4800), "y": np.arange(600000)})
cvs = ds.Canvas(plot_width=1000, plot_height=1000, x_range=(0, 4800), y_range=(0, 4800))
# the following line works too but does not fill the canvas
# cvs = ds.Canvas(plot_width=1000, plot_height=1000, x_range=(0, 4800), y_range=(0, 600000))
agg = cvs.raster(arr)
sh = tf.shade(agg)
pn.Row(sh).show()
Any advice is appreciated!
CodePudding user response:
I'm not sure precisely what the ask is here, but the HoloViz way of approaching this problem would be to use dask without .persist() or .compute(), so that the array stays lazy and only the chunks needed for the current view are read. The np.memmap approach may also work.
And then you'd use holoviews as described at https://examples.pyviz.org/census/census.html, or hvplot as described at https://hvplot.holoviz.org . Without having the actual data or a synthesized version of it it's hard to be more specific than that.
BTW, I think you have x and y switched in your x_range and y_range above: a NumPy shape of (4800, 600000) corresponds to a y_range of (0, 4800) and an x_range of (0, 600000), because NumPy shapes are (row, column), and rows map to y while columns map to x.
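To make the row/column convention and the "only read what the view needs" idea concrete, here is a minimal NumPy-only sketch. The helper name `load_window` and the small synthetic file are assumptions for illustration, not part of Datashader or HoloViews. It decimates by striding rather than aggregating, so it is a rough stand-in for what a proper rasterizing step would do on each zoom/pan.

```python
import math
import os
import tempfile

import numpy as np


def load_window(arr, x_range, y_range, width=1000, height=1000):
    """Return at most (height, width) samples of `arr` inside the given ranges.

    `arr` is indexed [row, column], i.e. [y, x]. Hypothetical helper.
    """
    nrows, ncols = arr.shape
    # Clamp the requested data ranges to the array bounds.
    x0 = max(0, math.floor(x_range[0]))
    x1 = min(ncols, math.ceil(x_range[1]))
    y0 = max(0, math.floor(y_range[0]))
    y1 = min(nrows, math.ceil(y_range[1]))
    if x1 <= x0 or y1 <= y0:
        return np.empty((0, 0), dtype=arr.dtype)
    # Pick strides so the result never exceeds height x width samples.
    ystep = max(1, math.ceil((y1 - y0) / height))
    xstep = max(1, math.ceil((x1 - x0) / width))
    # With a memmap, only the touched pages are read from disk.
    return np.asarray(arr[y0:y1:ystep, x0:x1:xstep])


# Small synthetic stand-in for the real file: 600 rows (y) by 48 columns (x).
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
np.arange(600 * 48, dtype="f4").reshape(600, 48).tofile(path)
demo = np.memmap(path, dtype="f4", mode="r", shape=(600, 48))

# Full extent squeezed into a 10 x 10 "canvas": overview.shape == (10, 10)
overview = load_window(demo, x_range=(0, 48), y_range=(0, 600), width=10, height=10)
```

A function like this could be called from whatever zoom/pan callback the plotting layer provides (in HoloViews, I believe a DynamicMap driven by a RangeXY stream passes x_range and y_range in this form), with the real shape (600000, 4800) in place of the demo shape.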
CodePudding user response:
def: "external" means "out-of-core storage", typically on disk or SSD, e.g. "external sort".
The usual approach to such external data retrieval tasks is to throw it in an RDBMS or similar datastore, and make it the database's problem rather than yours.
There are at least two interesting problems going on here.
- large 2D dataset
- render in different aspect ratio than original data
Consider these approaches.
- Favor postgres btree gist over a one-dimensional index
- Consider pre-processing the original data into one or more additional aspect ratios that will be more suitable for the rendering requirements.
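As a rough sketch of the pre-processing idea, the long axis can be averaged in blocks so that a square overview exists before any plotting happens. For the shape in the question, 600000 / 4800 = 125, so averaging every 125 rows gives a 4800 x 4800 float32 array of about 92 MB. The function name `build_overview` and its parameters are hypothetical, and block-averaging is just one possible reduction.

```python
import numpy as np


def build_overview(src, factor, chunk_out_rows=64):
    """Average every `factor` consecutive rows of `src` into one output row.

    Reads `src` in chunks so a memmap never has to be loaded whole.
    Trailing rows that do not fill a complete block are dropped.
    """
    nrows, ncols = src.shape
    out_rows = nrows // factor
    out = np.empty((out_rows, ncols), dtype=np.float32)
    for start in range(0, out_rows, chunk_out_rows):
        stop = min(start + chunk_out_rows, out_rows)
        # Pull one chunk of source rows into memory.
        chunk = np.asarray(src[start * factor:stop * factor])
        # Group rows into blocks of `factor` and average within each block.
        out[start:stop] = chunk.reshape(stop - start, factor, ncols).mean(axis=1)
    return out


# Tiny demonstration: 12 rows reduced by a factor of 3 gives 4 rows.
small = np.arange(24, dtype="f4").reshape(12, 2)
reduced = build_overview(small, factor=3, chunk_out_rows=3)
```

For the real file, `src` would be the np.memmap of shape (600000, 4800) and `factor` would be 125. The overview could serve the zoomed-out view, with the full-resolution data consulted only once the visible window is small enough.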
There is very nice PostGIS support for large 2D datasets.
As far as interactive display goes? Well, streamlit is better at "early prototype" than "polished production app", but you should certainly give it a whirl for your current set of requirements. Very easy to get started. Limitations (by design!) on just how fancy you can get with it.
The st.cache support may address some of your "interactive" requirements, decoupling data retrieval from interaction.
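The decoupling itself can be illustrated without Streamlit. The sketch below uses the standard library's functools.lru_cache as a stand-in for st.cache (it is not the Streamlit API), and the tile size and function names are assumptions. The point is only that repeated or overlapping views reuse tiles that were already fetched.

```python
import math
from functools import lru_cache

TILE = 256  # assumed tile edge length, in data samples


@lru_cache(maxsize=128)
def load_tile(tile_row, tile_col):
    """Stand-in for a slow read; returns the tile's (y0, y1, x0, x1) bounds."""
    y0, x0 = tile_row * TILE, tile_col * TILE
    return (y0, y0 + TILE, x0, x0 + TILE)


def tiles_for_view(x_range, y_range):
    """Fetch every tile overlapping the view, reusing cached tiles."""
    first_col = math.floor(x_range[0]) // TILE
    last_col = (math.ceil(x_range[1]) - 1) // TILE
    first_row = math.floor(y_range[0]) // TILE
    last_row = (math.ceil(y_range[1]) - 1) // TILE
    return [
        load_tile(r, c)
        for r in range(first_row, last_row + 1)
        for c in range(first_col, last_col + 1)
    ]


tiles_for_view((0, 512), (0, 256))    # two tiles, both cache misses
tiles_for_view((0, 512), (0, 256))    # same view again, served from cache
tiles_for_view((256, 768), (0, 256))  # pan right: one cached tile, one new
```

In a real app the body of `load_tile` would slice the memmap or query the datastore, and `load_tile.cache_info()` shows how many fetches were avoided.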
When you've implemented something that addresses these needs to your satisfaction, please describe it here.