I have a pandas DataFrame like this:
lat lon value
10 10 1
The DataFrame has 7 million data points. I want to convert it to an array so that I can finally write it to a netCDF file. There are two ways I have tried this:
1. Convert the DataFrame to a point shapefile using GDAL and then the shapefile to a raster using QGIS. This takes barely 3-4 minutes (8-core M1 processor), but there is a minor loss of information.
2. Convert the pandas DataFrame to an array and write the array to a .nc file. By my estimate this will take about 120 hours on an 18-core Intel CPU on a supercomputer (the code is parallelized using joblib).
The code looks something like this:
import numpy as np
from joblib import Parallel, delayed

lati = np.round(np.linspace(np.min(df.lat), np.max(df.lat), lat_range + 1), 2)
loni = np.round(np.linspace(np.min(df.lon), np.max(df.lon), lon_range + 1), 2)
target_column = 'soil_moisture'
search_columns = ['lat', 'lon']
df_temp = df.set_index(search_columns)

def func(i, j):
    latitude = lati[i]
    longitude = loni[j]
    value = df_temp.loc[(latitude, longitude), target_column]
    return value

results = Parallel(n_jobs=-1, verbose=2)(delayed(func)(i, j)
                                         for i in range(lat_range + 1)
                                         for j in range(lon_range + 1))
m = np.reshape(results, (lat_range + 1, lon_range + 1))
I have tested the code on a dummy dataset and it works fine, but on the original dataset it takes a lot of time.
CodePudding user response:
Without a data sample it's quite difficult to guess what kind of approach you can use. I made a sample for two cases:
a) your data in the table is organized (a regular grid), so you can use NumPy's reshape;
b) your data in the table is not organized, so you can interpolate it onto a regular grid.
#!/usr/bin/env ipython
import pandas as pd
import numpy as np
# -------------------------
# example with data at regular grid:
xx = np.linspace(0.,360,100);ddx = np.mean(np.diff(xx))
yy = np.linspace(-180.0,180.0,100);ddy = np.mean(np.diff(yy))
xm,ym = np.meshgrid(xx,yy);
zz = 50.0 + 10.0*np.random.random((np.size(yy),np.size(xx)));
data = {'lon':xm.flatten(),'lat':ym.flatten(),'data':zz.flatten()};
df = pd.DataFrame.from_dict(data);
# let us convert this data back to understandable form:
xo = np.unique(df['lon'].values);yo = np.unique(df['lat'].values);zo = df['data'].values;
zreg = np.reshape(zo,(np.size(yo),np.size(xo)));
print(zz == zreg);# is the original the same with the one from Pandas dataframe?
# =========================================================================================================
# ---------------------------------
# example with data randomly ordered, with irregular spacing
xcoords = xm.flatten() + ddx/2*np.random.random(np.size(zz.flatten())) # original coords + some small noise (half the cell)
ycoords = ym.flatten() + ddy/2*np.random.random(np.size(zz.flatten())) # original coords + some small noise (half the cell)
points = np.concatenate((xcoords[:,np.newaxis],ycoords[:,np.newaxis],zz.flatten()[:,np.newaxis]),axis=1);
points = points[points[:, 2].argsort()] # let us sort points by values
data = {'lon':points[:,0],'lat':points[:,1],'data':points[:,2]};
# -----------------------------------------------------------------
df = pd.DataFrame.from_dict(data);
xp = df['lon'].values;yp = df['lat'].values;zp = df['data'].values
from scipy.interpolate import griddata
zo = griddata((xp,yp),zp,(xm,ym),'nearest'); # I would make some interpolation to regular grid...
print(zz == zo);
Of course, if you have 7 million points, you may need quite a lot of memory to hold the data. I was able to test my code with 2000x2000 and 3000x3000 points, but only on a machine with plenty of memory; my old laptop could only handle 1000x1000. In any case, with irregular data the interpolated value can sometimes differ from the original value, but in my view the difference is relatively small.
Writing the netCDF is really easy afterwards:
from netCDF4 import Dataset
with Dataset('test.nc','w',format='NETCDF3_CLASSIC') as ncout:
    ncout.createDimension('lon',np.size(xx));
    ncout.createDimension('lat',np.size(yy));
    xvar = ncout.createVariable('lon','float32',('lon',));xvar[:] = xx
    yvar = ncout.createVariable('lat','float32',('lat',));yvar[:] = yy
    zvar = ncout.createVariable('data','float32',('lat','lon'));zvar[:] = zo
CodePudding user response:
If df is like you describe, something like df.set_index(['lat', 'lon']).to_xarray() might do.
Here are some lines that work on my computer:
import pandas as pd
df = pd.DataFrame(data=[[10, 10, 0.1], [10, 15, 0.2], [15, 10, 0.3], [15, 15, 0.3]],
columns=['lon', 'lat', 'soil_moisture'])
df.set_index(['lat', 'lon']).to_xarray()
The result is a nice xarray.Dataset.
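From there, finishing the conversion to a .nc file is one more call. A sketch, assuming xarray plus a netCDF backend (netCDF4 or scipy) are installed; the output file name is arbitrary:

```python
import pandas as pd

# Same toy frame as above: long table with one row per (lon, lat) point
df = pd.DataFrame(data=[[10, 10, 0.1], [10, 15, 0.2], [15, 10, 0.3], [15, 15, 0.3]],
                  columns=['lon', 'lat', 'soil_moisture'])

# Pivot the long table onto a (lat, lon) grid, then convert to a Dataset
ds = df.set_index(['lat', 'lon']).to_xarray()

# Write the Dataset out as netCDF; NETCDF3_CLASSIC lets xarray fall back
# to the scipy backend when netCDF4 is not installed
ds.to_netcdf('soil_moisture.nc', format='NETCDF3_CLASSIC')
```

Missing (lat, lon) combinations simply come out as NaN in the gridded variable, which is usually what you want for a raster.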