I want to execute the following lines of python code in Polars as a UDF:
w = wkt.loads('POLYGON((-160.043334960938 70.6363054807905, -160.037841796875 70.6363054807905, -160.037841796875 70.6344840663086, -160.043334960938 70.6344840663086, -160.043334960938 70.6363054807905))')
polygon (optionally including holes).
j = shapely.geometry.mapping(w)
h3.polyfill(j, res=10, geo_json_conformant=True)
In pandas/geopandas:
import pandas as pd
import geopandas as gpd
import polars as pl
from shapely import wkt
pandas_df = pd.DataFrame({'quadkey': {0: '0022133222330023',
1: '0022133222330031',
2: '0022133222330100'},
'tile': {0: 'POLYGON((-160.043334960938 70.6363054807905, -160.037841796875 70.6363054807905, -160.037841796875 70.6344840663086, -160.043334960938 70.6344840663086, -160.043334960938 70.6363054807905))',
1: 'POLYGON((-160.032348632812 70.6381267305321, -160.02685546875 70.6381267305321, -160.02685546875 70.6363054807905, -160.032348632812 70.6363054807905, -160.032348632812 70.6381267305321))',
2: 'POLYGON((-160.02685546875 70.6417687358462, -160.021362304688 70.6417687358462, -160.021362304688 70.6399478155463, -160.02685546875 70.6399478155463, -160.02685546875 70.6417687358462))'},
'avg_d_kbps': {0: 15600, 1: 6790, 2: 9619},
'avg_u_kbps': {0: 14609, 1: 22363, 2: 15757},
'avg_lat_ms': {0: 168, 1: 68, 2: 92},
'tests': {0: 2, 1: 1, 2: 6},
'devices': {0: 1, 1: 1, 2: 1}}
)
# display(pandas_df)
gdf = pandas_df.copy()
gdf['geometry'] = gpd.GeoSeries.from_wkt(pandas_df['tile'])
import h3pandas
display(gdf.h3.polyfill_resample(10))
This works super quickly and easily. However, the polyfill function called from pandas apply as a UDF is too slow for the size of my dataset.
Instead, I would love to use polars but I run into several issues:
geo type is not understood
rying to move to polars for better performance
pl.from_pandas(gdf)
fails with: ArrowTypeError: Did not pass numpy.dtype object
it looks like geoarrow / geoparquet is not supported by polars
numpy vectorized polars interface fails with missing geometry types
polars_df = pl.from_pandas(pandas_df)
out = polars_df.select(
[
gpd.GeoSeries.from_wkt(pl.col('tile')),
]
)
fails with:
TypeError: 'data' should be array of geometry objects. Use from_shapely, from_wkb, from_wkt functions to construct a GeometryArray.
all by hand
polars_df.with_column(pl.col('tile').map(lambda x: h3.polyfill(shapely.geometry.mapping(wkt.loads(x)), res=10, geo_json_conformant=True)).alias('geometry'))
fails with:
Conversion of polars data type Utf8 to C-type not implemented.
this last option seems to be the most promising one (no special geospatial-type errors). But this generic error message of strings/Utf8 type for C not being implemented sounds very strange to me.
Furthermore:
polars_df.select(pl.col('tile').apply(lambda x: h3.polyfill(shapely.geometry.mapping(wkt.loads(x)), res=10, geo_json_conformant=True)))
works - but is lacking the other columns - i.e. syntax to manually select these is inconvenient. Though this is also failing when appending a:
.explode('tile').collect()
# InvalidOperationError: cannot explode dtype: Object("object")
CodePudding user response:
Polars benefits come from the fact that the data is stored in type inflexible arrow2 arrays and that the computations are optimized (similar to a db query optimizer) and run in rust. Any time you use apply
in polars then you lose the benefit of polars because it's sending the function back to python instead of it being optimized and computed in rust. Polars doesn't support anything that looks like a shapely geometry nor does it have geospatial geometry operators built in.
There is geopolars which is in its absolute infancy and probably not particularly usable yet. Long story short, polars (in python) isn't able to help for geo computations yet.
Back to your original problem, what if instead of gdf.h3.polyfill_resample(10)
you do something like (emphasis on something like as I'm not really familiar with h3
offhand)
from shapely.geometry import mapping
h3.polyfill(mapping(gdf.geometry))
You might need to explore the dict that mapping(gdf.geometry)
returns to get something conforming to the geojson input that h3 uses.
The idea being that feeding the entire input directly to h3 will (at least potentially, I'm not aware of its internals) allow it to do all the looping and iterating in C rather than having the looping in python which is what you get with an apply
.
CodePudding user response:
To address some of your polars errors:
The wkt
functions can't handle a pl.Series
- you can use .to_numpy()
to provide a numpy array instead:
gpd.GeoSeries.from_wkt(polars_df.get_column("tile").to_numpy())
0 POLYGON ((-160.04333 70.63631, -160.03784 70.6...
1 POLYGON ((-160.03235 70.63813, -160.02686 70.6...
2 POLYGON ((-160.02686 70.64177, -160.02136 70.6...
dtype: geometry
You can use .with_columns()
instead of .select()
:
polars_df.with_columns(pl.col('tile').apply(lambda x: h3.polyfill(shapely.geometry.mapping(wkt.loads(x)), res=10, geo_json_conformant=True)))
shape: (3, 7)
┌──────────────────┬─────────────────────────────────────┬────────────┬────────────┬────────────┬───────┬─────────┐
│ quadkey | tile | avg_d_kbps | avg_u_kbps | avg_lat_ms | tests | devices │
│ --- | --- | --- | --- | --- | --- | --- │
│ str | object | i64 | i64 | i64 | i64 | i64 │
╞══════════════════╪═════════════════════════════════════╪════════════╪════════════╪════════════╪═══════╪═════════╡
│ 0022133222330023 | {'8a0d1c1306a7fff', '8a0d1c1306b... | 15600 | 14609 | 168 | 2 | 1 │
│ 0022133222330031 | {'8a0d1c130757fff', '8a0d1c13062... | 6790 | 22363 | 68 | 1 | 1 │
│ 0022133222330100 | {'8a0d1c1300d7fff', '8a0d1c1300c... | 9619 | 15757 | 92 | 6 | 1 │
└──────────────────┴─────────────────────────────────────┴────────────┴────────────┴────────────┴───────┴─────────┘
h3.polyfill()
is returning a python set object which polars does not really "recognize" as it stands.
You can convert the set to list()
and polars will give you a list[str]
column instead of object
- which you can .explode()
without error.
polars_df.with_columns(
pl.col('tile').apply(lambda x: list(h3.polyfill(shapely.geometry.mapping(wkt.loads(x)), res=10, geo_json_conformant=True)))
.alias('h3_polyfill')
).explode('h3_polyfill')
shape: (9, 8)
┌──────────────────┬─────────────────────────────────────┬────────────┬────────────┬────────────┬───────┬─────────┬─────────────────┐
│ quadkey | tile | avg_d_kbps | avg_u_kbps | avg_lat_ms | tests | devices | h3_polyfill │
│ --- | --- | --- | --- | --- | --- | --- | --- │
│ str | str | i64 | i64 | i64 | i64 | i64 | str │
╞══════════════════╪═════════════════════════════════════╪════════════╪════════════╪════════════╪═══════╪═════════╪═════════════════╡
│ 0022133222330023 | POLYGON((-160.043334960938 70.63... | 15600 | 14609 | 168 | 2 | 1 | 8a0d1c1306a7fff │
│ 0022133222330023 | POLYGON((-160.043334960938 70.63... | 15600 | 14609 | 168 | 2 | 1 | 8a0d1c1306b7fff │
│ 0022133222330023 | POLYGON((-160.043334960938 70.63... | 15600 | 14609 | 168 | 2 | 1 | 8a0d1c13079ffff │
│ 0022133222330031 | POLYGON((-160.032348632812 70.63... | 6790 | 22363 | 68 | 1 | 1 | 8a0d1c130757fff │
│ 0022133222330031 | POLYGON((-160.032348632812 70.63... | 6790 | 22363 | 68 | 1 | 1 | 8a0d1c130627fff │
│ 0022133222330031 | POLYGON((-160.032348632812 70.63... | 6790 | 22363 | 68 | 1 | 1 | 8a0d1c13070ffff │
│ 0022133222330100 | POLYGON((-160.02685546875 70.641... | 9619 | 15757 | 92 | 6 | 1 | 8a0d1c1300d7fff │
│ 0022133222330100 | POLYGON((-160.02685546875 70.641... | 9619 | 15757 | 92 | 6 | 1 | 8a0d1c1300c7fff │
│ 0022133222330100 | POLYGON((-160.02685546875 70.641... | 9619 | 15757 | 92 | 6 | 1 | 8a0d1c1300f7fff │
└──────────────────┴─────────────────────────────────────┴────────────┴────────────┴────────────┴───────┴─────────┴─────────────────┘
There's probably not much difference compared to doing the "all by hand" approach in pandas:
pandas_df['geometry'] = wkt.loads(pandas_df['tile'])
pandas_df = pandas_df.assign(
h3_polyfill=pandas_df['geometry'].map(lambda tile: h3.polyfill(shapely.geometry.mapping(tile), 10, True))
).explode('h3_polyfill')