I have a dictionary (list_image_dict) and a list (structures.geometry). I want to compare each value in the dictionary with the values in the list, perform a quick operation on it, and replace the value in the dictionary. But list_image_dict contains 300623 key/value pairs, so going through all the values and comparing each one with every element of structures.geometry takes very long: several tens of minutes. My question is how to improve the execution speed.
I tried multiprocessing with 16 cores, applied directly to the list_image list: each element of the list is compared in parallel with the elements of structures.geometry. It is a little faster, but still very slow (still several tens of minutes). structures.geometry contains only 156 elements.
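For reference, the parallel attempt was roughly shaped like this (a sketch rather than the exact code; compare_one is a placeholder name, and it assumes structures is visible to the worker processes, e.g. through fork on Linux):

from multiprocessing import Pool

def compare_one(polygon):
    # compare one image polygon against all 156 structure geometries
    return [polygon.intersection(item)
            for item in structures.geometry
            if polygon.intersects(item)]

with Pool(16) as pool:
    results = pool.map(compare_one, list_image)

The sequential function is the following: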
import pandas as pd
import geopandas as gpd

def make_sort(layers, structures, threshold):
    def coverage(a, b):
        return a.area / b.area * 100

    def label(polygon):
        return structures.loc[structures.geometry == polygon, "label"].values[0]

    frames = pd.concat([*layers], ignore_index=True)
    frames.crs = "epsg:3857"
    frames = frames.to_crs(epsg=4326)
    main = gpd.GeoDataFrame(geometry=frames.geometry)
    list_image = main.geometry.tolist()
    # list_image has 300623 elements.
    list_image_dict = {image.wkt: image for image in list_image}

    for key, value in list_image_dict.items():  # main loop on list_image_dict
        liste = []
        for item in structures.geometry:  # structures has 156 elements.
            if value.intersects(item):
                x = value.intersection(item)
                # For a given coverage threshold (in percent) on the image,
                # the intersection polygon is added to liste.
                if coverage(x, item) >= threshold:
                    liste.append([x, str(label(item))])
        list_image_dict[key] = liste
    return list_image_dict
With the help of people in the comments, the following version shaves a few minutes off, but it is still very slow:
from collections import ChainMap

import pandas as pd
import geopandas as gpd

def make_sort(layers, structures, threshold):
    def coverage(a, b):
        return a.area / b.area * 100

    label = structures["label"].to_list()
    geom = structures["geometry"].to_list()
    frames = pd.concat([*layers], ignore_index=True)
    frames.crs = "epsg:3857"
    frames = frames.to_crs(epsg=4326)
    main = gpd.GeoDataFrame(geometry=frames.geometry)

    final = []
    for elem in main.geometry:
        liste = []
        for item in structures.geometry:
            inter = elem.intersection(item)  # compute the intersection once
            if coverage(inter, item) >= threshold:
                # note: geom.index(item) is a linear search over 156 elements
                liste.append([inter, label[geom.index(item)]])
        final.append({elem.wkt: liste})
    result = dict(ChainMap(*final))
    return result
CodePudding user response:
IIUC, you now have 2 GeoDataFrames: main (300623 polygons) and structures (156 polygons). First you want to find the intersections, then select only the polygons whose coverage is greater than or equal to the threshold. The bottleneck is finding the intersections of each polygon from structures with the 300K polygons of main.
I think the better solution is to use a spatial index (an R-tree). For that, you need to install PyGEOS to get access to main.sindex.
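For example, a minimal check (assuming a geopandas version >= 0.8, where the PyGEOS backend is optional and enabled once the package is installed):

# pip install pygeos
import geopandas as gpd
print(gpd.options.use_pygeos)  # True once geopandas has picked up PyGEOS
print(type(main.sindex))       # the spatial index backend now in use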
To quickly find which polygons intersect another:
# coverage() is the helper from the question: a.area / b.area * 100
for idx, item in structures.iterrows():
    print(item['label'])
    # All polygons of main that intersect this structure...
    indexes = main.sindex.query(item['geometry'], predicate='intersects')
    # ...then keep only those whose intersection with the structure
    # covers it at greater than or equal to threshold percent
    indexes = main.iloc[indexes, main.columns.get_loc('geometry')] \
        .apply(lambda g: coverage(g.intersection(item['geometry']),
                                  item['geometry'])) \
        .ge(threshold) \
        .loc[lambda x: x].index
    # Do stuff here
    ...
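Putting it together, here is one way this loop could rebuild the dictionary from the question (a sketch under the same assumptions; result and inter are names I introduce here):

from collections import defaultdict

result = defaultdict(list)
for idx, item in structures.iterrows():
    candidates = main.sindex.query(item['geometry'], predicate='intersects')
    for geom in main.iloc[candidates].geometry:
        inter = geom.intersection(item['geometry'])
        if coverage(inter, item['geometry']) >= threshold:
            # same [intersection, label] pairs as the original code,
            # keyed by the WKT of the image polygon
            result[geom.wkt].append([inter, str(item['label'])])

Because the outer loop now runs only 156 times and the R-tree prunes the 300K polygons down to the few candidates that can actually intersect, the quadratic scan disappears.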