How to efficiently browse and compare the values of a very large dictionary with the elements of a list


I have a dictionary (list_image_dict) and a list (structures.geometry). I want to compare each value in the dictionary with the values in the list, perform a quick operation on it, and replace the value in the dictionary.

But list_image_dict contains 300623 key/value pairs, so iterating over all values and comparing each of them with every element of structures.geometry takes a very long time: several tens of minutes. My question is: how can I improve the execution speed?

I tried multiprocessing with 16 cores directly on the list_image list, comparing each element of the list with the elements of structures.geometry in parallel. It is a little faster, but still very slow (still several tens of minutes). structures.geometry contains only 156 elements.

import pandas as pd
import geopandas as gpd

def make_sort(layers, structures, threshold):
    def coverage(a, b):
        return a.area / b.area * 100

    def label(polygon):
        return structures.loc[structures.geometry == polygon, "label"].values[0]

    frames = pd.concat([*layers], ignore_index=True)
    frames.crs = "epsg:3857"
    frames = frames.to_crs(epsg=4326)
    main = gpd.GeoDataFrame(geometry=frames.geometry)

    list_image = main.geometry.tolist()
    # list_image has 300623 elements.
    list_image_dict = {images.wkt: images for images in list_image}

    for key, value in list_image_dict.items():  # main loop over list_image_dict
        liste = []
        for item in structures.geometry:  # structures has 156 elements.
            if value.intersects(item):
                x = value.intersection(item)
                # For a certain threshold coverage (in percent) present on the
                # image, the polygon is added to liste.
                if coverage(x, item) >= threshold:
                    liste.append([x, str(label(item))])
        list_image_dict[key] = liste
    return list_image_dict

With the help of people in the comments, the version below shaves off a few minutes, but it is still very slow.

from collections import ChainMap

def make_sort(layers, structures, threshold):
    def coverage(a, b):
        return a.area / b.area * 100

    label = structures["label"].to_list()
    geom = structures["geometry"].to_list()

    frames = pd.concat([*layers], ignore_index=True)
    frames.crs = "epsg:3857"
    frames = frames.to_crs(epsg=4326)
    main = gpd.GeoDataFrame(geometry=frames.geometry)

    final = []
    for elem in main.geometry:
        liste = []
        for item in structures.geometry:
            x = elem.intersection(item)  # compute the intersection only once
            if coverage(x, item) >= threshold:
                liste.append([x, label[geom.index(item)]])
        final.append({elem.wkt: liste})

    result = dict(ChainMap(*final))
    return result
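One inexpensive improvement to the version above is to pair each structure geometry with its label once, instead of calling geom.index(item) (a linear scan over the 156 structures) on every hit, and to compute each intersection only once. A minimal sketch of that idea; make_sort_zip and its parameter names are illustrative, not from the original post:

```python
def coverage(a, b):
    return a.area / b.area * 100

def make_sort_zip(image_geoms, structure_geoms, structure_labels, threshold):
    # Pair each structure geometry with its label up front, so the inner
    # loop never has to search for the label by geometry.
    pairs = list(zip(structure_geoms, structure_labels))
    result = {}
    for elem in image_geoms:
        liste = []
        for item, lab in pairs:
            x = elem.intersection(item)  # compute the intersection once
            if coverage(x, item) >= threshold:
                liste.append([x, lab])
        result[elem.wkt] = liste
    return result
```

This keeps the same WKT-keyed output as the original code, but removes the repeated list searches; the nested loop itself is still O(300623 × 156), which is why the spatial-index answer below the question is the real fix.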

CodePudding user response:

IIUC, you now have 2 GeoDataFrames: main (300623 polygons) and structures (156 polygons). First you want to find the intersections, then keep only the polygons whose coverage is greater than the threshold. The bottleneck is finding the intersections of one polygon from structures with the 300K polygons of main.

I think the better solution is to use a spatial index (an R-tree). For that, you need to install PyGeos to get access to main.sindex.

To quickly find which polygons intersect another one:

for idx, item in structures.iterrows():
    print(item['label'])

    # All polygons that intersect this structure...
    indexes = main.sindex.query(item['geometry'], predicate='intersects')
    # ...whose intersection covers the structure by at least threshold percent
    indexes = main.iloc[indexes, main.columns.get_loc('geometry')] \
                  .apply(lambda g: coverage(g.intersection(item['geometry']),
                                            item['geometry'])) \
                  .ge(threshold) \
                  .loc[lambda x: x].index
    # Do stuff here
    ...
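To connect this back to the question, here is a sketch that assembles the same WKT-keyed dictionary as the original make_sort, but driven by the spatial index. It assumes a geopandas version where main.sindex.query accepts a predicate argument (0.10+); make_sort_sindex is a hypothetical name, not from the original post:

```python
import geopandas as gpd

def coverage(a, b):
    return a.area / b.area * 100

def make_sort_sindex(main, structures, threshold):
    # Same output shape as the original make_sort: WKT -> list of
    # [intersection, label] pairs.
    result = {geom.wkt: [] for geom in main.geometry}
    geoms = main.geometry.values
    for _, item in structures.iterrows():
        # One R-tree query per structure replaces ~300K intersects() calls.
        hits = main.sindex.query(item["geometry"], predicate="intersects")
        for i in hits:
            x = geoms[i].intersection(item["geometry"])
            if coverage(x, item["geometry"]) >= threshold:
                result[geoms[i].wkt].append([x, str(item["label"])])
    return result
```

The outer loop now runs 156 times over the small GeoDataFrame, and each R-tree query returns only the candidate polygons, so the expensive intersection is computed for a small subset instead of all 300623 geometries.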