Faster solution for checking if value is already in another dataframe with numpy

This is my code right now:

for i, row in gdf_pot.iterrows():
    if row.ID not in historized.ID.to_list():
        gdf_pot.at[i, "FLAG_NEW"] = 1
    else:
        gdf_pot.at[i, "FLAG_NEW"] = 0

It's very slow because the dataframe is very big.

I saw some solutions with np.where, but I couldn't make them work.

Maybe you have some ideas?

Thanks.

CodePudding user response：

One way is to build a boolean mask with pandas.Series.isin and convert it with pandas.Series.astype.

This works on gdf_pot directly because the core data structure in GeoPandas, geopandas.GeoDataFrame, is a subclass of pandas.DataFrame.

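gdf_pot["FLAG_NEW"] = (~gdf_pot["ID"].isin(historized["ID"])).astype(int)

NB: isin() returns True for IDs that are already in historized, so the mask is negated with ~ to match your loop, where FLAG_NEW is 1 for IDs that are not in historized yet. True behaves as 1 and False as 0 in Python, so astype(int) maps the boolean Series to those two numbers.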
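
Since you mentioned np.where, here is a small self-contained sketch on made-up data that shows both variants side by side. The two toy DataFrames and the names already_known and flag_np are assumptions for illustration only; plain pandas DataFrames stand in for your GeoDataFrame. The flag follows the loop in the question: 1 when the ID is not in historized, 0 otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for gdf_pot and historized.
gdf_pot = pd.DataFrame({"ID": [1, 2, 3, 4, 5]})
historized = pd.DataFrame({"ID": [2, 4, 6]})

# Boolean mask: True where the ID already exists in historized.
already_known = gdf_pot["ID"].isin(historized["ID"])

# Variant 1: negate the mask and cast to int (1 = new, 0 = already known).
gdf_pot["FLAG_NEW"] = (~already_known).astype(int)

# Variant 2: the same flag built with np.where on the mask.
flag_np = np.where(already_known, 0, 1)

print(gdf_pot)
```

Both variants do the membership test once for the whole column, whereas the loop rebuilds historized.ID.to_list() and scans it on every row, which is likely where most of the time goes.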