More efficient implementations of looping code for matching function in Python


I have two datasets - one has a census tract number and a zip code for each observation, and the other has only census tract numbers. I'm trying to add zip codes to the observations that don't already have them. Currently I do this by looping through the census tract numbers of both datasets and, whenever they match, copying the zip code from the dataset that has zip codes into the one that doesn't (by appending it to a list that is eventually converted into a pandas column). As I understand it, this is quite inefficient, and since my datasets have about 70,000 and 170,000 rows respectively, I'd like to find a more efficient (ideally the most efficient) implementation of this idea.

This is the code I have currently:

import numpy as np


def match(str1, str2) -> bool:
    return str1 == str2


# Work on small previews of each dataframe while testing
zip_tract_df_dummy = zip_tract_df.head()
food_access_df_dummy = food_access_df.head()

# Convert both tract columns to lists of strings so they can be compared
tracts_zips = np.array(zip_tract_df_dummy['TRACT']).astype(str).tolist()
food_access_tracts = np.array(food_access_df_dummy['CensusTract']).astype(str).tolist()

# For each food-access tract, scan the ZIP/tract data for a matching tract
zips_for_tracts = []
for i in range(len(food_access_tracts)):
    for j in range(len(tracts_zips)):
        if match(food_access_tracts[i], tracts_zips[j]):
            print("Matched")
            zips_for_tracts.append(zip_tract_df_dummy.at[j, 'ZIP'])
            break
        else:
            print("No matches")

food_access_df_dummy = food_access_df_dummy.assign(ZIP=zips_for_tracts)

Here are some snapshots of the datasets:

### zip_tract_df_dummy
    ZIP        TRACT  RES_RATIO  BUS_RATIO  OTH_RATIO  TOT_RATIO
0   501  36103158607   0.000000   1.000000   0.000000   1.000000
1   601  72001956700   0.623135   0.416021   0.433121   0.605512
2   601  72113071700   0.188169   0.211886   0.133758   0.188271
3   601  72001956800   0.017729   0.025840   0.089172   0.020029
4   601  72001956600   0.166052   0.341085   0.337580   0.181221

### food_access_df_dummy[['CensusTract', 'State']]
    CensusTract State
0   1001020100  Alabama
1   1001020200  Alabama
2   1001020300  Alabama
3   1001020400  Alabama
4   1001020500  Alabama

Would someone be able to show me how I might implement this more efficiently, and if it wouldn't be too much more to ask, possibly point me towards some resources that talk about how to write computationally-efficient code?

CodePudding user response:

As for the most efficient way to solve this problem: since your data is already in Pandas dataframes, you can take advantage of the methods Pandas has built in!

This is a classic problem of "joining" two datasets in a database. Under the hood, Pandas uses optimized C code to perform this operation very efficiently.

Give this a shot and see if it performs better on your dataset:

final_df = food_access_df_dummy.merge(
    zip_tract_df_dummy,
    left_on="CensusTract",
    right_on="TRACT",
    how="inner"
)[["CensusTract", "State", "ZIP"]]

Explanation:

merge combines the two dataframes, matching values in the column named by "left_on" in the first (left) dataframe with values in the column named by "right_on" in the second (right) dataframe. The "how" keyword tells merge what kind of join to perform; "inner" means it only returns rows that have matches in both dataframes. See the pandas merge documentation for more details.
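
One detail worth double-checking: if CensusTract and TRACT end up with different dtypes (for example, integers in one dataframe and strings in the other), the merge will silently find no matches. Here's a rough variant that casts both keys to strings (mirroring the string comparison in your loop) and uses a left join so every food-access row is kept, with ZIP as NaN where no tract matches; whether you want the inner or left behavior depends on your data:

# Sketch only: the str casts and how="left" are assumptions about your data,
# not requirements; the inner join above drops rows with no matching tract.
final_df = food_access_df_dummy.assign(
    CensusTract=food_access_df_dummy["CensusTract"].astype(str)
).merge(
    zip_tract_df_dummy.assign(TRACT=zip_tract_df_dummy["TRACT"].astype(str)),
    left_on="CensusTract",
    right_on="TRACT",
    how="left",
)[["CensusTract", "State", "ZIP"]]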

The topic of writing computationally efficient code is very broad, and many resources are available, but it's hard to recommend any without getting into specifics. One thing I would absolutely recommend for improving your code's performance is reading the documentation for the libraries you use (like Pandas in this case), because the developers of those libraries have spent a long time tuning their code to perform well. Standing on the shoulders of giants, you could say!

CodePudding user response:

How I might implement this more efficiently

Since your keys are strings, you can make a dictionary for constant-time lookups:

# Map each tract string to its row index in zip_tract_df_dummy (built once)
tract_zip_reverse_index = {tract_zip: i
                           for i, tract_zip in enumerate(tracts_zips)}

# Single pass: each dictionary lookup takes constant time on average
zips_for_tracts = []
for food_tract_zip in food_access_tracts:
    if food_tract_zip in tract_zip_reverse_index:
        zips_for_tracts.append(
            zip_tract_df_dummy.at[tract_zip_reverse_index[food_tract_zip], 'ZIP'])

This replaces your entire nested loop. The dictionary's keys are the tract strings, and its values are their indices in your original tracts_zips list.

Instead of taking time proportional to the product of the lengths (70k * 170k, roughly 12 billion comparisons), this takes time proportional to the sum of the lengths of the lists (70k + 170k = 240k).
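
If you'd rather stay within pandas, the same constant-time lookup can be written with a dict plus Series.map; here's a rough sketch, assuming (as in your original code) that the tract values are compared as strings:

# Build a tract -> ZIP dictionary once; dict lookups are O(1) on average,
# so mapping it over the other column is a single linear pass.
tract_to_zip = dict(zip(zip_tract_df_dummy['TRACT'].astype(str),
                        zip_tract_df_dummy['ZIP']))

# Rows whose tract is not in the dictionary get NaN in the new column.
food_access_df_dummy['ZIP'] = (
    food_access_df_dummy['CensusTract'].astype(str).map(tract_to_zip)
)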

possibly point me towards some resources that talk about how to write computationally-efficient code

This is an enormous subject, and unfortunately too broad to answer here. It's impossible to give good advice without more personal context, which would be off-topic for this site (but is the focus of other sites on the internet).
