How to speed up a Python function call

I have a dataset; for simplicity I will show only its main feature, postalCode. I need to derive one more feature (the main post office of the area) through a function call and add it to the dataframe (sample below).

Both are integers.

postalCode	mainPostCode
12345	12301
23456	23407
34567	34504

A few words about the function: it takes the first 3 digits of postalCode and then picks, from the list of all zipcodes, the minimum value that starts with those 3 digits.

The list does not always contain a value of the form XXX01; the minimum can be XXX05, XXX07, or XXX followed by any other two digits. Let's assume it can be any number.

The list of zipcodes looks like this (about 40K elements):

zipcode = [1001,1002,...,99999]

My function looks like:

def findMainPostOffice(num):

    ''' takes zip and returns nearest available main zip in list 'zipcode' '''

    start = int(str(num // 100) + '00')
    m = min([i for i in zipcode if i > start and i < num], default=num)
    return m

I call this function like:

df['mainPostCode'] = df.postalCode.apply(findMainPostOffice)

The problem is that this function takes a very long time: on my dataset it would take about 72 hours. Could you please help me speed it up?

CodePudding user response：

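IIUC, you can use groupby to find the minimum (the main postal code) within each prefix group. Note that this groups on the first two digits of the zero-padded code and takes the minimum from the codes in the dataframe itself:

df['mainPostCode'] = (df.groupby(df['postalCode'].astype(str).str.zfill(5).str[:2])['postalCode']
                        .transform('min'))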
print(df)

# Output
       postalCode  mainPostCode
0           23041         23003
1           48558         48000
2           52895         52000
3           39817         39000
4           40427         40000
...           ...           ...
39995       81184         81000
39996        7125          7001
39997       22773         22003
39998       88802         88002
39999       58510         58000

[40000 rows x 2 columns]

Input:

import pandas as pd
import numpy as np

np.random.seed(2023)
df = pd.DataFrame({'postalCode': np.random.randint(1000, 100000, 40000)})
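
If the three-digit prefix from the question is wanted instead of two digits, the same groupby idea can be sketched with integer division, which avoids the string conversion. This is only an illustration under assumptions: the sample values below are made up for the example, and the minimum is still taken from the codes present in the dataframe, not from a separate zipcode list.

```python
import pandas as pd

# Hypothetical sample data, chosen only to illustrate the grouping.
df = pd.DataFrame({'postalCode': [12345, 12301, 12399, 23456, 23407, 7125, 7101]})

# Dropping the last two digits gives the first three digits of the
# zero-padded five-digit code (7125 -> 71, i.e. '071').
prefix = (df['postalCode'] // 100).rename('prefix')

# Smallest code within each prefix group, broadcast back to every row.
df['mainPostCode'] = df.groupby(prefix)['postalCode'].transform('min')
print(df)
```

Because the grouping key is an integer Series, no zero-padding step is needed for four-digit codes.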

CodePudding user response：

You should try to move as much computation out of the function as possible.

For any prefix, we want the lowest postal code, so we can create a map from prefixes to main postal codes. Since there are at most 1000 possible prefixes, this doesn't take up much space.

One approach is to create a dict that maps each prefix to the smallest zipcode with that prefix. For the list of zipcodes [10001, 10002, 20010, 20004], we create the map:

{
    100: 10001,
    200: 20004,
}

We don't care about zip codes 10002 or 20010, because we would never return them.

By only creating the map once, and using it multiple times, we don't have to check the entire list each time we search for a zipcode.

Here's the code to generate the map:

zipcodes = [10000, 99999]  # placeholder; use the full list of zipcodes here

prefix_map = {}

for z in zipcodes:
    prefix = z // 100
    if prefix in prefix_map:
        prefix_map[prefix] = min(prefix_map[prefix], z)
    else:
        prefix_map[prefix] = z

And here's the code that uses prefix_map:

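def findMainPostOffice(num):
    ''' returns the smallest zipcode sharing num's prefix if it is below num, else num '''
    prefix = num // 100
    # prefix_map is only read here, so no `global` declaration is needed
    main = prefix_map.get(prefix)
    if main is not None and main < num:
        return main
    return num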
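
The lookup can also be applied to the whole column at once instead of row by row with apply. The following is a minimal sketch, assuming the small example list from above as a stand-in for the real zipcodes and a made-up dataframe; `candidate` and `use_candidate` are names introduced here for illustration only.

```python
import pandas as pd

# Stand-in data: the small example list from above and a hypothetical dataframe.
zipcodes = [10001, 10002, 20010, 20004]
df = pd.DataFrame({'postalCode': [10002, 10001, 20050, 20002, 30000]})

# Build the prefix -> minimum zipcode map once.
prefix_map = {}
for z in zipcodes:
    p = z // 100
    prefix_map[p] = min(prefix_map.get(p, z), z)

# Look up each row's prefix; prefixes missing from the map become NaN.
candidate = (df['postalCode'] // 100).map(prefix_map)

# Keep the candidate only when it is strictly below the row's own code,
# otherwise fall back to the code itself (same rule as findMainPostOffice).
use_candidate = candidate < df['postalCode']
df['mainPostCode'] = candidate.where(use_candidate, df['postalCode']).astype('int64')
print(df)
```

One difference from the function in the question is worth checking against your data: the question's loop requires `i > start`, so it never returns a code ending in 00, whereas the prefix map can contain such a code and return it.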