I have a dataset, for simplicity I will indicate only one main feature - postalCode. And I need to get another one feature (main post office of this area) through function call and add to dataframe (sample).
Both are integers.
postalCode | mainPostCode |
---|---|
12345 | 12301 |
23456 | 23407 |
34567 | 34504 |
Some words about function: it takes first 3 digits of postalCode and then takes from list of all zipcodes minimum value, that starts from this 3 digits.
You will not always find in this list a value that will look like XXX01, that can be XXX05 or XXX07 or XXX(any other). Let's assume it can be any number.
List of zipcodes looks like that (about 40K elements):
zipcode = [1001,1002,...,99999]
My function looks like:
def findMainPostOffice(num):
''' takes zip and returns nearest available main zip in list 'zipcode' '''
start = int(str(num // 100) '00')
m = min([i for i in zipcode if i > start and i < num], default=num)
return m
I call this function like:
df['mainPostCode'] = df.postalCode.apply(findMainPostOffice)
The problem is that this function takes a very long time. On my dataset it should take about 72 hours. Could you please help me to speed up this.
CodePudding user response:
IIUC, you can use groupby
to find the minimum (the main postal code)
df['mainPostCode'] = (df.groupby(df['postalCode'].astype(str).str.zfill(5).str[:2])
.transform('min'))
print(df)
# Output
postalCode mainPostCode
0 23041 23003
1 48558 48000
2 52895 52000
3 39817 39000
4 40427 40000
... ... ...
39995 81184 81000
39996 7125 7001
39997 22773 22003
39998 88802 88002
39999 58510 58000
[40000 rows x 2 columns]
Input:
import pandas as pd
import numpy as np
np.random.seed(2023)
df = pd.DataFrame({'postalCode': np.random.randint(1000, 100000, 40000)})
CodePudding user response:
You should try to move as much computation out of the function as possible.
For any prefix, we want the lowest main postal office, so we can create a map of prefixes to main postal codes. Since there's only 1000
possible prefixes, this doesn't take up that much space.
One approach is to create a dict of prefixes to all possible zipcodes. For the list of zipcodes [10001, 10002, 20010, 20004]
, we create the map:
{
100: 10001,
200: 20004,
}
We don't care about zip codes 10001
or 20010
, because we would never return them.
By only creating the map once, and using it multiple times, we don't have to check the entire list each time we search for a zipcode.
Here's the code to generate the map:
zipcodes = [10000, 99999]
prefix_map = {}
for z in zipcodes:
prefix = z // 100
if prefix in prefix_map:
prefix_map[prefix] = min(prefix_map[prefix], z)
else:
prefix_map[prefix] = z
and here's code that uses the prefix_map
def findMainPostOffice(num):
global prefix_map
prefix = num // 100
if prefix in prefix_map and prefix_map[prefix] < num:
return prefix_map[prefix]
return num