Basically, I have a pandas dataframe (from a CSV) with a list of zip codes. Now I need to sort them into the corresponding freight zones which are defined by individual ranges, an example is given below. In total I got 126 zip code ranges I need to define, but for simplicity, I am shortening them.
E.g.
Freight Area 1: Zip Codes 10000 - 19999, 40000 - 49999
Freight Area 2: Zip Codes 30000 - 39999, 60000 - 69999
Freight Area 3: Zip Codes 20000 - 29999, 50000 - 59999
Freight Area 4: Zip Codes 00000 - 09999, 70000 - 79999
Unknown freight zone: All other zip codes.
What would be a good approach to sort the zip codes of a CSV into the mentioned freight areas?
I already tried iterating over all 1500 entries of my dataframe with a for loop and a switch statement. But it doesn't seem to be effective.
Also, I tried to use the pd.cut(), but I always get the following error: ValueError: bins must increase monotonically.
The bins are sorted in ascending, I rechecked them by sorting them in excel and copying them again.
How would you face such an issue?
I am not looking for a finished code, just for some ideas to solve the sorting into individual ranges. I want to avoid having a huge list of 10.000 zip codes since changes would be annoying to apply and it would be a big list.
Edit: Sample Data:
ID,zip
PD0001,14989
PD0002,35345
PD0003,23619
PD0004,1456
PD0005,93451
PD0006,23112
PD0007,59838
PD0008,9445
PD0009,23119
PD0010,68990
Its a simple CSV, sadly I cant upload the file itself. The customer data itself is more complex but I am already shortening it.
CodePudding user response:
Use pd.cut
with fillna
bins = [0, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000]
labels = ['Area 4', 'Area 1', 'Area 3', 'Area 2', 'Area 1', 'Area 3', 'Area 2', 'Area 4']
df['frieght_area'] = pd.cut(
df.zip,
bins,
labels=labels,
ordered=False,
include_lowest=True,
right=False
)
df['frieght_area'] = df.frieght_area.cat.add_categories('Unknown').fillna('Unknown')
Output
ID zip frieght_area
0 PD0001 14989 Area 1
1 PD0002 35345 Area 2
2 PD0003 23619 Area 3
3 PD0004 1456 Area 4
4 PD0005 93451 Unknown
5 PD0006 23112 Area 3
6 PD0007 59838 Area 3
7 PD0008 9445 Area 4
8 PD0009 23119 Area 3
9 PD0010 68990 Area 2
Note: Also, it is not a good idea to use a Python keyword as the column name. Use zip_code
instead of zip
CodePudding user response:
I think in this case you should use an IntervalIndex
to cut, getting a Categorical, then you can map to the labels. Unfortunately you will have to type in the 126 bin definitions by hand (unless you can parse the PDF in some way), but let's say you convert what you have into a form like this:
bins = pd.IntervalIndex.from_tuples([
(10000, 19999), (40000, 49999),
(30000, 39999), (60000, 69999),
(20000, 29999), (50000, 59999),
(0, 9999), (70000, 79999)])
labels = ['Area 1', 'Area 1',
'Area 2', 'Area 2',
'Area 3', 'Area 3',
'Area 4', 'Area 4']
Then let's say your data looks like:
df = pd.DataFrame({'zip_code': ['14989', '35345', '23619', '01456', '93451', '23112', '59838', '09445', '23119', '68990']})
You can get the area for all of them like this:
df['area'] = pd.cut(df['zip_code'].astype(int), bins).map(dict(zip(bins, labels)))
Explanation: the IntervalIndex doesn't need to be sorted, and can handle gaps in the intervals (which will be NaN). The categories are intervals, so we map them to a dictionary of interval: freight code
to get the desired output.
Output:
zip_code area
0 14989 Area 1
1 35345 Area 2
2 23619 Area 3
3 01456 Area 4
4 93451 NaN
5 23112 Area 3
6 59838 Area 3
7 09445 Area 4
8 23119 Area 3
9 68990 Area 2