Home > other >  How could I sort zip codes of a data frame into freight areas in python?
How could I sort zip codes of a data frame into freight areas in python?

Time:10-03

Basically, I have a pandas dataframe (from a CSV) with a list of zip codes. Now I need to sort them into the corresponding freight zones which are defined by individual ranges, an example is given below. In total I got 126 zip code ranges I need to define, but for simplicity, I am shortening them.

E.g.
Freight Area 1: Zip Codes 10000 - 19999, 40000 - 49999
Freight Area 2: Zip Codes 30000 - 39999, 60000 - 69999
Freight Area 3: Zip Codes 20000 - 29999, 50000 - 59999
Freight Area 4: Zip Codes 00000 - 09999, 70000 - 79999
Unknown freight zone: All other zip codes.

What would be a good approach to sort the zip codes of a CSV into the mentioned freight areas?

I already tried iterating over all 1500 entries of my dataframe with a for loop and a switch statement. But it doesn't seem to be effective.

Also, I tried to use the pd.cut(), but I always get the following error: ValueError: bins must increase monotonically.

The bins are sorted in ascending, I rechecked them by sorting them in excel and copying them again.

How would you face such an issue?

I am not looking for a finished code, just for some ideas to solve the sorting into individual ranges. I want to avoid having a huge list of 10.000 zip codes since changes would be annoying to apply and it would be a big list.

Edit: Sample Data:

ID,zip
PD0001,14989
PD0002,35345
PD0003,23619
PD0004,1456
PD0005,93451
PD0006,23112
PD0007,59838
PD0008,9445
PD0009,23119
PD0010,68990

Its a simple CSV, sadly I cant upload the file itself. The customer data itself is more complex but I am already shortening it.

CodePudding user response:

Use pd.cut with fillna

bins = [0, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000]
labels = ['Area 4', 'Area 1', 'Area 3', 'Area 2', 'Area 1', 'Area 3', 'Area 2', 'Area 4']
df['frieght_area'] = pd.cut(
    df.zip,
    bins,
    labels=labels,
    ordered=False,
    include_lowest=True,
    right=False
)
df['frieght_area'] = df.frieght_area.cat.add_categories('Unknown').fillna('Unknown')

Output

       ID    zip frieght_area
0  PD0001  14989       Area 1
1  PD0002  35345       Area 2
2  PD0003  23619       Area 3
3  PD0004   1456       Area 4
4  PD0005  93451      Unknown
5  PD0006  23112       Area 3
6  PD0007  59838       Area 3
7  PD0008   9445       Area 4
8  PD0009  23119       Area 3
9  PD0010  68990       Area 2

Note: Also, it is not a good idea to use a Python keyword as the column name. Use zip_code instead of zip

CodePudding user response:

I think in this case you should use an IntervalIndex to cut, getting a Categorical, then you can map to the labels. Unfortunately you will have to type in the 126 bin definitions by hand (unless you can parse the PDF in some way), but let's say you convert what you have into a form like this:

bins = pd.IntervalIndex.from_tuples([
        (10000, 19999), (40000, 49999),
        (30000, 39999), (60000, 69999),
        (20000, 29999), (50000, 59999), 
        (0, 9999), (70000, 79999)])

labels = ['Area 1', 'Area 1', 
          'Area 2', 'Area 2', 
          'Area 3', 'Area 3', 
          'Area 4', 'Area 4']

Then let's say your data looks like:

df = pd.DataFrame({'zip_code': ['14989', '35345', '23619', '01456', '93451', '23112', '59838', '09445', '23119', '68990']})

You can get the area for all of them like this:

df['area'] = pd.cut(df['zip_code'].astype(int), bins).map(dict(zip(bins, labels)))

Explanation: the IntervalIndex doesn't need to be sorted, and can handle gaps in the intervals (which will be NaN). The categories are intervals, so we map them to a dictionary of interval: freight code to get the desired output.

Output:

    zip_code    area
0   14989   Area 1
1   35345   Area 2
2   23619   Area 3
3   01456   Area 4
4   93451   NaN
5   23112   Area 3
6   59838   Area 3
7   09445   Area 4
8   23119   Area 3
9   68990   Area 2
  • Related