How do I check for an integer value in a column in 1 dataframe to exist in a range split between 2 c-CodePudding

To explain the question better:

I have 2 dataframes:

DF1 (master):

    CodeRange                                             Sector Start   End
0   0100-0999                  Agriculture, Forestry and Fishing  0100  0999
1   1000-1499                                             Mining  1000  1499
2   1500-1799                                       Construction  1500  1799
3   1800-1999                                           not used  1800  1999
4   2000-3999                                      Manufacturing  2000  3999
5   4000-4999  Transportation, Communications, Electric, Gas ...  4000  4999
6   5000-5199                                    Wholesale Trade  5000  5199
7   5200-5999                                       Retail Trade  5200  5999
8   6000-6799                 Finance, Insurance and Real Estate  6000  6799
9   7000-8999                                           Services  7000  8999
10  9100-9729                              Public Administration  9100  9729
11  9900-9999                                    Nonclassifiable  9900  9999

and DF2:

    SICCode Sector
0   1230    Agro
1   4974    Utils
2   5120    shops
3   9997    Utils

In DF1, I was able to split the "CodeRange" column values into 2 columns ("Start" and "End") and have converted them to int.
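
For reference, the split described above might look like the following sketch. It rebuilds three rows of DF1 for illustration only (the variable name df1 is an assumption) and assumes every CodeRange value has the form start-end.

```python
import pandas as pd

# Three sample rows from DF1; the variable name df1 is an assumption.
df1 = pd.DataFrame({
    "CodeRange": ["0100-0999", "1000-1499", "1500-1799"],
    "Sector": ["Agriculture, Forestry and Fishing", "Mining", "Construction"],
})

# Split "0100-0999" on the hyphen into two string columns, then cast to int.
df1[["Start", "End"]] = df1["CodeRange"].str.split("-", expand=True).astype(int)
```

After the cast the leading zeros are gone, so "0100" becomes the integer 100.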

I basically want to check, for each SICCode in DF2, which range in DF1 it falls in, and then update the "Sector" value in DF2 with the corresponding value from the "Sector" column in DF1.

The final DF2 should look like:

DF2:

    SICCode Sector
0   1230    Mining
1   4974    Transportation, Communication...
2   5120    Wholesale Trade
3   9997    Nonclassifiable

CodePudding user response：

You can probably optimise this by using masks, but the following loop does the job:

data = []
for i in range(len(df2)):
    code = df2["SICCode"].iloc[i]
    for j in range(len(df1)):
        start = df1["Start"].iloc[j]
        end = df1["End"].iloc[j]
        if start <= code <= end:
            data.append(df1["Sector"].iloc[j])
            break  # range found, move on to the next i
    else:
        data.append(None)  # no range matched; keeps data aligned with df2

df2["Sector"] = data
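
The mask idea mentioned above could look roughly like this. It is only a sketch: the two small frames are made-up samples (1850 is included to show a code that matches no range), and it assumes the ranges in df1 do not overlap.

```python
import numpy as np
import pandas as pd

# Small illustrative frames; the values are samples, not the full tables.
df1 = pd.DataFrame({
    "Sector": ["Mining", "Wholesale Trade", "Nonclassifiable"],
    "Start": [1000, 5000, 9900],
    "End": [1499, 5199, 9999],
})
df2 = pd.DataFrame({"SICCode": [1230, 5120, 9997, 1850]})

codes = df2["SICCode"].to_numpy()[:, None]  # shape (len(df2), 1)
starts = df1["Start"].to_numpy()            # shape (len(df1),)
ends = df1["End"].to_numpy()

# mask[i, j] is True when code i lies inside range j (both ends inclusive).
mask = (codes >= starts) & (codes <= ends)

# argmax picks the first True in each row; rows without any True get None.
first_hit = mask.argmax(axis=1)
sectors = df1["Sector"].to_numpy()[first_hit]
df2["Sector"] = np.where(mask.any(axis=1), sectors, None)
```

The mask holds len(df2) x len(df1) booleans, which is fine for a lookup table of a dozen ranges but may get large for big tables.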

CodePudding user response：

A slightly more compact solution, without loops.

The key is to create a helper column 'start_idx' by integer-dividing the numbers by 1000, which gives us something to merge on. Subsequently, we check whether the SICCode lies within the range; when it doesn't, we leave the division blank.

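import numpy as np

# Bucket both frames by thousands so they can be merged on a shared key.
df3 = df1.assign(start_idx=df1['Start'] // 1000).merge(
    df2.assign(start_idx=df2['SICCode'] // 1000), on='start_idx', how='left')
# Keep DF2's label only where the code falls inside the row's range (inclusive).
df3['Divison'] = np.where((df3['SICCode'] >= df3['Start']) &
                          (df3['SICCode'] <= df3['End']), df3['Sector_y'], "")
# errors='ignore' skips listed columns that are absent from your frames.
df3.drop(columns=['start_idx', 'x_y', 'SICCode', 'Sector_y'], errors='ignore')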

    x_x  CodeRange  Sector_x                                           Start  End   Divison
0   0    0100-0999  Agriculture, Forestry and Fishing                  100    999
1   1    1000-1499  Mining                                             1000   1499  Agro
2   2    1500-1799  Construction                                       1500   1799
3   3    1800-1999  not used                                           1800   1999
4   4    2000-3999  Manufacturing                                      2000   3999
5   5    4000-4999  Transportation, Communications, Electric, Gas ...  4000   4999  Utils
6   6    5000-5199  Wholesale Trade                                    5000   5199  Shops
7   7    5200-5999  Retail Trade                                       5200   5999
8   8    6000-6799  Finance, Insurance and Real Estate                 6000   6799
9   9    7000-8999  Services                                           7000   8999
10  10   9100-9729  Public Administration                              9100   9729
11  11   9900-9999  Nonclassifiable                                    9900   9999  Utils
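
One thing to watch with the thousands bucket: it only pairs a code with ranges whose Start has the same thousands digit, so a code such as 3500 (inside 2000-3999) would not be matched. A possible alternative, sketched below with made-up sample frames, is to build a pandas IntervalIndex from Start and End and look each code up in it. This assumes the ranges do not overlap.

```python
import numpy as np
import pandas as pd

# Small illustrative frames; the values are samples, not the full tables.
df1 = pd.DataFrame({
    "Sector": ["Mining", "Manufacturing", "Nonclassifiable"],
    "Start": [1000, 2000, 9900],
    "End": [1499, 3999, 9999],
})
df2 = pd.DataFrame({"SICCode": [1230, 3500, 9997, 1850]})

# One closed interval [Start, End] per DF1 row; the ranges must not overlap.
intervals = pd.IntervalIndex.from_arrays(df1["Start"], df1["End"], closed="both")

# get_indexer gives the position of the containing interval, or -1 if none.
pos = intervals.get_indexer(df2["SICCode"].to_numpy())

# Look up the sector by position; codes outside every range become None.
sectors = df1["Sector"].to_numpy()
df2["Sector"] = np.where(pos >= 0, sectors[pos], None)
```

This writes the DF1 sector name into DF2, which is the direction the question asks for, and leaves codes that fall in a gap between ranges (1850 here) empty.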