To explain the question better:
I have 2 dataframes:
DF1 (master):
    CodeRange  Sector                                             Start  End
0   0100-0999  Agriculture, Forestry and Fishing                  0100   0999
1   1000-1499  Mining                                             1000   1499
2   1500-1799  Construction                                       1500   1799
3   1800-1999  not used                                           1800   1999
4   2000-3999  Manufacturing                                      2000   3999
5   4000-4999  Transportation, Communications, Electric, Gas ...  4000   4999
6   5000-5199  Wholesale Trade                                    5000   5199
7   5200-5999  Retail Trade                                       5200   5999
8   6000-6799  Finance, Insurance and Real Estate                 6000   6799
9   7000-8999  Services                                           7000   8999
10  9100-9729  Public Administration                              9100   9729
11  9900-9999  Nonclassifiable                                    9900   9999
and DF2:
   SICCode  Sector
0  1230     Agro
1  4974     Utils
2  5120     shops
3  9997     Utils
In DF1, I was able to split the "CodeRange" column values into 2 columns ("Start" and "End") and convert them to int.
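For anyone following along, the split described above can be sketched roughly like this. The two-row frame is a made-up excerpt of DF1 for illustration only, and the intermediate name `bounds` is my own:

```python
import pandas as pd

# Hypothetical two-row excerpt of DF1, for illustration only
df1 = pd.DataFrame({
    "CodeRange": ["0100-0999", "1000-1499"],
    "Sector": ["Agriculture, Forestry and Fishing", "Mining"],
})

# Split "0100-0999" on the hyphen into two string columns, then cast both to int
bounds = df1["CodeRange"].str.split("-", expand=True).astype(int)
df1["Start"] = bounds[0]
df1["End"] = bounds[1]

print(df1[["CodeRange", "Start", "End"]])
```

Note that the cast to int drops leading zeros, so "0100" becomes 100.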
For each SICCode in DF2, I want to find the range in DF1 that it falls into and replace the "Sector" value in DF2 with the corresponding "Sector" value from DF1.
The final DF2 should look like:
DF2:
   SICCode  Sector
0  1230     Mining
1  4974     Transportation, Communication...
2  5120     Wholesale Trade
3  9997     Nonclassifiable
CodePudding user response:
This could probably be optimised with boolean masks, but the following achieves what you want:
data = []
for i in range(len(df2)):
    code = df2["SICCode"].iloc[i]
    for j in range(len(df1)):
        start = df1["Start"].iloc[j]
        end = df1["End"].iloc[j]
        if start <= code <= end:
            data.append(df1["Sector"].iloc[j])
            break  # stop scanning ranges and move to the next i
    else:
        data.append(None)  # no range contains this code
df2["Sector"] = data
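As a rough sketch of the mask idea mentioned above (not necessarily what the answerer had in mind), NumPy broadcasting can compare every code against every range at once. The small frames below are made-up excerpts, and the code 8000 is a hypothetical value added to show the no-match case:

```python
import numpy as np
import pandas as pd

# Hypothetical excerpts of DF1 and DF2, for illustration only
df1 = pd.DataFrame({
    "Sector": ["Mining", "Wholesale Trade", "Nonclassifiable"],
    "Start": [1000, 5000, 9900],
    "End": [1499, 5199, 9999],
})
df2 = pd.DataFrame({"SICCode": [1230, 5120, 9997, 8000]})

codes = df2["SICCode"].to_numpy()[:, None]  # shape (n_codes, 1)
starts = df1["Start"].to_numpy()            # shape (n_ranges,)
ends = df1["End"].to_numpy()

# mask[i, j] is True when code i lies inside range j (both ends inclusive)
mask = (codes >= starts) & (codes <= ends)

# Position of the first matching range per code. argmax returns 0 for
# all-False rows, so codes without a match are blanked out separately.
first = mask.argmax(axis=1)
sectors = df1["Sector"].to_numpy()[first]
df2["Sector"] = np.where(mask.any(axis=1), sectors, None)

print(df2)
```

This builds an n_codes x n_ranges boolean array, so it trades memory for speed; with a dozen ranges that should not matter.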
CodePudding user response:
A slightly more compact solution, without loops.
The key is to create a helper column 'start_idx' by integer-dividing the numbers by 1000, which gives both frames a common key to merge on. After the merge, we check whether the SICCode lies within the range; where it does not, the division is left blank.
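import numpy as np
# df is DF1 (the master frame); bucket both frames by thousands to get a merge key
df3 = df.assign(start_idx=(df['Start'] // 1000).astype(int)).merge(
    df2.assign(start_idx=(df2['SICCode'] // 1000).astype(int)),
    on='start_idx', how='left')
# keep DF2's sector only where SICCode lies within [Start, End], else blank
df3['Divison'] = np.where((df3['SICCode'] >= df3['Start']) &
                          (df3['SICCode'] <= df3['End']), df3['Sector_y'], "")
df3.drop(columns=['start_idx', 'x_y', 'SICCode', 'Sector_y'])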
    x_x  CodeRange  Sector_x                                           Start  End   Divison
0   0    0100-0999  Agriculture, Forestry and Fishing                  100    999
1   1    1000-1499  Mining                                             1000   1499  Agro
2   2    1500-1799  Construction                                       1500   1799
3   3    1800-1999  not used                                           1800   1999
4   4    2000-3999  Manufacturing                                      2000   3999
5   5    4000-4999  Transportation, Communications, Electric, Gas ...  4000   4999  Utils
6   6    5000-5199  Wholesale Trade                                    5000   5199  Shops
7   7    5200-5999  Retail Trade                                       5200   5999
8   8    6000-6799  Finance, Insurance and Real Estate                 6000   6799
9   9    7000-8999  Services                                           7000   8999
10  10   9100-9729  Public Administration                              9100   9729
11  11   9900-9999  Nonclassifiable                                    9900   9999  Utils
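One thing to watch with the thousands-bucket merge: it only pairs a code with ranges whose Start has the same thousands digit, so a code such as 3500 (bucket 3) would never meet the 2000-3999 row (bucket 2). A possible alternative, sketched here as a swapped-in technique rather than a fix to the code above, is a `pd.IntervalIndex` lookup that writes the DF1 sector back into DF2. The frames are made-up excerpts, and the codes 3500 and 9800 are hypothetical values chosen to show the wide-range and no-match cases:

```python
import pandas as pd

# Hypothetical excerpts of DF1 and DF2, for illustration only
df1 = pd.DataFrame({
    "Sector": ["Mining", "Manufacturing", "Wholesale Trade"],
    "Start": [1000, 2000, 5000],
    "End": [1499, 3999, 5199],
})
df2 = pd.DataFrame({"SICCode": [1230, 3500, 5120, 9800]})

# One closed interval [Start, End] per DF1 row; the ranges must not overlap
intervals = pd.IntervalIndex.from_arrays(df1["Start"], df1["End"], closed="both")

# Position of the interval containing each code, or -1 when none contains it
pos = intervals.get_indexer(df2["SICCode"])

sectors = df1["Sector"].to_numpy()
df2["Sector"] = [sectors[p] if p != -1 else None for p in pos]

print(df2)
```

This assumes the ranges in DF1 never overlap, which holds for the table in the question.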