Pandas - Sorting by 2 columns and comparing the values in the other column-CodePudding

Please consider this data Frame:

pd.DataFrame({
'REGION':['US','US','CAN','CAN', 'EU','EU','EU'],
'ROLE': ['mgr','dir','mgr','dir','mgr','dir','CEO'],
'SALARY' : [4,5,3.7,6,4.1,5.5,8],
'other_columns':['random_val1','random_val2','random_val3','random_val4','random_val5','random_val6','random_val7']

})

In this data frame, we have two regions, and in each region multiple employee roles. The salary column contains salary for that role in that region. Assume that all salary numbers are have the same currency.

Now, I would like to make sure that for any ROLE, the salary in CAN region must be at least as much as that in the US - and the salary in EU must be at least as much as that in CAN.

How do I solve it so that I get the following data frame?

pd.DataFrame({
'REGION':['US','US','CAN','CAN', 'EU','EU','EU'],
'ROLE': ['mgr','dir','mgr','dir','mgr','dir','CEO'],
'SALARY' : [4,5,4,6,4.1,6,8],
'other_columns':['random_val1','random_val2','random_val3','random_val4','random_val5','random_val6','random_val7']

})

Please note that this is a sample data frame - in the real data frame, I have a few additional columns that I would like to keep unchanged. Thanks!

CodePudding user response：

import pandas as pd

data = pd.DataFrame({
'region':['US','US','CAN','CAN', 'EU','EU','EU'],
'role': ['mgr','dir','mgr','dir','mgr','dir','CEO'],
'salary' : [4,5,3.7,6,4.1,5.5,8],
'other_columns':['random_val1','random_val2','random_val3','random_val4','random_val5','random_val6','random_val7']})

pt = pd.pivot_table(data, values=['salary'], index=['role'], columns=['region'])

df = pt['salary'].fillna(0)

df['CAN'] = df.apply(lambda x: max(x['US'], x['CAN']), axis=1)
df['EU'] = df.apply(lambda x: max(x['CAN'], x['EU']), axis=1)

data['salary'] = data.apply(lambda x: df[x['region']][x['role']], axis=1)

print(data)

CodePudding user response：

Solution by mapping and slicing in MultiIndex, for set values i use Series.clip:

df = df.set_index(['REGION','ROLE'])
df1 = df.copy()

us = df.loc['US', 'SALARY']
can = df.loc['CAN', 'SALARY']
eu = df.loc['EU', 'SALARY']

df.loc['CAN', 'SALARY'] = can.clip(lower=can.index.map(us)).to_numpy()
df.loc['EU', 'SALARY'] = eu.clip(lower=eu.index.map(can)).to_numpy()
df = df.fillna(df1).reset_index()
print (df)
  REGION ROLE  SALARY other_columns
0     US  mgr     4.0   random_val1
1     US  dir     5.0   random_val2
2    CAN  mgr     4.0   random_val3
3    CAN  dir     6.0   random_val4
4     EU  mgr     4.1   random_val5
5     EU  dir     6.0   random_val6
6     EU  CEO     8.0   random_val7

Another solution woth pivoting and unpivot:

df1 = df.pivot('ROLE','REGION','SALARY')
df1['CAN'] = df1[['CAN','US']].max(axis=1)
df1['EU'] = df1[['CAN','EU']].max(axis=1)
   
df = df.join(df1.stack().rename('new'), on=['ROLE','REGION'])
df['SALARY'] = df.pop('new')
print (df)
  REGION ROLE  SALARY other_columns
0     US  mgr     4.0   random_val1
1     US  dir     5.0   random_val2
2    CAN  mgr     4.0   random_val3
3    CAN  dir     6.0   random_val4
4     EU  mgr     4.1   random_val5
5     EU  dir     6.0   random_val6
6     EU  CEO     8.0   random_val7

CodePudding user response：

Another solution using groupby and cummax. I like this method because you can extend the number of regions you need to support relatively easily by adding additional regions to the custom sorting order.

df = pd.DataFrame({
'REGION':['US','US','CAN','CAN', 'EU','EU','EU'],
'ROLE': ['mgr','dir','mgr','dir','mgr','dir','CEO'],
'SALARY' : [4,5,3.7,6,4.1,5.5,8],
'other_columns':['random_val1','random_val2','random_val3','random_val4','random_val5','random_val6','random_val7']})

# Replace the region with a categorical variable to ensure sorting order is US, CAN, EU
df["REGION"] = pd.Categorical(df["REGION"], ["US", "CAN", "EU"])
df = df.sort_values(["ROLE", "REGION"])
df = df.groupby("ROLE").apply(lambda x: x.assign(SALARY=x["SALARY"].cummax()))
# if you need your data in the original order again
df = df.sort_index()