Supposed, I have Pandas DataFrame looks like below:
Cluster | Variable | Group | Ratio | Value |
---|---|---|---|---|
1 | GDP_M3 | GDP | 20% | 70% |
1 | HPI_M6 | HPI | 40% | 80% |
1 | GDP_lg2 | GDP | 35% | 50% |
2 | CPI_M9 | CPI | 10% | 50% |
2 | HPI_lg6 | HPI | 15% | 65% |
3 | CPI_lg12 | CPI | 15% | 90% |
3 | CPI_lg1 | CPI | 20% | 95% |
I would like to rank Variable
based on Ratio
and Value
in the separated columns. The Ratio
will rank from the lowest to the highest, while the Value
will rank from the highest to the lowest.
There are some variables that I do not want to rank. In the example, I do not prefer CPI
. Any type of CPI
will not be considered for the rank e.g., CPI_M9
. However, the case will be expected only if there is only that particular variable in the Cluster
.
The results from condition above will look like the table below:
Cluster | Variable | Group | Ratio | Value | RankRatio | RankValue |
---|---|---|---|---|---|---|
1 | GDP_M3 | GDP | 20% | 70% | 1 | 2 |
1 | HPI_M6 | HPI | 40% | 80% | 3 | 1 |
1 | GDP_lg2 | GDP | 35% | 50% | 2 | 3 |
2 | CPI_M9 | CPI | 10% | 50% | NaN | NaN |
2 | HPI_lg6 | HPI | 15% | 65% | 1 | 1 |
3 | CPI_lg12 | CPI | 15% | 90% | 1 | 2 |
3 | CPI_lg1 | CPI | 20% | 95% | 2 | 1 |
For Cluster 1, the GDP_M3
has the lowest Ratio
at 20%
, while the HPI_M3
has the highest Value
at 80%
. Thus, both of them will be assigned rank 1 and the others will be followed subsequently.
For Cluster 2, even CPI_M9
has the lowest Ratio but the CPI
is not prefer. Thus, the rank 1 will be assigned to HPI_lg6
.
For Cluster 3, there are variables from the only CPI
Group and there is no other options to rank. Thus, the CPI_lg12
and CPI_lg1
are ranked based on the lowest Ratio
and the highest Value
.
df['RankRatio'] = df.groupby(['Cluster'])['Ratio'].rank(method = 'first', ascending = True)
df['RankValue'] = df.groupby(['Cluster'])['Value'].rank(method = 'first', ascending = False)
I have some code that can be handled only general case but for specific case with unprefer group of variables, my code cannot handle it.
Please help or suggest on this. Thank you.
CodePudding user response:
Use:
#convert columns to numeric
df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)
Remove row with CPI
by condition - test rows if no only CPI
per Cluster
:
m = df['Group'].eq('CPI')
m1 = ~df['Cluster'].isin(df.loc[m, 'Cluster']) | m
df['RankRatio'] = df[m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)
print (df)
Cluster Variable Group Ratio Value RankRatio RankValue
0 1 GDP_M3 GDP 20.0 70.0 1.0 2.0
1 1 HPI_M6 HPI 40.0 80.0 3.0 1.0
2 1 GDP_lg2 GDP 35.0 50.0 2.0 3.0
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN
4 2 HPI_lg6 HPI 15.0 65.0 1.0 1.0
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0
How it working:
For mask2
are filter all Cluster
values if match mask1
and filtered original column Cluster
, then invert mask by ~
. Last chain both conditions by |
for bitwise OR
for all rows without CPI
if exist with another values per Cluster
:
print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[m, 'Cluster']), both = m1))
Cluster Variable Group Ratio Value mask1 mask2 both
0 1 GDP_M3 GDP 20.0 70.0 False True True
1 1 HPI_M6 HPI 40.0 80.0 False True True
2 1 GDP_lg2 GDP 35.0 50.0 False True True
3 2 CPI_M9 CPI 10.0 50.0 True False True
4 2 HPI_lg6 HPI 15.0 65.0 False False False
5 3 CPI_lg12 CPI 15.0 90.0 True False True
6 3 CPI_lg1 CPI 20.0 95.0 True False True
EDIT:
df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)
m = df['Group'].isin(['CPI','HPI'])
m2 = df.groupby('Cluster')['Group'].transform('nunique').ne(1)
m1 = (~df['Cluster'].isin(df.loc[~m, 'Cluster']) | m) & m2
df['RankRatio'] = df[~m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[~m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)
print (df)
Cluster Variable Group Ratio Value RankRatio RankValue
0 1 GDP_M3 GDP 20.0 70.0 1.0 1.0
1 1 HPI_M6 HPI 40.0 80.0 NaN NaN
2 1 GDP_lg2 GDP 35.0 50.0 2.0 2.0
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN
4 2 HPI_lg6 HPI 15.0 65.0 NaN NaN
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0
print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[~m, 'Cluster']), m2=m2, all = ~m1))
Cluster Variable Group Ratio Value RankRatio RankValue mask1 mask2 \
0 1 GDP_M3 GDP 20.0 70.0 1.0 1.0 False False
1 1 HPI_M6 HPI 40.0 80.0 NaN NaN True False
2 1 GDP_lg2 GDP 35.0 50.0 2.0 2.0 False False
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN True True
4 2 HPI_lg6 HPI 15.0 65.0 NaN NaN True True
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0 True True
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0 True True
m2 all
0 True True
1 True False
2 True True
3 True False
4 True False
5 False True
6 False True