I would like to reorder my dataframe based on certain conditions
My original dataframe looks like
Merchant name original_rnk
Boohoo 1
PRETTYLITTLETHING 2
ASOS US 3
PRINCESS POLLY 4
URBAN OUTFITTERS 5
KIM ONO 6
And there is a reference dataframe that has some merchant information
Merchant name order_cnt profit epc
Boohoo 200 30 0.6
PRETTYLITTLETHING 100 -60 -0.4
ASOS US 50 100 1.0
PRINCESS POLLY 80 50 0.8
URBAN OUTFITTERS 120 -20 -0.1
KIM ONO 500 90 0.7
I would like to give a new rank to these merchant based on their epc if their order_cnt >= 100 and profit >=0. The first merchant will always be the first no matter how much its order_cnt and profit are, but for the rest whose order_cnt <100 or profit <0, r their original order.
So my desired output is
Merchant name new_rnk original_rnk
Boohoo 1 1
PRETTYLITTLETHING 3 2
ASOS US 4 3
PRINCESS POLLY 5 4
URBAN OUTFITTERS 6 5
KIM ONO 2 6
CodePudding user response:
Using the data provided in the question:
info = pd.DataFrame({
'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM ONO'],
'order_cnt': [200, 100, 50, 80, 120, 500],
'profit': [30, -60, 100, 50, -20, 90],
'epc': [0.6, -0.4, 1.0, 0.8, -0.1, 0.7]
})
Split the data into the first row (head), the rows that satisfy the condition (mask) and the rows that don't (pos and neg):
head = info.head(1)
tail = info.iloc[1:]
mask = tail.eval('order_cnt >= 100 and profit >= 0')
pos = tail[mask]
neg = tail[~mask]
Sort the positive rows using the desired criteria (epc) and concatenate the three partitions back together:
df = pd.concat([head, pos.sort_values('epc', ascending=False), neg])
To get output as presented in the original question (with both the original and the new ranks and sorted by the original rank) add these lines:
df['new_rank'] = range(1, 7)
df['original_rank'] = df['Merchant name'].map(ranks.set_index('Merchant name')['original_rnk'])
df.sort_values('original_rank')[['Merchant name', 'new_rank', 'original_rank']]
where ranks
is the "original data frame":
ranks = pd.DataFrame({
'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM ONO'],
'original_rnk': range(1, 7)
})
CodePudding user response:
This can be done as follows:
- Initializing your first DataFrame with the name:
df
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Merchant name' : ['Boohoo','PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM ONO', 'addedMerch'],
'original_rnk': [1, 2, 3, 4, 5, 6, 7]
})
-note I have added data of an additional merchant named: 'addedMerch' for making sure later the code works
- Initializing your second "reference" DataFrame with the name:
df2
df2 = pd.DataFrame({
'Merchant name' : ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM ONO','addedMerch'],
'order_cnt': [200, 100, 50, 80, 120, 500, 1000],
'profit': [30, -60, 100, 50, -20, 90, 1000],
'epc': [0.6 ,-0.4, 1.0, 0.8, -0.1, 0.7,0.01]
})
- Renamed column label from
'Merchant name'
to'merchant_name'
in both DataFrames
df.rename(columns={'Merchant name': 'merchant_name'}, inplace=True)
df2.rename(columns={'Merchant name': 'merchant_name'}, inplace=True)
- Removing the first merchant temporarily from both DataFrames and storing his data to append it to the the DataFrames again after the operation
#storing it first
firstMerchant = df.query("original_rnk == 1")
firstMerchantInDF2 = df2.query("merchant_name == '{}'".format(firstMerchant.iloc[0]['merchant_name']))
#then removing it
df.drop(firstMerchant.index, axis="rows", inplace=True)
df2.drop(firstMerchantInDF2.index, axis="rows", inplace=True)
- Creating a new DataFrame for merchants that will have a newly assigned rank
merchantsWithNewRank = df2.query("order_cnt >= 100 & profit >= 0")
- Giving them their new rank according to their 'epc' value
#this is done by first, sorting them by 'epc'
merchantsWithNewRank.sort_values(by="epc", ascending=False)
# and then second, adding a new column: 'new_rnk', assigning it to a range
# starting from 2 (because 1 is for the 'untouched first merchant')
# all the way to (the amount of all merchants with new rank 2)
# -to make sure we are giving exactly each merchant his rank-
merchantsWithNewRank['new_rnk'] = np.arange(2,merchantsWithNewRank.shape[0] 2)
- Initialize the column 'new_rnk' in first DataFrame, holding original ranks but pushed n places (each increased by n), where n is the amount of merchants with new rank
df["new_rnk"] = df.original_rnk merchantsWithNewRank.shape[0]
- Assign new ranks for merchants that were qualified to hold a new rank in the original DataFrame (using DF:
'merchantsWithNewRank'
as our reference)
for index in merchantsWithNewRank.index:
df.loc[df.merchant_name==merchantsWithNewRank['merchant_name'][index], 'new_rnk'] = merchantsWithNewRank['new_rnk'][index]
- Bring back the first merchant's data and give him his new rank of
1
df = firstMerchant.append(df)
df.loc[df.original_rnk == 1, 'new_rnk'] = 1
- Bring him back to
df2
too
df2 = firstMerchantInDF2.append(df2)
The result for the above:
-row "addedMerch" is added for double checking the results are correct
[in]: print(df)
[out]:
merchant_name original_rnk new_rnk
0 Boohoo 1 1.0
1 PRETTYLITTLETHING 2 4.0
2 ASOS US 3 5.0
3 PRINCESS POLLY 4 6.0
4 URBAN OUTFITTERS 5 7.0
5 KIM ONO 6 2.0
6 addedMerch 7 3.0
Output for input you provided (without 'addedMerch'):
[in]: print(df)
[out]:
merchant_name original_rnk new_rnk
0 Boohoo 1 1.0
1 PRETTYLITTLETHING 2 3.0
2 ASOS US 3 4.0
3 PRINCESS POLLY 4 5.0
4 URBAN OUTFITTERS 5 6.0
5 KIM ONO 6 2.0
CodePudding user response:
You can use the following code for the desired output;
import pandas as pd
original_rank_frame = pd.DataFrame({'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM ONO'],
'original_rnk': [1, 2, 3, 4, 5 ,6]})
reference_frame = pd.DataFrame({'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM ONO'],
'order_cnt': [200, 100, 50, 80, 120, 500],
'profit': [30, -60, 100, 50, -20, 90],
'epc': [0.6, -0.4, 1.0, 0.8, -0.1, 0.7]})
final_table = pd.concat([reference_frame[((reference_frame['order_cnt'] >= 100) & (reference_frame['epc'] >= 0))],
reference_frame[~((reference_frame['order_cnt'] >= 100) & (reference_frame['epc'] >= 0))]], axis=0)
final_table = final_table.reset_index().rename({'index':'original_rnk'}, axis = 'columns').reset_index().rename({'index':'new_rnk'}, axis = 'columns')[['Merchant name', 'new_rnk', 'original_rnk']]
final_table[['new_rnk', 'original_rnk']] = 1
final_table.sort_values('original_rnk')
Output
Merchant name new_rnk original_rnk
0 Boohoo 1 1
2 PRETTYLITTLETHING 3 2
3 ASOS US 4 3
4 PRINCESS POLLY 5 4
5 URBAN OUTFITTERS 6 5
1 KIM ONO 2 6
Explanation
The first step is to filter the dataframe by desired qualities reference_frame[((reference_frame['order_cnt'] >= 100) & (reference_frame['epc'] >= 0)). Since these are mutually exclusive sets, we can use the negation (~) to get the rest.Then, we concat these two dataframes and extract the original index. We assign a new index by resetting. In the last step, we increment the index values since they start with 0 but the desired output from 1.